ECON 11020 FALL 2020
INTRODUCTION TO ECONOMETRICS
PABLO A. PEÑA
UNIVERSITY OF CHICAGO
pablo@uchicago.edu
1 Introduction
This manuscript constitutes the notes for the course. Why not just use an Econometrics
textbook? Because I want to make my teaching more effective and your learning more efficient.
In my experience, textbooks put too much emphasis on how the methods work instead of on how
those methods can be used. To use an analogy, textbooks are like a book about microwave ovens
explaining how electricity is transformed into microwaves, and how microwaves excite water
molecules inside food. A person using a microwave oven doesn’t need to know any of that to make good use of it. If the person knows what materials not to put inside the oven, and the approximate relationship between cooking times and power, she can make the most out of the
oven safely and efficiently. The point here is that econometrics textbooks explain way more about
the methods than a regular practitioner needs to know in order to put those methods to good use.
These notes emphasize how regressions can be used.
There is a second argument against using econometric textbooks in this course. Since they are
written by academics, textbooks are biased towards the types of problems academics care the
most about—namely, establishing causal relationships. The type of questions academics
passionately pursue are only a subset of the type of questions practitioners are interested in. In
my experience, practitioners try to establish causal relationships less than 10% of the time. So,
these notes give a more general perspective of what can be done with regressions.
A third argument for rethinking how econometrics is taught to practitioners is the growth in computing capacity. Back in the day—so I am told—running a regression was costly. A person had to punch holes in a card to input data into a computer and wait for a chance to
use a computer. People thought hard before running a regression. The fact that now we can draw
samples of our data thousands of times in a matter of minutes—if not seconds—has made feasible
the use of newer methods that are free from many unverifiable assumptions made in classic
theory. Computing power not only made empirical analysis easier. It also changed what we think
are more appropriate methods in practice. These notes will discuss some of those newer methods.
A fourth argument is the format. Textbooks require the reader to know matrix algebra and
probability theory. That depletes the attention of students. Keeping an eye on matrices or random
variable probability distributions prevents them from focusing on what really matters. Here we
keep those definitions to a minimum.
Lastly, the title of these notes could be “Empirical Analysis for Business Economics.” Each
part of that alternative title would illustrate an important aspect of what you will learn.
“Empirical” means we will refer to data collected in the real world. “Analysis” refers to the
statistical tools we will use. “Business” means we will consider practical questions of the kind
real organizations face out there. Lastly, “Economics” means that throughout we will keep our
perspective as economists—we are neither mathematicians nor statisticians. In sum, you will
learn how to apply statistical tools to data in order to answer relevant questions from an economic
perspective.
1.1 Primacy of the question
A frequent answer to many questions in this course is “it depends on the question you want
to answer.” As a general rule, the methods we use must fit the question at hand. The question
should be carefully examined before jumping to figuring out how to answer it. In my experience, pounding on a question is essential. Paraphrasing it multiple times and thinking about what it is and what it isn’t may be of great help. That is what I mean by the primacy of the question.
Assume there is an arm-wrestling tournament with 149 participants. When two contestants face each other, the winner advances and the loser is out. What is the total number of matches in the tournament? Perhaps you had the impulse—like me—to create in your mind a bracket structure, adding rounds to reach a number close to 149. You can proceed that way and find the solution. But it’ll take time and it won’t be general. What if the number of participants is 471 or 7,733? Carefully examining the setup of the question may give you the path of least resistance to the answer. Here is a crucial piece of information: the tournament ends when all but one of the participants are out. In every match, exactly one participant is out. If there are N participants, there must be a total of N − 1 matches. This is a quick and general answer. It may take time to figure it out, but
once you do it, you can apply it more generally, to any number of participants.
In our context, we will always refer back to the question we have in mind. Sometimes the
question is impossible to answer. Some other times the answer is obvious, and no analysis is
needed. Most frequently, the question requires some polishing to be properly (or at least
reasonably) answered.
1.2 The cake structure
Our departure point is the origin of the term regression and the method it represents. Once we
establish the general idea of what a regression is, we will proceed according to what we can call
the three-layer cake structure. The first layer is the mathematics of regression, and it is about how
we compute a regression. It has the most formulas. The second layer is the probability content of
the regression results. In this layer we will talk about why we expect results to vary and what
information they convey. The third layer is the economics of regressions. We will learn the uses
and interpretations of the results.
The economics of regressions (the third layer) can be split into three slices that correspond to
three distinct uses: descriptive, predictive and prescriptive. In the descriptive use, regressions are
used to measure relationships accounting for other factors. This is useful when trying to judge to
what extent two things move together or not, or when comparing averages under equal circumstances. The predictive use is about knowing what to expect. We will explain how
predictions are different from forecasts. The third use is prescription, and we will decompose it
into the most common methods used by practitioners: randomized control trials, regression
discontinuity designs, and difference-in-differences.
2 Regression
A fascinating question in natural and social sciences is the extent to which parental traits are
transmitted to children. In the late 19th century, Francis Galton worked on this topic. He
conducted surveys and collected data to analyze the relationship across many variables. One of
them was height. Using families with adult children, Galton computed the “mid-parent height”
(the average height of mother and father) and plotted it against the height of children. A stylized
version of the chart he produced is below. The horizontal axis represents parental height. The
vertical axis represents children’s height. The first thing to note is that there is a cloud of points.
The second is that there seems to be a positive relation. On average, taller parents have taller
children, and shorter parents have shorter children. How can we summarize this relationship?
Galton modelled the height of children as a linear function of parental height plus an error term. If $y_i$ and $x_i$ denote the heights of person $i$ and her parents, respectively, then Galton assumed $y_i = \alpha + \beta x_i + \varepsilon_i$. Galton came up with a straight line that best fitted the cloud of points. If the straight line has a positive slope, it means the relation is positive (tall parents have tall children). If the slope is negative, it means the relation is negative (tall parents have short children). Lastly, if the slope is zero, then tall and short parents have children of similar stature. The graph below depicts the line that best fits Galton’s data (in blue), and also a line with a slope of one as a benchmark. In the next section, we will discuss at length how Galton came up with that line. Put very simply, we pick the intercept and the slope that minimize the sum of the squared vertical distances between the points in the cloud and the line.
Galton found a positive relationship between the height of parents and children but didn’t
stop there. After all, it is evident that tall parents tend to have tall children. What interested Galton
the most was whether the slope of the line he produced was greater or smaller than one. If the
slope is greater than one it means differences in height across parents become even larger
differences in height across their children. Alternatively, if the slope is smaller than one,
differences in height across parents become smaller differences in the next generation.
Galton found that the slope is smaller than one. Therefore, having a tall ancestor doesn’t
matter much for the descendants. After a few generations, we expect the height of any family to
get closer to the mean. This process was called “regression toward mediocrity,” and now we call it
“regression to the mean.” The term regression was originally the description of this particular
result. It later became the name of the method used to find that result.
Today, regression is the workhorse of empirical economists. There is a wide variety of
regression models, but they all share the same essence. They all produce numbers that summarize
the relationships among variables in the data.
To analyze how regressions are used by practitioners, we will proceed according to our three-layer cake structure. The first layer is given by the mathematical aspects of regressions. Put bluntly, the mathematical layer has no probabilistic or economic content. This is very much all algebra. But first, a short introduction to how regressions look in practice.
3 The mathematics of regression
3.1 The basics
3.1.1 A cloud of points
Our starting point is data. Usually, we think of data in tables or spreadsheets. We can
generally think of any data set as being organized in “variables” and “observations.” For instance,
if we have the expenditures in a given month of a group of customers at an online store, the unit
of observation may be the customer, and each customer would constitute one observation. In
addition to the expenditures, the information we have for each customer may include number of
purchases, number of returns, shipping costs, as well as age of the customer, gender, and zip
code. Those features of behavior and customer traits would be our variables. Each variable could
take different values, which could be continuous (like amounts in dollars or age in days) or
categorical (like gender or age in five-year ranges).
In the case of Galton, each adult child constitutes an observation, and his or her height
together with the height of his or her parents constitute our two variables. For instance, if we
have:
Person        Height of the person    Height of the parents
Robert        1.82                    1.75
Anne          1.73                    1.79
Cristopher    1.78                    1.74
Laura         1.69                    1.70
Charles       1.80                    1.80
We can think more generally in terms of variables x and y, and the subscript i:
๐‘–
๐‘–=1
๐‘ฆ
๐‘ฅ
๐‘ฆ1 = 1.82
๐‘ฅ1 = 1.75
๐‘–=2
๐‘ฆ2 = 1.73
๐‘ฅ2 = 1.79
๐‘–=3
๐‘ฆ3 = 1.78
๐‘ฅ3 = 1.74
๐‘–=4
๐‘ฆ4 = 1.69
๐‘ฅ4 = 1.70
๐‘–=5
๐‘ฆ5 = 1.80
๐‘ฅ5 = 1.80
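To make the toy data concrete, here is a minimal sketch in Python of the five observations above stored as arrays (the variable names are just illustrative):

```python
import numpy as np

# Heights of five adult children (y) and their parents' heights (x), in meters
y = np.array([1.82, 1.73, 1.78, 1.69, 1.80])  # height of the person
x = np.array([1.75, 1.79, 1.74, 1.70, 1.80])  # height of the parents

# Each index i = 1, ..., 5 is one observation (Robert, Anne, Cristopher, Laura, Charles)
for i, (yi, xi) in enumerate(zip(y, x), start=1):
    print(f"i={i}: y_{i}={yi}, x_{i}={xi}")
```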
The variables in our data are not a chaotic mass of information. We usually have something in mind that provides them with structure. We usually think that one variable is a function of the other variables. For instance, Galton thought of children’s height as a function of parental height, or $y_i = f(x_i)$. In math, we usually plot the value of the function on the vertical axis and the argument of the function on the horizontal axis. Thus, in the chart of height we saw before, parental height is on the horizontal axis and children’s height is on the vertical axis.
For the purpose of understanding how regressions work, we will think of our data visually, as a cloud of points, like in Galton’s problem. The height of the points in the cloud represents the height of children, whereas the location of the points along the horizontal axis represents parental height. The height of children is a function of the height of parents.
3.1.2 A model
In general, in a regression model like
$$y_i = \alpha + \beta x_i + \varepsilon_i,$$
we refer to $y_i$ as the dependent variable or left-hand side variable. We refer to $x_i$ as the regressor, the explanatory variable, the independent variable, or the right-hand side variable. Lastly, we usually refer to $\varepsilon_i$ (the Greek letter epsilon) as the error term or idiosyncratic shock. Notice the subscript $i$, which denotes the observation. In contrast, $x$ and $y$ denote variables, whereas $x_i$ and $y_i$ denote the values that those two variables take in the case of observation $i$.
The error term catches anything else unaccounted for in the model. In the case of Galton’s analysis of height, for instance, it could include phenotypic differences across individuals or malnourishment of some children or parents.
In the model above, $y_i$ is expressed as a linear function of $x_i$ and an error term. Thus, $\alpha$ (the Greek letter alpha) is the intercept and $\beta$ (the Greek letter beta) is the slope. The slope is interpreted as the expected change in $y$ associated with a one-unit change in $x$. Mathematically, the slope is the partial derivative of $y$ with respect to $x$:
$$\frac{\partial y}{\partial x} = \frac{\partial}{\partial x}(\alpha + \beta x) = \beta$$
At the same time, we can interpret the model in terms of what we would expect. Suppose that we are told the explanatory variable takes a value of $x'$. What is the corresponding value of the dependent variable that we should expect to observe? The answer is:
$$y' = \alpha + \beta x'$$
These two ways of interpreting the model (the partial derivative and the conditional expectation) are very useful and we will revisit them multiple times.
3.1.3 The minimization problem
In a nutshell, a regression simply fits a cloud of points with a straight line. To do that, it minimizes the sum of the squared vertical distances between the points in the cloud and the line. Notice that
it doesn’t minimize the vertical distance (we use the square) or the square of the distance (the vertical part is crucial). Mathematically, to find the straight line that minimizes the sum of the squared vertical distances to the points in the cloud, we set up the following problem:
$$\min_{a,b} \sum_{i=1}^{N} (y_i - a - b x_i)^2$$
We take first-order conditions with respect to $a$ and $b$:
$$\sum_{i=1}^{N} (y_i - a - b x_i) = 0$$
$$\sum_{i=1}^{N} (y_i - a - b x_i)\, x_i = 0$$
The first-order conditions above produce a linear system with two equations and two unknowns. We can easily find the solution. Let us introduce some useful nomenclature. The solutions are the regression coefficients, and we denote them with a “hat”: $\hat{\alpha}$ and $\hat{\beta}$. The fitted value of $y_i$ is:
$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$
Notice the hat is also on $\hat{y}_i$. The difference between the actual and fitted values of $y_i$ is the residual, which is also denoted with a hat:
$$\hat{\varepsilon}_i = y_i - \hat{y}_i$$
The residual $\hat{\varepsilon}_i$ is an estimate of the error term $\varepsilon_i$. It is convenient to establish the following identities:
$$y_i = \alpha + \beta x_i + \varepsilon_i = \hat{y}_i + \hat{\varepsilon}_i = \hat{\alpha} + \hat{\beta} x_i + \hat{\varepsilon}_i$$
Going back to our minimization problem, it is easy to show that the first-order conditions imply that $\sum_{i=1}^{N}\hat{\varepsilon}_i = 0$ and $\sum_{i=1}^{N}\hat{\varepsilon}_i x_i = 0$. In words, the residuals average zero and the covariance between the residuals and the explanatory variable is zero.
The first-order conditions can be arranged to provide formulas for the regression coefficients:
$$\hat{\beta} = \frac{cov(x,y)}{var(x)}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$
Where ๐‘๐‘œ๐‘ฃ(๐‘ฅ, ๐‘ฆ) represents the covariance between ๐‘ฅ and ๐‘ฆ, ๐‘ฃ๐‘Ž๐‘Ÿ(๐‘ฅ) represents the variance of ๐‘ฅ,
and ๐‘ฅฬ… and ๐‘ฆฬ… are the averages of ๐‘ฅ and ๐‘ฆ, respectively. Notice the similarity between the regression
coefficient ๐›ฝฬ‚ and the correlation coefficient between ๐‘ฅ and ๐‘ฆ, which is usually denoted by ๐œŒ:
๐‘ฃ๐‘Ž๐‘Ÿ(๐‘ฆ)
๐›ฝฬ‚ = ๐œŒ√
๐‘ฃ๐‘Ž๐‘Ÿ(๐‘ฅ)
In other words, ๐›ฝฬ‚ is a re-scaled correlation coefficient. The factor for the re-scaling is a positive
number equal to the ratio of the standard deviation of ๐‘ฆ to the standard deviation of ๐‘ฅ.
Notice that the fitted value $\hat{y}_i$ can be interpreted as the expected value of $y$ conditional on $x$ taking a particular value, say $x = x_i$:
$$E[y \mid x = x_i] = \hat{\alpha} + \hat{\beta} x_i + E[\hat{\varepsilon}] = \hat{\alpha} + \hat{\beta} x_i = \hat{y}_i$$
In general, the regression coefficients can be interpreted as partial correlation coefficients (as in “partial” derivatives), and the fitted values can be interpreted as conditional expectations. In the case of Galton’s regression, $\hat{\beta}$ is interpreted as the difference in child height given a difference of one unit in parental height, and $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ is interpreted as the expected height of a child with parents of height $x_i$. The following chart summarizes these concepts.
The orange points represent our cloud. The green points are the fitted values. They lie on the regression line. The intercept of the line is the coefficient $\hat{\alpha}$. The slope is the coefficient $\hat{\beta}$. The residual is the difference between the actual values and the fitted values.
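As a sketch of how these formulas translate into computation, the following Python snippet estimates $\hat{\alpha}$ and $\hat{\beta}$ from the toy height data above using the covariance and variance formulas (a minimal illustration, not a full econometrics routine):

```python
import numpy as np

y = np.array([1.82, 1.73, 1.78, 1.69, 1.80])
x = np.array([1.75, 1.79, 1.74, 1.70, 1.80])

# beta_hat = cov(x, y) / var(x);  alpha_hat = ybar - beta_hat * xbar
beta_hat = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
alpha_hat = y.mean() - beta_hat * x.mean()

y_fitted = alpha_hat + beta_hat * x   # fitted values
residuals = y - y_fitted              # residuals

print(alpha_hat, beta_hat)
print(residuals.sum())                # approximately 0 (first-order condition)
print((residuals * x).sum())          # approximately 0 (first-order condition)
```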
3.1.4 Multivariate regression
The world of univariate regressions (i.e. regressions with only one explanatory variable) is very simple. However, we rarely run regressions with only one regressor. Most of the time we use regressions with multiple regressors. They are known as multivariate regressions. Multivariate regression models are usually expressed using different letters as variables and different Greek letters as coefficients. For instance:
$$y_i = \alpha + \beta x_i + \gamma w_i + \delta z_i + \varepsilon_i$$
For simplicity, if we have $k$ explanatory variables, we denote them by $x_1, x_2, \ldots, x_k$. Notice that we have $k+1$ regressors:
$$y_i = \beta_0 x_{0i} + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i$$
where $x_{0i} = 1$ for every $i$. In other words, $x_{0i}$ is constant and its coefficient is the intercept. We
can express the regression in matrices and vectors:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} x_{01} & x_{11} & x_{21} & \cdots & x_{k1} \\ x_{02} & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{0N} & x_{1N} & x_{2N} & \cdots & x_{kN} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{bmatrix}$$
$$Y = X\beta + \varepsilon$$
In this case we have a cloud of $N$ points in a $(k+1)$-dimensional space. We want to find the plane or hyper-plane that minimizes the sum of the squared vertical distances to those points. Using matrix notation, we can write the minimization problem as:
$$\min_{\beta}\; (Y - X\beta)'(Y - X\beta)$$
Our $k+1$ first-order conditions can be expressed as:
$$\hat{\beta} = (X'X)^{-1}X'Y$$
This formula involves a series of simple mathematical operations with the data. Keep in mind that the vector $\hat{\beta}$ contains $k+1$ regression coefficients. Our results can be expressed as:
$$\hat{y}_i = \hat{\beta}_0 x_{0i} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \cdots + \hat{\beta}_k x_{ki}$$
We usually have $N > k$ by a lot. Think about what would happen if $k + 1 = N$. It’s very helpful to start with the case of $N = 2$. We would be trying to fit a cloud that consists of only two points with a straight line. The fit would be perfect, and the residuals would be zero. Now extend that idea to $N = 3$. We would try to fit a cloud of three points with a plane. Again, the fit would be perfect. This is a general result. As long as $k + 1 = N$, the fitted values would be equal to the actual values of $y$.
For illustrative purposes, we will use univariate or bivariate regression examples because we
can analyze them graphically. Their intuition extends to the case with more regressors. The graph
below shows a plane fitting a cloud of points in three dimensions (two explanatory variables and
one dependent variable). The plane cuts through the cloud, leaving some points above (in blue)
and other points below (in red).
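A minimal sketch of the matrix formula in Python, using simulated data with made-up coefficients (the data-generating process here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=N)   # known coefficients for the illustration

X = np.column_stack([np.ones(N), x1, x2])             # first column is the constant x_0 = 1
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # beta_hat = (X'X)^(-1) X'Y
print(beta_hat)                                        # close to [1.0, 2.0, -0.5]
```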
3.1.5 Goodness of fit
Remember that we are trying to fit a cloud of points with a linear structure (a line, a plane or a hyperplane). We can always measure how well we do that using the R-square ($R^2$), a measure of goodness of fit. The formula is very simple:
$$R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$$
If our regression model fits the cloud perfectly, then all residuals are equal to zero and the R-square would be equal to one. If, on the contrary, the model is no better than using a flat line or plane with a value of $\bar{y}$, then our regression model would not explain any of the variation in the data, and the R-square would be equal to zero. As you can probably deduce, the R-square is
always between 0 and 1, with 0 representing the worst possible fit (none), and one representing
the perfect fit.
Why is the R-square called that way? One very intuitive way of measuring goodness of fit is to compute the correlation between $\hat{y}$ and $y$. If the model fits the data perfectly, the correlation should be 1. If the model has a very poor fit, the correlation would be close to zero (positive or negative). Let R stand for that correlation. How is the R-square related to R? Well, you probably guessed it by now. The R-square is simply the square of R, that is, the square of the correlation between $\hat{y}$ and $y$.
The R-square is a mathematical concept. It is not informative of the probabilistic or economic
aspects of our regression. High R-squares are not per se better than low R-squares. The relevance of a given degree of goodness of fit depends on the context. Later we will see some examples where the
R-square is not even mentioned (when we try to estimate causal effects) and other examples in
which the R-square is the most important aspect (when we try to predict). We will come back to
discuss goodness of fit as we advance in the course.
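As a quick sketch, the R-square can be computed directly from its formula or as the square of the correlation between $\hat{y}$ and $y$; the simulated data below are made up solely for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

X = np.column_stack([np.ones(N), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

r2_formula = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r2_corr = np.corrcoef(y_hat, y)[0, 1] ** 2   # square of the correlation between y_hat and y
print(r2_formula, r2_corr)                    # the two numbers coincide
```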
3.2 Intermediate concepts
So far, we can say the mechanics of regression are simple. In fact, they are so simple that one
could be tempted to deem regression analysis as “too simplistic.” However, that misses several
points. Here we will go over some of them to give you a taste of the power of regression.
3.2.1 Dummies
Dummy variables (also known as indicator or dichotomous variables) are a very useful type of regressors. A dummy takes a value of 1 if a condition holds true, and 0 if it doesn’t:
$$x_i = \begin{cases} 1 & \text{if the condition holds} \\ 0 & \text{otherwise} \end{cases}$$
Assume our regression model is $y_i = \alpha + \beta x_i + \varepsilon_i$. How do we interpret $\hat{\alpha}$ and $\hat{\beta}$ when $x_i$ is a dummy? The following chart provides some guidance. If $x_i$ is a dummy, then our cloud of points would consist of two columns of points. One would be located over the value $x_i = 0$ and the other would be located over the value $x_i = 1$. No points would lie between $x_i = 0$ and $x_i = 1$. Our regression line would cross both columns. The graph below presents an example.
The resulting intercept and slope can be interpreted in terms of conditional expectations:
$$E[\hat{y} \mid x = 0] = \hat{\alpha}$$
$$E[\hat{y} \mid x = 1] = \hat{\alpha} + \hat{\beta}$$
$$E[\hat{y} \mid x = 1] - E[\hat{y} \mid x = 0] = \hat{\beta}$$
This is a very useful feature. Let’s move on to the case with two independent dummies. You
can imagine one dummy indicates gender (zero for male and one for female) and the other
indicates minority status (zero for non-minority and one for minority). There are four possible combinations of $(x_1, x_2)$: $(0,0)$, $(1,0)$, $(0,1)$ and $(1,1)$. In this case, the cloud of points consists of four columns of points floating above or below those four coordinates. Our model is:
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$$
Since we have two regressors, we can still get a visual interpretation of the plane that fits the cloud of points. The following chart illustrates the cloud of points and the regression plane. The height of the plane at each of the four coordinates $(0,0)$, $(1,0)$, $(0,1)$ and $(1,1)$ can be expressed in terms of the beta hats.
In this example, $\hat{\beta}_1 < 0$, $\hat{\beta}_2 > 0$, and $\hat{\beta}_1 + \hat{\beta}_2 < 0$. Let’s assume for illustrative purposes that $y$ is the wage and we are looking at a group of employees of a company. The regression coefficients tell us the expected value of $\hat{y}$ for each group:

The expected value of the wage for…    …is:
Non-minority males       $E[\hat{y} \mid x_1 = 0, x_2 = 0] = \hat{\beta}_0$
Non-minority females     $E[\hat{y} \mid x_1 = 1, x_2 = 0] = \hat{\beta}_0 + \hat{\beta}_1$
Minority males           $E[\hat{y} \mid x_1 = 0, x_2 = 1] = \hat{\beta}_0 + \hat{\beta}_2$
Minority females         $E[\hat{y} \mid x_1 = 1, x_2 = 1] = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2$
There are many more ways of using dummies. We will learn more about them later.
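A small sketch of the two-dummy wage example, with simulated data whose coefficient signs match the illustration above (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
female = rng.integers(0, 2, size=N)     # x1: 1 if female, 0 if male
minority = rng.integers(0, 2, size=N)   # x2: 1 if minority, 0 otherwise
wage = 20 - 3 * female + 1 * minority + rng.normal(scale=2, size=N)

X = np.column_stack([np.ones(N), female, minority])
b = np.linalg.solve(X.T @ X, X.T @ wage)   # [beta0_hat, beta1_hat, beta2_hat]

# Expected wages by group, as in the table above
print("non-minority males:  ", b[0])
print("non-minority females:", b[0] + b[1])
print("minority males:      ", b[0] + b[2])
print("minority females:    ", b[0] + b[1] + b[2])
```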
3.2.2 Splitting and warping
Sometimes our cloud of points doesn’t look linear. Can we still fit it with a linear structure? The answer is affirmative. Imagine that our cloud of points looks like $y$ is a polynomial in $x$. That is the case in the figure below.
Let’s start with a polynomial of degree $h$ in $x$:
$$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots + a_h x^h$$
If we define $x_{0i} = 1$, $x_{1i} = x_i$, $x_{2i} = x_i^2$, $x_{3i} = x_i^3$, …, $x_{hi} = x_i^h$, then we arrive at:
$$y_i = a_0 + a_1 x_{1i} + a_2 x_{2i} + a_3 x_{3i} + \cdots + a_h x_{hi}$$
This has a linear structure. All the regressors enter the model linearly (there aren’t any quadratic, cubic, or higher-degree terms in the regressors $x_1, \ldots, x_h$). Thus, although a linear structure sounds restrictive, it turns out that it isn’t. This is possible because we split and warp the regressors. In the case above, $x$ is split into $h$ regressors, and each of them is warped differently. Although the original relationship may be non-linear, we can find a specification with a linear relationship between $y$ and $x$, once we split it and warp it.
Notice that, in general, when we split and warp the regressors, the derivatives are no longer constant. In the case above, we have:
$$\frac{\partial y}{\partial x} = \sum_{j=1}^{h}\frac{\partial y}{\partial x_j}\frac{\partial x_j}{\partial x} = \sum_{j=1}^{h} a_j \frac{\partial x_j}{\partial x} = \sum_{j=1}^{h} a_j\, j\, x^{j-1}$$
Graphically, to understand what happens when we split and warp, we can focus on the case of a quadratic polynomial. Assume we have only one explanatory variable $x$, and the cloud of points looks like a parabola that opens upward. Let’s split $x$ into $x_1 = x$ and $x_2 = x^2$. The graph below shows the parabolic relationship between $y$ and $x$ as blue points on the wall at the left. On the floor you can see the relationship between $x_1$ and $x_2$ (the latter is the square of the former). When we run a regression of $y$ on $x_1$ and $x_2$, we are choosing the right height and tilt of the blue plane to fit the cloud of red points. The red points you see in the graph lie on the blue plane.
The takeaway is that, by splitting and warping our regressors, we can fit non-linear looking
clouds with linear structures.
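A brief sketch of splitting and warping with a quadratic, assuming a made-up data-generating process:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 300
x = rng.uniform(-3, 3, size=N)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(size=N)   # parabolic cloud

# Split and warp: x1 = x, x2 = x^2, then fit a *linear* structure in (x1, x2)
X = np.column_stack([np.ones(N), x, x**2])
a_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(a_hat)   # close to [1.0, 0.5, 2.0]

# The slope dy/dx = a1 + 2*a2*x is no longer constant; evaluate it at x = 0 and x = 1
print(a_hat[1] + 2 * a_hat[2] * 0, a_hat[1] + 2 * a_hat[2] * 1)
```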
3.2.3 Logarithms
The logarithm is a recurring tool in economics because of its nice properties. Sometimes we use logarithmic transformations of our regressors. For instance, we may be interested in the regression model:
$$y_i = \beta_0 + \beta_1 \ln(x_i) + \varepsilon_i$$
If we take the derivative of $y$ with respect to $x$ and multiply it by a change in $x$ equal to $dx$, we get:
$$\frac{\partial y}{\partial x}\, dx = \beta_1 \frac{dx}{x}$$
Assume $dx \approx 1\%$ of $x$, so that $dx/x \approx 0.01$. In this case, $\beta_1/100$ is interpreted as the change in $y$ associated with a one percent increase in $x$.
Sometimes we use the logarithm of the dependent variable:
$$\ln(y_i) = \beta_0 + \beta_1 x_i + \varepsilon_i$$
The interpretation differs from the one in the previous example. To show it, let’s apply the antilogarithm to the above expression, in other words, compute $e^{\ln(y)}$:
$$y = e^{\beta_0 + \beta_1 x_i + \varepsilon_i}$$
In the above expression, the derivative of $y$ with respect to $x$ is:
$$\frac{\partial y}{\partial x} = \beta_1 e^{\beta_0 + \beta_1 x_i + \varepsilon_i}$$
If we divide by $y$ we get:
$$\frac{1}{y}\frac{\partial y}{\partial x} = \beta_1$$
Thus, the coefficient $\beta_1$ can be interpreted as the change, expressed as a fraction of $y$, associated with a one-unit change in $x$. Notice that nothing prevents us from using this last model when $x$ is a dummy variable. The interpretation would be the same: $\beta_1$ is the change as a fraction of $y$ associated with “turning on” the dummy variable $x$.
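A short sketch contrasting the two log specifications, using simulated data with arbitrary coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500
x = rng.uniform(1, 100, size=N)

# Model 1: y = b0 + b1*ln(x) + e  -> a 1% increase in x shifts y by about b1/100
y1 = 2.0 + 3.0 * np.log(x) + rng.normal(size=N)
X1 = np.column_stack([np.ones(N), np.log(x)])
print(np.linalg.solve(X1.T @ X1, X1.T @ y1))           # close to [2.0, 3.0]

# Model 2: ln(y) = b0 + b1*x + e -> a one-unit increase in x changes y by roughly 100*b1 percent
y2 = np.exp(0.5 + 0.02 * x + rng.normal(scale=0.1, size=N))
X2 = np.column_stack([np.ones(N), x])
print(np.linalg.solve(X2.T @ X2, X2.T @ np.log(y2)))   # close to [0.5, 0.02]
```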
3.2.4 Turning continuous variables into dummies
Sometimes it’s more convenient to define a group of dummies to represent different intervals of a continuous variable. For instance, instead of age or income, we may want to have age groups or income brackets. Assume $z$ is an independent variable. Let:
$$x_0 = \mathbf{1}(z \in [0, a))$$
$$x_1 = \mathbf{1}(z \in [a, b))$$
$$\vdots$$
$$x_k = \mathbf{1}(z \in [h, i))$$
We have a regression model with $k+1$ dummies (the first one playing the role of the constant), one for each interval of $z$:
$$y_i = \beta_0 x_{0i} + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i$$
Using dummies this way may help us fit complicated patterns in the data in a very simple manner. The graph below shows an example with an intercept and $k$ dummies:
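A brief sketch of how interval dummies like these can be created with pandas (the age variable and the cut points are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
age = rng.integers(18, 80, size=10)

# One dummy per age bracket; pd.get_dummies creates the indicator columns
brackets = pd.cut(age, bins=[0, 30, 45, 60, 100], right=False,
                  labels=["18-29", "30-44", "45-59", "60+"])
dummies = pd.get_dummies(brackets)
print(pd.DataFrame({"age": age}).join(dummies))
```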
3.2.5 Kinks and jumps
Sometimes we expect heterogeneity in the regression coefficients across subgroups. That heterogeneity may come in the form of kinks or jumps. Formally, we say that there are heterogeneous coefficients. The graphs below present some examples. If we simply use a model like $y_i = \alpha + \beta x_i + \varepsilon_i$, we would be missing the kink or the jump.
We can incorporate the possibility of kinks and jumps. To do that, let $d_i$ be such that:
$$d_i = \begin{cases} 1 & \text{if } x_i \ge a \\ 0 & \text{if } x_i < a \end{cases}$$
To include the possibility of heterogeneous coefficients based on the value of $a$, our model would become:
$$y_i = \alpha + \beta x_i + \gamma d_i + \delta d_i x_i + \varepsilon_i$$
In this case, we say “the variable $d_i$ interacts with $x_i$” or that “there are interaction terms of $x_i$ and $d_i$.” Notice that now we have two intercepts and two slopes. Which is applicable depends on whether $x_i \ge a$ or $x_i < a$. For $x_i < a$, the model is:
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
whereas for $x_i \ge a$, the model is:
$$y_i = \alpha + \beta x_i + \gamma + \delta x_i + \varepsilon_i$$
The intercept would be $\alpha + \gamma$ and the slope would be $\beta + \delta$. To show the heterogeneity more clearly, we can write the model for both cases as:
$$y_i = (\alpha + \gamma d_i) + (\beta + \delta d_i)\, x_i + \varepsilon_i$$
Assuming there is a kink and a jump at $a$, the graph below shows how our model would fit the cloud.
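A sketch of fitting a kink and a jump at a threshold $a$, with simulated data (the threshold and the coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 400
a = 5.0                                        # threshold where the kink/jump occurs
x = rng.uniform(0, 10, size=N)
d = (x >= a).astype(float)                     # d_i = 1 if x_i >= a, 0 otherwise
y = 1.0 + 0.5 * x + 2.0 * d + 1.5 * d * x + rng.normal(size=N)

# y_i = alpha + beta*x_i + gamma*d_i + delta*d_i*x_i + eps_i
X = np.column_stack([np.ones(N), x, d, d * x])
alpha, beta, gamma, delta = np.linalg.solve(X.T @ X, X.T @ y)
print("intercept/slope below a:", alpha, beta)
print("intercept/slope above a:", alpha + gamma, beta + delta)
```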
As an exercise, assume $y_i$ is wage, $x_i$ is years of schooling, $d_i = 1$ if individual $i$ is female and $d_i = 0$ otherwise. How would you interpret the coefficients in the following model?
๐‘ฆ๐‘– = ๐›ผ + ๐›ฝ๐‘ฅ๐‘– + ๐›พ๐‘‘๐‘– + ๐›ฟ๐‘‘๐‘– ๐‘ฅ๐‘– + ๐œ€๐‘–
3.2.6 Interactions
Some relationships between regressors and the dependent variable may be complicated. In the regression context, we call the product of two or more regressors an interaction. We can have interactions with dummies (as we saw before) or with any other regressors. Suppose we have two independent variables, $x_1$ and $x_2$. A model with an interaction between $x_1$ and $x_2$ is:
$$y_i = \alpha + \beta x_{1i} + \gamma x_{2i} + \delta x_{1i} x_{2i} + \varepsilon_i$$
If we take partial derivatives of the dependent variable with respect to each of the two regressors, we don’t get constant terms. Instead, we get values that vary:
$$\frac{\partial y_i}{\partial x_{1i}} = \beta + \delta x_{2i}$$
$$\frac{\partial y_i}{\partial x_{2i}} = \gamma + \delta x_{1i}$$
The slopes vary with the other explanatory variable. Like any derivative, the terms above can be evaluated at different values of $x_1$ and $x_2$. Since the slopes vary across observations, we say they are heterogeneous. As you can imagine, we can have many types of interactions. They may involve more than two independent variables. However, it is important to keep in mind that too many interactions may obscure the meaning of our regression.
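A minimal sketch of an interaction between two continuous regressors and the resulting heterogeneous slopes (simulated data, arbitrary coefficients):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 500
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(size=N)

X = np.column_stack([np.ones(N), x1, x2, x1 * x2])
alpha, beta, gamma, delta = np.linalg.solve(X.T @ X, X.T @ y)

# The slope with respect to x1 is beta + delta*x2: it varies with x2
for val in (-1.0, 0.0, 1.0):
    print(f"dy/dx1 at x2={val}: {beta + delta * val:.2f}")
```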
3.2.7 De-meaning or centering
In some circumstances we may find it convenient to de-mean the data, i.e. to center the data around its mean. What does that do to our estimates? Consider the model $y_i = \alpha + \beta x_i + \varepsilon_i$. As we saw before, the regression coefficients would be:
$$\hat{\beta} = \frac{cov(x,y)}{var(x)}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$
What if instead we use $x_i^* = (x_i - \bar{x})$ as our explanatory variable? The process of subtracting the mean to create a new variable is called de-meaning or centering. If we did so, our model would be $y_i = \alpha^* + \beta^* x_i^* + \varepsilon_i$. A natural question is, would $\hat{\beta}$ and $\hat{\beta}^*$ be the same? What about $\hat{\alpha}$ and $\hat{\alpha}^*$? Let’s compute them:
$$\hat{\beta}^* = \frac{cov(x^*, y)}{var(x^*)} = \frac{cov(x - \bar{x}, y)}{var(x - \bar{x})} = \frac{cov(x,y)}{var(x)} = \hat{\beta}$$
Thus, the slope is unchanged. But that’s not the case with the intercept:
$$\hat{\alpha}^* = \bar{y} - \hat{\beta}(0) = \bar{y}$$
If we de-mean the regressors, then we can interpret our estimates as “evaluated at the mean.”
This is particularly interesting for the intercept, since it becomes the average for the dependent
variable. As an exercise, think what would happen if we also de-meaned the dependent variable.
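A quick numerical check of the de-meaning result, using made-up data:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(10, 2, size=200)
y = 3.0 + 1.5 * x + rng.normal(size=200)

def ols(xvar, yvar):
    # Univariate OLS: slope = cov(x, y)/var(x), intercept = ybar - slope*xbar
    b = np.cov(xvar, yvar, ddof=0)[0, 1] / np.var(xvar)
    return yvar.mean() - b * xvar.mean(), b

a_hat, b_hat = ols(x, y)                 # original regressor
a_star, b_star = ols(x - x.mean(), y)    # de-meaned regressor
print(b_hat, b_star)                     # slopes are identical
print(a_star, y.mean())                  # intercept becomes the mean of y
```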
3.2.8 Hierarchical and rectangular forms
Imagine we have yearly data on sales for three sales representatives. The data covers the years
2015 through 2018. There are (at least) two ways of structuring the data into a table. The first,
shown below, is what is known as a rectangular form or wide shape.
Sales representative      Sales
                          2015    2016    2017    2018
Anne                      120     129     108     112
Bob                       98      92      105     121
Chris                     89      82      97      98
In this case, each row denotes a sales representative and the columns show the sales across
different years. Notice that, as we accumulate data for more years, the number of columns would
grow. A second way of presenting the same data is by using what is known as a hierarchical
form or long shape. Below is the same data but in hierarchical form. Notice that now each row is
a unique combination of sales representative and year, and there is only one column for sales. The first level of our hierarchy is given by the sales representative. The second level is given by the year. Adding more years in this case would increase the number of rows.
Sales representative    Year    Sales
Anne                    2015    120
Anne                    2016    129
Anne                    2017    108
Anne                    2018    112
Bob                     2015    98
Bob                     2016    92
Bob                     2017    105
Bob                     2018    121
Chris                   2015    89
Chris                   2016    82
Chris                   2017    97
Chris                   2018    98
The data can come to you in many different shapes. You must be able to arrange it so that you can analyze it any way you desire. To do that, it’s helpful to keep in mind these two general ways
of organizing a table. Of course, when we have more complex data (more hierarchies and more
variables), there are more ways to organize them. Some ways could be partly hierarchical and
partly rectangular.
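A short pandas sketch of moving between the two shapes, using the sales numbers from the tables above:

```python
import pandas as pd

# Rectangular (wide) form: one row per sales representative
wide = pd.DataFrame({
    "Sales representative": ["Anne", "Bob", "Chris"],
    "2015": [120, 98, 89], "2016": [129, 92, 82],
    "2017": [108, 105, 97], "2018": [112, 121, 98],
})

# Wide -> long (hierarchical): one row per representative-year combination
long = wide.melt(id_vars="Sales representative", var_name="Year", value_name="Sales")

# Long -> wide again
back_to_wide = long.pivot(index="Sales representative", columns="Year", values="Sales")
print(long)
print(back_to_wide)
```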
4 Probability and regression
At this point, we already have the first layer of our cake structure. We have a mathematical method (regression) to summarize the relationship between a dependent variable ($y$) and a group of independent variables ($x_1, x_2, \ldots, x_k$). We start with a cloud of points (our data), and we fit it with a linear structure. We can fit clouds with all kinds of shapes. They don’t have to look like lines or planes. They can be curvy, and they can have jumps and kinks. Now, we will proceed to the second layer, which incorporates probability.
4.1 Sampling and estimation
Let’s start with a silly example. Assume I measure your expenditures on entertainment over the last three months and plot them against the last two digits of your Social Security number. Would there be any correlation? You can correctly guess there should be no correlation. The graph below illustrates this example. The cloud represents different levels of expenditures along the vertical axis, and the last two digits of the Social Security number along the horizontal axis.
We know the actual value of the slope should be zero because there is no reason those two
variables should be connected. That’s represented by the blue line. However, what if we
randomly got two samples like the ones depicted in the graph below? If our sample consisted of
the observations denoted by triangles, a regression using that sample would produce a negative
slope (red line). In contrast, if our sample consisted of the observations denoted by squares, the
regression would produce a positive slope (green line).
When we use samples, by sheer luck we may get positive or negative slopes even if the actual value of the slope should be zero. Probability enters regressions through the notion of sampling.
4.2 Nomenclature
Let’s introduce some useful nomenclature. We use the term parameters for the regression coefficients we would get if we ran a regression using data for the entire population or universe. In contrast, we use the term estimates for the regression coefficients we get when we run a regression using a sample. Colloquially, parameters are sometimes referred to as the betas or the true betas, whereas estimates are referred to as the beta hats. The hat comes from the convention of adding the symbol ^ on top of the coefficient to distinguish it from the parameter. We hope the estimates are informative of the parameters. In fact, that’s the only reason we care about them.
In the real world, we don’t observe the population or the universe. We only observe samples.
Our challenge is to determine if our estimates are close to or far from the parameters. Notice something important and intuitive in the example about the Social Security numbers that is true more generally. First, the less $y$ varies (relative to $x$), the smaller the chances of getting very different regression coefficients across random samples—the beta hats would be more similar across samples. Second, the larger the sample, the smaller the chances the regression coefficients will differ by much from the population regression coefficient—the beta hat would be more similar to the true beta. Those are two general principles worth keeping always in mind.
4.3 The magic of the Central Limit Theorem
Imagine that, given a population of size $M$, we draw one million random samples of size $N < M$. For each sample, we run the regression $y_i = \alpha + \beta x_i + \varepsilon_i$ and get an estimate of beta (that is, we get a $\hat{\beta}$). We would have one million such beta hats. If we create a histogram with
all those values, how would it look? By the Central Limit Theorem, we know it would look like
a normal distribution centered at the true beta. This property is independent of anything else. It
only depends on the concept of random sampling. This is an awesome result and we get a lot of
mileage out of it. The graph below shows how the one-million beta hat histogram would look.
In our Social Security number example we have that $\beta = 0$ but, because of sampling, we would get $\hat{\beta} > 0$ half of the time and $\hat{\beta} < 0$ the other half. However, estimates close to the parameter are more likely than estimates far from it—look at the chart above. If we knew how much $\hat{\beta}$ varies, then we could calculate the probability of $\hat{\beta}$ (the estimate) being close to or far from $\beta$ (the parameter).
4.4 Standard error
We measure how much $\hat{\beta}$ varies using its standard deviation. We call the standard deviation of $\hat{\beta}$ the standard error. There are two ways to estimate the standard error of $\hat{\beta}$. One way is bootstrapping. It consists of treating our sample as the population, and then drawing many samples from it with replacement, so that every data point remains available on each draw. By taking samples with replacement we can get a very good idea of how much our estimate varies based exclusively on the luck of the draw. This is very easily done with today’s computers. Thus, there is no excuse not to do it. We can select the number of repetitions we want (100, 1,000, 10,000 or a million). Notice that, ceteris paribus, larger sample sizes mean smaller standard errors because larger sample sizes produce estimates closer to the parameter and therefore they vary less.
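A sketch of the bootstrap standard error for $\hat{\beta}$, resampling the data with replacement (the data and the number of repetitions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
N = 200
x = rng.normal(size=N)
y = 1.0 + 0.8 * x + rng.normal(size=N)

def slope(xs, ys):
    return np.cov(xs, ys, ddof=0)[0, 1] / np.var(xs)

reps = 1_000
boot = np.empty(reps)
for r in range(reps):
    idx = rng.integers(0, N, size=N)      # sample N observations with replacement
    boot[r] = slope(x[idx], y[idx])

print("bootstrap standard error of beta_hat:", boot.std(ddof=1))
```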
We can also proceed in the classic way and estimate the standard error using the residuals of
our (only one) regression using the full original sample. Based on assumptions that we won’t
review here (some of which aren’t verifiable), you can approximate the standard error this way. It is important to know this method because most people use it. The standard error of $\hat{\beta}$ is estimated based on how large or small the residuals are, using the following formula:¹
$$S.E.(\hat{\beta}) = \sqrt{var(\hat{\beta})} = \sqrt{\hat{\sigma}^2 (X'X)^{-1}}$$
The above expression also decreases with the sample size through the term $(X'X)^{-1}$. To see it, notice that, in the case of a univariate regression, multiplying by the term $(X'X)^{-1}$ is equivalent to dividing by the term $\sum_{i=1}^{N}(x_i - \bar{x})^2$, which is increasing in $N$. You may notice that we introduced the term $\hat{\sigma}^2$:
$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}\hat{\varepsilon}_i^2$$
Whichever way we measure the standard error (bootstrapping using many regressions or based on the residuals of a single regression), the idea is that our $\hat{\beta}$ is normally distributed with mean $\beta$ and variance equal to the square of the standard error. We express that statement as:
$$\hat{\beta} \sim N\!\left(\beta,\, [S.E.(\hat{\beta})]^2\right)$$
Let’s assume the true beta is zero. This is an arbitrary but very useful assumption. Given a standard error, we can compute the probability of $\hat{\beta}$ being in any interval we want. Let’s focus on symmetric intervals around zero. The graph below shows the distribution of $\hat{\beta}$ assuming $\beta = 0$, and given a standard error of one (as an example). Given an interval $[-a, a]$, where $a$ is a positive number, we can easily calculate the probability of $\hat{\beta}$ being outside that interval. We could do that calculation in a spreadsheet or any statistical software.
¹ There are better and slightly more sophisticated formulas to estimate the standard error that account for some other factors. We will briefly discuss them later.
We can also proceed backwards. Start with a given probability. We can find the symmetric interval around zero such that $\hat{\beta}$ would fall outside of it with that given probability. The graph below illustrates that situation. If we start with a probability of, say, 0.05 of $\hat{\beta}$ falling outside of the interval $[-b, b]$, then we can determine the value of $b$.
To summarize, estimates are sample regression coefficients and parameters are population regression coefficients. Because of sampling, we think of estimates as random variables. Estimates are normally distributed, and their means are the parameters. The Central Limit Theorem doesn’t require any assumption on the distribution of $y$, $x$ or $\varepsilon$. Based on the Central Limit Theorem result, we can formulate and test hypotheses.
4.5 Significance
Once we know the shape of the distribution of the estimates (a normal distribution), we may find it useful to hypothesize that it is centered at zero. Another way to state the same hypothesis is that there is no relation between $x$ and $y$ in the population, or that the true beta is zero. However, as we saw before, even if the true beta is zero, there is a chance we could get a sample for which the beta hat is not zero. Thus, we never know with certainty if the hypothesis is true or false. But we can check whether the data lend little or a lot of support to it.
We can test the hypothesis $\beta = 0$ based on the distribution of $\hat{\beta}$ (under the assumption that it is centered at zero) and the actual sample coefficient we obtain. With those ingredients, we define a rejection region associated with a confidence level (as you previously saw in your Stats course). Keep in mind that, since there is uncertainty, the best we can do is to live with a level of confidence.
Sometimes it helps to understand these issues in terms of a coin. Suppose we’re interested in determining whether a coin is fair (i.e. it isn’t loaded). By tossing it one hundred times (i.e. by getting one sample of size $N = 100$) we’ll never know for sure if it’s fair or not. But we may get a
very good idea. If out of one hundred tosses we get ninety-five heads, we have good reasons to
believe the coin isn’t fair. Why? Because, assuming the coin is fair, getting ninety-five heads or
more is extremely unlikely. What about eighty or more heads? Seventy or more? As we approach
fifty heads (what we would expect with a fair coin), the probability gets closer to fifty percent.
For instance, the probability of observing sixty heads or more is one in thirty-five (0.0284). Still
small, but not microscopic anymore. Lastly, the probability of observing fifty-five heads or more
is close to one in five (0.1841).
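The coin probabilities above can be checked directly with the binomial distribution (a quick sketch using scipy):

```python
from scipy.stats import binom

n, p = 100, 0.5                      # one hundred tosses of a fair coin
for k in (95, 80, 70, 60, 55, 50):
    # Probability of observing k heads or more: P(X >= k) = P(X > k - 1)
    print(k, binom.sf(k - 1, n, p))
# P(X >= 60) is about 0.0284 and P(X >= 55) is about 0.1841, as stated above
```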
When we produce estimates using regressions, we have a similar situation. It’s hard to reconcile estimates that are far from zero with a true parameter equal to zero. Just as in the coin toss example, we can compute the probabilities associated with each value of $\hat{\beta}$. Remember that once we know the standard error of $\hat{\beta}$, we also know the hypothetical distribution of $\hat{\beta}$ assuming $\beta = 0$. The graph below shows such a distribution. Notice that the location of the distribution doesn’t depend on the value of $\hat{\beta}$—we assumed it’s centered at zero. What does the distribution mean intuitively? Given $\beta = 0$, values of $\hat{\beta}$ far from zero (be they positive or negative) are unlikely. Thus, if our $\hat{\beta}$ is very large or very small, then it is highly unlikely that it comes from a distribution centered at zero. Always keep in mind that the standard error is our measure of how far $\hat{\beta}$ is from 0, because it is the standard deviation of the distribution of $\hat{\beta}$.
Let’s revisit the normal distribution you’ve studied before. The graph below shows the
probability by intervals for a random variable that is normally distributed with mean zero and
standard deviation one (the horizontal axis is expressed in standard deviations). For instance, the
probability that such variable falls between 0 and 1 is 0.341. Since the distribution is symmetric,
the probability of the variable falling between −1 and 0 is also 0.341. Thus, the probability of
falling between −1 and 1 is 0.682, which is equal to 2 × 0.341. The probability of the variable falling
outside of the interval (−1,1) is 0.318, which is 1 – 0.682. More generally, we can compute the
probability of the variable falling inside or outside any interval we want.
We can also proceed the other way around. We can start with a probability, say 0.90 or 90%,
and find the symmetric interval that corresponds to that probability. An interval is defined by its
upper and lower bounds. The graph below shows the values of the upper and lower bounds
given three probabilities: 0.99, 0.95 and 0.90. As before, the horizontal axis is expressed in
standard deviations.
Given the value of an estimate $\hat{\beta}$ (which may be positive or negative), we can compute the probability of obtaining estimates (drawn from the same distribution centered at zero) that are greater than $|\hat{\beta}|$ or smaller than $-|\hat{\beta}|$. Such probability is known as the p-value associated with the estimate $\hat{\beta}$. The graph below presents an example with $\hat{\beta} = 1.405$ and a standard error of 1. The probability of obtaining an estimate above 1.405 is 0.08, and the probability of obtaining an estimate below −1.405 is also 0.08. Thus, the probability of obtaining an estimate that is farther away from zero than 1.405 is 0.16, which is 2 × 0.08. In other words, the p-value of the estimate 1.405 is 0.16. It should be clear that the probability of getting an estimate closer to zero than 1.405 is 0.84.
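The p-value computation from this example in code (a sketch using the normal distribution):

```python
from scipy.stats import norm

beta_hat, se = 1.405, 1.0
z = beta_hat / se                        # how many standard errors away from zero
p_two_sided = 2 * norm.sf(abs(z))        # probability of an estimate farther from zero than |beta_hat|
p_one_sided = norm.sf(abs(z))            # right-tail probability only
print(p_two_sided, p_one_sided)          # roughly 0.16 and 0.08
```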
The definition of p-value stated above corresponds to two-sided tests. We can also define the p-value for one-sided tests. In that case, we only care about either the probability of getting estimates that are larger than our estimate or smaller than our estimate. As you can see in the graph above, that’s equivalent to looking at only one of the tails of the distribution. The p-value in a right-side
test (which measures the probability of an estimate being greater than 1.405) is 0.08. Because the distribution is symmetric and centered at zero, that’s the same as the p-value in a left-side test (which measures the probability of an estimate being smaller than −1.405).
Now that we have reviewed some probability notions, we can introduce a crucial concept. If $\hat{\beta}$ falls outside of the 95% interval centered at 0, we say it is statistically significant (or statistically different from zero) at 95% confidence. If it falls inside, we say that it is statistically insignificant (or statistically not different from zero). We can use other levels of confidence. Traditionally, 95%
(or statistically not different from zero). We can use other levels of confidence. Traditionally, 95%
is the norm. However, with larger samples, we can be more demanding and use 99% or 99.9%.
Notice that the definition of statistical significance can also be expressed in terms of p-values. If
the p-value is below 0.05, then we say that the estimate is statistically significant at 95%
confidence. If the p-value is greater than 0.05, we say the estimate is insignificant or not
significant.
As you can imagine, the definition of significance can be adjusted to reflect one-sided tests if that’s what we need. Imagine that a regression produces an estimate $\hat{\beta} = 1.8$ and the associated standard error is 1. Is that estimate statistically significant? The answer depends on the level of confidence we use and whether we are performing a one- or two-sided test. If we use a confidence level of 95% (or higher) in a two-sided test, the estimate is not significant (see the chart above). But if we use 90%, it is significant, since it lies outside of the 90% interval that has bounds −1.64 and 1.64 (1.80 > 1.64). If we use a one-sided test, then the estimate would be significant at 95% confidence because the interval’s bounds are −∞ and 1.64. In sum, you cannot say a priori whether an estimate is significant or not just by looking at it. You need to know (1) whether we’re talking about a one- or two-sided test, (2) the p-value of the estimate, and (3) the confidence level.
Notice that, all else constant, significance is directly affected by the sample size. We mentioned that larger samples result in smaller standard errors. That means the distribution of the estimates is more narrowly concentrated around the assumed value of the parameter. Thus, any non-zero estimate will eventually become significant if we keep increasing the sample size.
So far, we’ve assumed $\beta = 0$. However, we could assume any other value for $\beta$ and test whether our estimate is likely to be coming from a distribution centered at that (non-zero) value. That would be similar to assuming a loaded coin that lands heads with a probability different from one half. Intuitively, given an estimate, some parameter values would be more “reasonable” than others. After all, it’s more believable that the estimate $\hat{\beta} = 21.3$ comes from a distribution centered at 20 than from a distribution centered at 100. We will talk about this in the next two sections.
4.6 Confidence intervals
Given a confidence level, what parameters would be consistent with our estimate? We have a Goldilocks situation. Some parameter values seem too big for our estimate, while others seem too small. The graph below illustrates this situation. Imagine our estimate is $\hat{\beta}$, and we consider two possible values of the true parameter, $\beta^*$ and $\beta^{**}$. If the distribution of $\hat{\beta}$ were centered at any of
those two values, it would be very unlikely to get $\hat{\beta}$ (just like it’d be very unlikely to get fifty heads in one hundred tosses using a coin heavily loaded in favor of heads, or using another coin heavily loaded against heads).
Which values of the parameter seem “right” given our estimate? The answer is very intuitive. It’d be the values that are close to our estimate. Closeness to the estimate makes those parameter values appear more reasonable. One simple way to measure how close a possible parameter value is to our (known) estimate is to look at the p-value we would get under the assumption that the true parameter takes that particular value.
Imagine we adopt the following rule. We pick a confidence level, say 95%. Then we determine all the values of $\beta$ for which the p-value of our estimate would be above the critical value, which is defined as one minus the confidence level we picked. In this case the critical value is 0.05. We would end up with an interval of possible values of $\beta$. Colloquially speaking, it wouldn’t surprise us if our estimate $\hat{\beta}$ came from a distribution centered anywhere within that interval—the probability of such an event wouldn’t be too small.
To make things easy, we can focus on the lower and upper bounds of the interval just described. If the critical value is 0.05, we need to find the values of the parameter such that the p-value of $\hat{\beta}$ is precisely 0.05. There are two such parameter values. One will be greater than $\hat{\beta}$ and the other will be smaller. The graph below illustrates this point. If we assume the parameter is equal to $\beta'$, the p-value of $\hat{\beta}$ is 0.05. Similarly, if we assume the parameter is equal to $\beta''$, then the p-value of $\hat{\beta}$ is also 0.05. For any parameter value between $\beta'$ and $\beta''$, the p-value of $\hat{\beta}$ is greater than 0.05.
All possible betas for which the p-value of β̂ is greater than (or equal to) 0.05 constitute the
95% confidence interval of our estimate. In the example above, the 95% CI (as we usually
abbreviate the confidence interval) is given by (β′, β′′). In layperson terms, the confidence interval
tells us which values of the parameter are consistent with our estimate. Our estimate is not
statistically different from those parameter values (at the given level of confidence). This is very
helpful in many contexts.
A common mistake is to say that the parameter falls inside our confidence interval with a 95%
probability. Why is this wrong? The parameter is fixed. It isn't a variable—let alone a random
one. Put differently, the parameter either is in the interval or it isn't. We cannot make probability
statements about it.
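To make the mechanics concrete, here is a minimal Python sketch that builds a 95% confidence interval around an estimate using the normal approximation. The estimate and standard error are hypothetical numbers, not taken from any regression in these notes.

    # Minimal sketch: 95% confidence interval from an estimate and its standard error
    from scipy.stats import norm

    beta_hat = 1.80   # hypothetical estimate
    se = 1.00         # hypothetical standard error
    z = norm.ppf(0.975)                          # critical value for 95% confidence (about 1.96)
    ci = (beta_hat - z * se, beta_hat + z * se)  # interval centered at the estimate
    print(ci)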
4.7 Hypothesis testing
Often we would like to make decisions based on β, but we don't observe it. We only
observe β̂. However, we know β̂ and β are related. First, β̂ comes from a normal distribution
centered at β. Second, we have a proxy for the standard deviation of that distribution—the
standard error of β̂. Thus, we can use β̂ as a piece of information about β the same way we use a
sample average to inform us of the population average. How do we do this? We use hypothesis
testing.
A very common hypothesis (usually denoted by H₀) is β = 0. If β̂ is far from 0, then we reject
that hypothesis. However, we don't reject it with certainty. We reject it with some level of
confidence picked a priori (usually 95%). When β̂ is close to 0, we don't reject the hypothesis.
However, not rejecting a hypothesis is different from accepting it. To illustrate that, imagine two
different hypotheses (e.g. β = 0 and β = 0.1) are tested using the same regression and neither is
rejected. They cannot both be accepted because they are different (0 ≠ 0.1).
The measure of how far β̂ is from β is given by the standard error. However, we don't know
the standard error with certainty. We estimate it based on our sample, through bootstrapping or
the classic way based on the residuals. For our hypothesis tests we use a t distribution in lieu of
a normal distribution because we only have a proxy for the standard deviation. The difference
between the estimate and the parameter, divided by the standard error, is a random variable
distributed t with N − k − 1 degrees of freedom:
(β̂ − β) / S.E. ~ t_{N−k−1}
where N is the number of observations and k is the number of explanatory variables. Whenever
we have more than one hundred degrees of freedom (which is almost always the case), the t
distribution is indistinguishable from a normal distribution. Hence the focus in these notes on the
latter. However, formally we use the t distribution for hypothesis testing and confidence
intervals. The ratio (β̂ − β)/S.E. is the t-statistic.
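As a rough illustration, the sketch below computes a t-statistic and its two-sided p-value for the hypothesis β = 0, using made-up values for the estimate, the standard error, the sample size and the number of regressors.

    # Sketch: t-statistic and two-sided p-value for the hypothesis beta = 0 (hypothetical numbers)
    from scipy.stats import t

    beta_hat, se = 1.80, 1.00      # hypothetical estimate and standard error
    N, k = 500, 3                  # hypothetical sample size and number of regressors
    t_stat = (beta_hat - 0) / se
    p_value = 2 * t.sf(abs(t_stat), df=N - k - 1)   # two-sided p-value
    print(t_stat, p_value)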
Knowing the distribution of the t-statistic allows us to formulate different hypothesis tests.
It's crucial to note that the same estimate may be significant in some regressions but not in others,
depending on the standard error. In other words, the same hypothesis may or may not be rejected
with the same estimate, depending on the standard errors. Remember that significance is the result
of comparing the magnitude of the estimate with how much we would expect it to vary across
samples. For instance, assume two different standard errors, 0.5 and 1, and the hypothesis β = 0.
The two distributions of β̂ are depicted in the graph below. For which values of β̂ do we reject
the hypothesis H₀: β = 0 in each case? The shaded areas denote the rejection regions at a 95%
confidence level. Try different estimate values and convince yourself of the different conclusions.
As the sample size increases, the size of the standard error decreases. To see this intuitively,
remember that estimates from larger samples more closely resemble the population parameter.
Therefore, there is less variation across estimates produced with larger samples. Thus, two
identical estimates may lead to different conclusions about the same hypothesis if those estimates
come from samples with different sizes.
With these tools we can test many hypotheses. We can hypothesize that β takes any particular
value of interest to us (1.5, 3, −1.2, etc.) and test it. In this context, confidence intervals are very
useful. Given a level of confidence and an estimate β̂, a confidence interval tells us all the values
for which we wouldn't reject the hypothesis that the parameter equals any of those values.
Colloquially, a confidence interval gives us a range of parameter values of distributions from
which our estimate is likely to come.
4.8 Joint-hypothesis tests
In the same regression, β̂₀, β̂₁, β̂₂, …, β̂_k aren't independent random variables. In general, they
are correlated. To see this, think of the original example using Social Security numbers and
expenditures on entertainment. We mentioned the possibility of getting samples for which the
estimate of the slope would be positive or negative. Greater positive slopes come accompanied
by lower intercepts, whereas greater negative slopes come accompanied by greater
intercepts. We can formulate hypothesis tests that involve more than one estimate at a time. For
instance, we can test whether the sum of two estimates is equal to one, whether the ratio of two
estimates is equal to two, etc. Statistical software does that for us in an incredibly easy way. The
underlying ideas about significance and confidence intervals are the same.
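For instance, a joint test of the hypothesis that two slopes sum to one can be sketched as follows. The variable names and the simulated data are hypothetical, and the f_test call is one way statistical software handles this kind of restriction.

    # Sketch: joint test that two slopes sum to one (hypothetical data and names)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
    df["y"] = 1 + 0.6 * df["x1"] + 0.4 * df["x2"] + rng.normal(size=200)

    results = smf.ols("y ~ x1 + x2", data=df).fit()
    print(results.f_test("x1 + x2 = 1"))   # F test of the joint restriction on the two slopes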
5 The economics of regression
So far, we’ve discussed the mathematical and probability aspects of regressions. Now we are
moving on to the economics. Always keep in mind that regression is a tool. How we should use
it depends on the question we are trying to answer. We can broadly classify most questions into
three uses: descriptive, predictive and prescriptive.
5.1 Descriptive use
The greatest asymmetry between econometrics textbooks and the practice of econometrics is
in the emphasis on describing the data. Explicitly or implicitly, econometric textbooks focus on
causal relationships and assume we already have a theory. In practice, the way we model
phenomena (how we think the situation under analysis works) comes after observing the data.
That isn’t cheating, as some theoretical extremist may suggest. It’s the scientific method. We first
observe the world, then we come up with ideas about how it works.
In business, we start with the overall goal of improving the performance of the organization
(reducing churn, increasing loyalty, reducing employee turnover, decreasing unused
promotions, etc.). Then we look at the data to get ideas. What seems to be associated with what?
Is churn associated with gender? Are older customers more loyal? Is turnover associated with
personality traits measured by the human resources department? Is the rate of unopened
promotional emails related to the time of day they are sent? Exploring the world through the lens
of data allows us to find problems or areas of opportunity, and then come up with potential
solutions or ideas.
How do we explore the data? Visual inspection is usually insufficient or not feasible. We have
many variables and we cannot plot more than three dimensions at the same time. The analytical
“weapons of choice” for practitioners are partial correlation coefficients and conditional averages,
which are computed using regressions. The difference with regular or naïve correlations and
averages is that with partial correlation coefficients and conditional averages, we “hold everything
else constant,” “control for other factors,” or “adjust for other variables.”
Let's start with the use of regression coefficients as partial correlation coefficients. The idea
is closely related to the mathematical concept of a partial derivative. Partial correlation
coefficients offer numerical answers to the question: what is the relation between y and x holding
everything else constant? Think in terms of the regression model:
y_i = α + βx_i + γw_i + δz_i + ε_i
When we explore the relation between y and x, we want to hold constant w and z. If we simply
eyeball the data (say, with a scatter plot), we wouldn't be holding constant w and z. With a
regression, when we look at the coefficient β, by definition we are holding constant the other
regressors. Remember that:
β = ∂y/∂x
Suppose a restaurant chain is exploring the relationship between ticket size per customer (y)
and party size (x). The restaurant chain is entertaining the possibility of giving promotions to
increase party size because they believe larger parties spend more per customer. By running a
regression, they can get an estimate of β and test whether it is different from zero. The
regression may hold constant other variables (e.g. the day of the week, the time of the day, or
whether there was a special event like Monday Night Football). They can also test hypotheses
about the dollar value of the increase associated with one additional person in the party. For
instance, they could formally test whether the increase is $5 (i.e. β = 5).
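A sketch of how that test might look in Python is below. The column names (party_size, weekend, ticket_per_customer) and the simulated data are hypothetical; the point is the t_test of the hypothesis β = 5.

    # Sketch of the restaurant example (all data and names are made up)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 1000
    df = pd.DataFrame({
        "party_size": rng.integers(1, 8, size=n),
        "weekend": rng.integers(0, 2, size=n),
    })
    df["ticket_per_customer"] = 30 + 5 * df["party_size"] + 4 * df["weekend"] + rng.normal(0, 10, n)

    res = smf.ols("ticket_per_customer ~ party_size + weekend", data=df).fit()
    print(res.params)                       # estimated coefficients
    print(res.t_test("party_size = 5"))     # test whether the increase per additional person is $5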
The second workhorse of descriptive analysis is conditional averages. Regressions allow us
to calculate averages "adjusting for other factors" or "holding all else constant." To illustrate the
relevance of this, suppose a company is comparing the productivity of managers supervising
different groups of employees (perhaps the comparison will be used to pay bonuses). Let y_ji
represent the performance of employee i who works with manager j. For each manager, we
have a unique group of workers. Let ȳ_j represent the average performance of workers supervised
by manager j. What are some potential issues with simply comparing average worker
performance across managers? In the real world, not all workers are the same. Some are more
motivated or more skillful. A naïve comparison of average performance across managers may
lead to wrong decisions.
Assume an expert tells you that worker performance is affected by work experience. Thus,
it'd be better to think in terms of the model:
y_ji = θ_j + βx_ji + ε_ji
where θ_j is the productivity of manager j, and x_ji is the years of experience of worker i. By
comparing averages without any sort of adjustment, we would be missing the effect of
experience, x_ji, on the observed performance of each manager.
Take managers 1 and 2. If our model above is true, the naïve difference in average
performance is not θ₂ − θ₁. Rather, it is:
ȳ₂ − ȳ₁ = (θ₂ − θ₁) + β(x̄₂ − x̄₁) + (ε̄₂ − ε̄₁)
As you can see, the naïve approach involves differences in manager productivity but also in
worker experience. Unless x̄₂ = x̄₁, we would be omitting important information. If β > 0, then
there would be a bias in favor of the manager with more-experienced workers. Let’s look at a
graphical version of the same example.
The graph below presents a cloud of points denoting the performance of different workers.
The different colors of the points in the cloud denote the different managers supervising each
worker. The blue points correspond to manager 1, and the orange points correspond to manager
2. If we simply computed average worker performance by manager, the average for manager 1
would be lower than the average for manager 2. However, by looking at the experience of all
workers, it is clear that manager 1 supervises workers with less experience than manager 2. The
shape of the cloud also suggests there is a positive relationship between worker performance and
experience. The lines in the graph represent the results of fitting the model y_ji = θ_j + βx_ji + ε_ji,
which has a different intercept for each manager and the same slope for worker experience. The
estimate of the intercept for manager 1 is θ̂₁, and the estimate of the intercept for manager 2 is θ̂₂.
Remember that those estimates can be interpreted as managerial productivity. In the graph, θ̂₁ >
θ̂₂, which means that, holding worker experience constant, manager 1 is more productive than
manager 2. The result is the opposite of the naïve comparison.
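The sketch below mimics that situation with simulated data: manager 1 is more productive but supervises less-experienced workers, so the naïve comparison and the adjusted comparison point in opposite directions. All names and numbers are made up.

    # Sketch: naive versus adjusted comparison of two managers (hypothetical data)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 300
    manager = rng.choice(["1", "2"], size=n)
    # Manager 1 gets less-experienced workers on average.
    experience = rng.normal(3, 1, n) + np.where(manager == "2", 4, 0)
    performance = 10 + np.where(manager == "1", 2, 0) + 1.5 * experience + rng.normal(0, 2, n)
    df = pd.DataFrame({"manager": manager, "experience": experience, "performance": performance})

    print(df.groupby("manager")["performance"].mean())   # naive comparison favors manager 2
    # Adjusted comparison: manager intercepts holding experience constant favor manager 1
    print(smf.ols("performance ~ C(manager) + experience", data=df).fit().params)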
Similar examples are given by performance comparisons in many occupations (doctors with
patients with different challenges, teachers with students of different backgrounds, lawyers with
cases with different difficulties) or in prices of goods with many attributes (insurance premiums
for people with different characteristics, prices of cars or computers with different features, wages
for workers with different sociodemographic traits).
Examples like those above can be grouped into what we call hedonic models. The name
comes from pricing models where "the price of the total is the sum of the prices of the parts,"
even if those parts' prices aren't observed in the market. For instance, think of house prices.
Being close to public transportation or having a backyard are valuable traits and certainly affect
the price of a house. However, you cannot buy those features in a market and add them to your
house. With hedonic models we can estimate the contribution of those traits to the total price as
if those traits could be added.
In sum, never use correlation or naïve comparisons of averages when you can use a
multivariate regression. Multivariate regressions allow you to control or adjust for other factors.
However, when running a regression, you must pay attention to what controls, covariates or
regressors are included in your analysis. It’s possible that sometimes you omit important
explanatory variables. Some other times you may be including too many. We’ll discuss those two
possibilities after we talk about fixed effects.
5.1.1 Fixed effects
In a regression model, fixed effects can be defined as different intercepts for different groups
of points. In our example of managers and workers above, we introduced manager-fixed effects.
All observations associated with one manager would share the same intercept, and those
intercepts could differ across managers. Fixed effects are estimates themselves. They are nothing
but coefficients on dummies.
Fixed effects can be used as controls (their value may be irrelevant to us) or as the subject of
our analysis (their value may be important to us). In the model above, we could be interested in
the relation of experience and worker performance. If we didn’t include manager fixed effects,
our fitted line would understate the actual relation. In that case, manager-fixed effects aren’t
interesting per se. We just use them to get the right estimate of a different parameter (β). If instead,
we are interested in measuring the difference managers make in worker performance, the
manager-fixed effects would be the most important result of the analysis.
To estimate fixed effects, our cloud of points must include several observations associated with
the same unit. For instance, to estimate manager-fixed effects, we need multiple workers
associated with each manager. We also need to know the identity of their managers—otherwise
we cannot group observations by manager.
In the example above, each resulting coefficient θ̂_j is interpreted as the "manager effect."
Depending on our subject of analysis, there may also be a "location effect," "holiday effect,"
"rush-hour effect," and a long et cetera (notice that, for brevity, we omitted the word "fixed").
In terms of notation, fixed effects can be written very concisely. Imagine that we have a day-of-the-week
fixed effect. We can denote it by η_d (the Greek letter eta with a subscript indicating
the day):
y_i = η_d + βx_i + ε_i
In that case, the subscript d would take seven possible values, from Sunday through Saturday.
Compare that to the equivalent dummy approach, where we would have one coefficient and one
dummy for each day of the week:
๐‘ฆ๐‘– = ๐œ‚๐‘†๐‘ข ๐‘‘๐‘–๐‘†๐‘ข + ๐œ‚๐‘€ ๐‘‘๐‘–๐‘€ + ๐œ‚ ๐‘‡๐‘ข ๐‘‘๐‘–๐‘‡๐‘ข + ๐œ‚๐‘Š ๐‘‘๐‘–๐‘Š + ๐œ‚ ๐‘‡โ„Ž ๐‘‘๐‘–๐‘‡โ„Ž + ๐œ‚๐น ๐‘‘๐‘–๐น + ๐œ‚๐‘†๐‘Ž ๐‘‘๐‘–๐‘†๐‘Ž + ๐›ฝ๐‘ฅ๐‘– + ๐œ€๐‘–
Clearly, it’s better to use the fixed effects notation rather than the dummy one, particularly
when we have large numbers of fixed effects.
Lastly, we can have two or more fixed effects in the same model. For instance, in the same
regression we can have fixed effects for day of the week and fixed effects for the hour of the day
(say, morning, midday, afternoon and evening). The model could be written as:
๐‘ฆ๐‘– = ๐œ‚๐‘‘ + ๐œƒโ„Ž + ๐›ฝ๐‘ฅ๐‘– + ๐œ€๐‘–
Fixed effects are very useful and not very well understood by many practitioners.
Paradoxically, they are incredibly easy to work with in practice. Also, they seem to create
information out of thin air. After all, without any direct information about managers, we are able
to measure (otherwise unobserved) differences in productivity. The intuition for this is that we
get indirect information through the multiple workers supervised by each manager.
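As a sketch of the notation in practice, the snippet below fits a model with day-of-the-week and hour-of-the-day fixed effects plus one continuous regressor. The data and variable names are hypothetical, and C( ) is just one common way software creates the underlying dummies.

    # Sketch: a regression with two sets of fixed effects (hypothetical data and names)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n = 500
    df = pd.DataFrame({
        "day": rng.choice(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"], size=n),
        "hour": rng.choice(["morning", "midday", "afternoon", "evening"], size=n),
        "x": rng.normal(size=n),
    })
    df["y"] = 2 * df["x"] + rng.normal(size=n)

    # One intercept shift per day and per hour block, plus a common slope on x
    res = smf.ols("y ~ C(day) + C(hour) + x", data=df).fit()
    print(res.params)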
5.1.2 Omitted variables
When a variable belongs in a model and we omit it, we create a bias. To show this, let's start by
assuming the correct model (without omissions) and contrast its results with what we get with
the omission. Suppose the correct model is y_i = α + βx_i + ε_i. Our estimate of β is:
β̂ = cov(x, y) / var(x)
If our model is correct, we can substitute y with α + βx + ε in the formula above. After some
algebra, and using the properties of covariance, we get that the expected value of our estimate is
the parameter:²
E[β̂] = cov(x, α + βx + ε) / var(x) = [cov(x, α) + cov(x, βx) + cov(x, ε)] / var(x) = cov(x, βx) / var(x) = β cov(x, x) / var(x) = β
² This equality holds with expected values. As you know, when we talk about a particular sample, the
estimate will likely not be equal to the parameter.
In this case, we say that the estimate is unbiased. The expectation of β̂ is β. In reality, we
cannot be certain about what the correct model is. But contemplating the possibility of omitting
relevant variables is important.
Let's go back to our example of party size and average ticket in a restaurant. What can be
missing from the analysis? We can think of many determinants of ticket size per customer besides
party size. An obvious one is socioeconomic status—there can be many others. Let's think of the
model y_i = α + βx_i + γz_i + ε_i, where y_i is the ticket size per customer of party i, x_i is party i's
size, and z_i is the socioeconomic status of the person paying the check (perhaps measured by the
type of payment). How is party size related to expenditure per customer?
Holding all else constant, ∂y/∂x = β. If we could run the regression y_i = α + βx_i + γz_i + ε_i,
we would obtain estimates for the three parameters. However, when we run a regression of y on
x alone (omitting z), what do we get? Let's look at the expected value of our estimate of β:
E[β̂] = cov(x, α + βx + γz + ε) / var(x) = β + γ cov(x, z) / var(x)
The term γ × cov(x, z)/var(x) is the omitted-variable bias. In words, by omitting z from the
regression, our estimate of β is biased. What can we say about the sign and magnitude of the
omitted-variable bias? The bias depends on: (i) the coefficient on the omitted variable, in this case
γ, and (ii) the covariance between included and omitted regressors, in this case x and z. The
following table goes over all the possibilities.

γ            cov(x, z)    Omitted-variable bias is…
0            Any value    Zero
Any value    0            Zero
> 0          > 0          Positive
< 0          > 0          Negative
> 0          < 0          Negative
< 0          < 0          Positive
What does the table imply for the analysis of ticket size? The analysis is omitting
socioeconomic status. The sign of the bias depends on whether socioeconomic status increases or
decreases average ticket size and on how it relates to party size. To develop your intuition, go over
several possibilities.
Economists always think of potential omitted-variable bias when they look at correlations or
regression coefficients. How does omitted-variable bias look in practice? Sometimes we have
other potential regressors. We simply include them and see what happens. If there is no omitted-variable
bias, adding regressors doesn't change our estimates in a meaningful way. If there is
omitted-variable bias, adding regressors changes our estimates. When we don't have other
regressors, economic theory may be informative about the sign or even the magnitude of the
omitted-variable bias.
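A quick simulation makes the bias visible. In the sketch below (all numbers made up), z is the omitted variable; the short regression that leaves it out recovers a slope far from the true value, while the long regression does not.

    # Sketch: omitted-variable bias in a simulation (hypothetical data)
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 5000
    z = rng.normal(size=n)               # omitted variable (e.g. socioeconomic status)
    x = 0.8 * z + rng.normal(size=n)     # included regressor, correlated with z
    y = 1 + 2 * x + 3 * z + rng.normal(size=n)

    short = sm.OLS(y, sm.add_constant(x)).fit()                         # omits z
    long = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()    # includes z
    print(short.params[1])   # biased upward: roughly 2 + 3 * cov(x, z) / var(x)
    print(long.params[1])    # close to the true slope of 2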
Let's revisit the manager productivity example above. Imagine we are interested in the
relation between worker performance and experience. If we use a model without manager-fixed
effects (i.e. with one intercept), the slope we would get would be smaller than if we use a model
with manager-fixed effects (i.e. with multiple intercepts, one for each manager). The omission of
manager dummies as explanatory variables biases downward the estimate of the relation between
worker performance and experience.
5.1.3 Redundant variables
In a regression, what happens if two regressors are very much measuring the same thing? This is
called collinearity. If collinearity is perfect, then two or more regressors have an exact linear
relationship. In the case of two regressors, we can represent perfect collinearity with the equality:
x_2i = δ₀ + δ₁x_1i
With perfect collinearity, we cannot include both x₁ and x₂ as regressors in our regression. Notice
what would happen if we did:
y_i = β₀ + β₁x_1i + β₂x_2i + ε_i
    = β₀ + β₁x_1i + β₂(δ₀ + δ₁x_1i) + ε_i
    = (β₀ + β₂δ₀) + (β₁ + β₂δ₁)x_1i + ε_i
    = γ₀ + γ₁x_1i + ε_i
This is equivalent to dropping x₂ from the regression (we could have dropped x₁ and kept only
x₂ instead). In fact, statistical software automatically does it for us. The question remains: can we
recover estimates of β₀ and β₁ from estimates of γ₀ and γ₁? The answer is no. To see why,
let's consider an example. Think of x₁ as temperature in degrees Celsius (°C) and x₂ as
temperature in degrees Fahrenheit (°F). Notice that there is perfect collinearity, since °F = 32 +
1.8°C. We can estimate the effect of temperature measured in degrees Celsius or Fahrenheit, but
we cannot estimate the effect of one holding the other constant—it'd be meaningless.
What about cases in which collinearity isn't perfect but high? Imagine that the correlation is
greater than 0.8. In this case, the coefficients on the collinear regressors "dilute." Jointly, they may
be significant, but each of them (or at least some of them) may not be. Think of income and wealth,
or grit and conscientiousness. To some extent, they measure the same thing. Coefficients become hard
to interpret as partial derivatives. If we include variables that measure similar things, we should
be explicit about this issue. Whenever possible, the inclusion of regressors must be informed by
theory. We should ask ourselves the question: do we truly believe these regressors belong in the
regression?
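The dilution can be seen in a small simulation like the one below (hypothetical data): when an almost identical copy of a regressor is added, the standard errors on the collinear pair blow up even though the overall fit is fine.

    # Sketch: near-perfect collinearity inflating standard errors (hypothetical data)
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)   # almost a copy of x1
    y = 1 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

    one = sm.OLS(y, sm.add_constant(x1)).fit()
    both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print(one.bse)    # standard errors with a single regressor
    print(both.bse)   # much larger standard errors on the collinear pair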
5.1.4 Dummies and redundancy
In many instances, dummy variables result in redundancy. In other words, they show perfect
collinearity. That’s not necessarily a problem. Consider the following example. In a questionnaire,
you are asked to mark yes or no:
                          Yes    No
Female                    [ ]    [ ]
Minority                  [ ]    [ ]
College degree            [ ]    [ ]
Over 65 years of age      [ ]    [ ]
The sum of these four dummies ranges between 0 and 4. Like switches, they can be turned on
(1) or off (0) independently of each other. We can imagine people in each of the 16 possible
combinations. These dummies are independent.
Now consider a questionnaire that includes the following questions:
                          Yes    No
Female                    [ ]    [ ]
Male                      [ ]    [ ]

                          Yes    No
White                     [ ]    [ ]
Black                     [ ]    [ ]
Hispanic                  [ ]    [ ]
Other                     [ ]    [ ]

                          Yes    No
High school or less       [ ]    [ ]
Some college              [ ]    [ ]
College degree            [ ]    [ ]

                          Yes    No
0 to 30 years of age      [ ]    [ ]
31 to 65 years of age     [ ]    [ ]
Over 65 years of age      [ ]    [ ]
Notice that you can only mark female or male, and therefore the sum of the first two dummies
is always equal to one. The sum of the next four dummies for race/ethnicity is always equal to
one. The same can be said about the dummies for educational attainment and age. That’s because
within each group those dummies are associated with mutually exclusive categories. Hence, they
are not independent. If you are in one category, you must not be in another. They are dependent.
Other categories may be nested. The following question asks you for your place of birth.
Whenever the dummy for Chicago is 1, the dummies for Illinois and U.S. are also 1, and the
dummies for Outside of Chicago, Outside of Illinois and Outside of the U.S. are 0. Clearly, those
dummies are not independent. They aren’t perfectly collinear either—their sum isn’t always the
same. However, there is redundant information. If you were born in Chicago, then you weren’t
born outside of Chicago, Illinois or the U.S. Within subsets, some dummies are perfectly collinear.
                          Yes    No
U.S.                      [ ]    [ ]
Outside of the U.S.       [ ]    [ ]
Illinois                  [ ]    [ ]
Outside of Illinois       [ ]    [ ]
Chicago                   [ ]    [ ]
Outside of Chicago        [ ]    [ ]
Consider the alternative version of the same question about your place of birth:
                          Yes    No
U.S.                      [ ]    [ ]
Illinois                  [ ]    [ ]
Chicago                   [ ]    [ ]
It should be clear that the information elicited is exactly the same. The second version of the
question eliminated all redundancy, but the dummies remain dependent. That’s a result of them
being nested. Chicago is in Illinois, and Illinois is in the U.S.
Regardless of whether we have dependent or independent dummies, we must pay attention
to perfect collinearity. Remember that when there is an intercept in our model, there is a regressor
x₀ equal to 1. Assume x₁ and x₂ are perfectly collinear dummies. Since x_1i + x_2i = 1, we have
that x_1i + x_2i = x_0i. Thus, we cannot estimate the model:
y_i = β₀x_0i + β₁x_1i + β₂x_2i + ⋯ + β_k x_ki + ε_i
We must drop either x₀, x₁ or x₂ from our regression. If we drop x₁, our model becomes:
y_i = γ₀ + γ₂x_2i + ε_i
Alternatively, by dropping the constant (x₀), our model becomes:
y_i = δ₁x_1i + δ₂x_2i + ε_i
But, since x_1i = 1 − x_2i, we can write this as:
y_i = δ₁(1 − x_2i) + δ₂x_2i + ε_i = δ₁ + (δ₂ − δ₁)x_2i + ε_i
The models above are equivalent. If we look at the conditional expected value of the dependent
variable, we get:
E[y | x₁ = 1, x₂ = 0] = γ₀ = δ₁
E[y | x₁ = 0, x₂ = 1] = γ₀ + γ₂ = δ₂
When we have perfectly collinear dummies, there are multiple equivalent ways to formulate
our model. The results are exactly the same, but they are stated differently. Think of the wage
gender gap. Assume our dependent variable is wage. We could include as regressors a constant
and a dummy for female. The intercept would tell us the average wage among males, and the
coefficient on the dummy would tell us the female minus male gender gap. Alternatively, we could
substitute the dummy for female with a dummy for male. In this case, the intercept would tell us
the average wage among females, and the coefficient on the dummy would tell us the male minus
female gender gap. Lastly, we could exclude the constant and include a dummy for male and a
dummy for female. Now the coefficients on the dummies would tell us the average wages for
males and females, respectively. The difference would be the gender gap. The information these
three models provide is exactly the same, just arranged differently.
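The three parameterizations can be verified with a short sketch like the following; the wage data and the female/male dummies are simulated and purely illustrative.

    # Sketch: three equivalent wage-gap specifications (hypothetical data and names)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(6)
    n = 400
    female = rng.integers(0, 2, size=n)
    wage = 20 + 3 * (1 - female) + rng.normal(0, 2, n)   # males average 23, females average 20
    df = pd.DataFrame({"wage": wage, "female": female, "male": 1 - female})

    print(smf.ols("wage ~ female", data=df).fit().params)            # intercept = male mean, slope = female-minus-male gap
    print(smf.ols("wage ~ male", data=df).fit().params)              # intercept = female mean, slope = male-minus-female gap
    print(smf.ols("wage ~ 0 + female + male", data=df).fit().params) # no constant: the two group means directly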
5.1.5 Measurement error
Frequently, our measurements aren’t accurate. In other words, we have measurement error.
Not every kind of measurement error is equally interesting or relevant. If the error is always equal
to a constant positive number, then it means we always overstate the true value of our variable
by that number. If the constant number is negative, then it means we understate the true value.
Those cases aren’t very interesting because measurement error simply shifts the cloud of points
up, down or sideways but it doesn’t affect its shape. The most interesting kind of measurement
error is the one that doesn’t systematically inflate or deflate our measurements, but it makes them
noisy. Sometimes it’s called classical measurement error. Height, income, IQ are examples of
variables that can have this type of noise. What are the effects of measurement error on our
estimates? Put simply, it depends. The most important lesson we’ll learn is that measurement
error in the regressors causes attenuation bias, which means that the regression coefficients are
biased towards zero.
Graphically, measurement error in the explanatory variable stretches horizontally our cloud of
points. The figure below shows a very simple example. Imagine we start with the cloud of points
given by the solid points. The regression in the absence of measurement error is denoted by
the solid line. With measurement error in x, the cloud would look like the hollow points. Given
the height in the cloud, some hollow points would be shifted to the right of the solid points while
others would be shifted to the left, but their average location would be given by the solid points.
The dashed line denotes the regression in presence of measurement error. Horizontally stretching
the cloud flattens the slope of the regression.
To show this algebraically, assume y_i = α + βx_i + ε_i. Instead of x_i, we observe x_i* = x_i + u_i,
where cov(x, u) = 0. This zero covariance means that measurement error (denoted by u) isn't
associated with x in any systematic way. The expected value of our estimate of β is:
E[β̂] = cov(x*, y) / var(x*) = cov(x + u, y) / var(x + u) = β ( var(x) / (var(x) + var(u)) )
Notice that the term in parentheses is always in the interval (0,1) because all terms inside are
positive. That means with measurement error we expect β̂ to be somewhere between 0 and β (i.e.
closer to zero). It's important to note that the sign of the attenuation bias depends on the sign of
the coefficient: it's negative when the parameter is positive and vice versa. The t-statistic of our
estimate is also biased towards zero, which means we would be less likely to find
significance.
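A simulation sketch of attenuation bias is below (all numbers made up): the slope estimated with the noisy regressor shrinks toward zero by roughly the factor var(x)/(var(x) + var(u)).

    # Sketch: attenuation bias from classical measurement error in the regressor (hypothetical data)
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    n = 10000
    x = rng.normal(size=n)
    y = 1 + 2 * x + rng.normal(size=n)
    x_star = x + rng.normal(scale=1.0, size=n)   # observed regressor = true x plus noise u

    clean = sm.OLS(y, sm.add_constant(x)).fit()
    noisy = sm.OLS(y, sm.add_constant(x_star)).fit()
    print(clean.params[1])   # close to the true slope of 2
    print(noisy.params[1])   # close to 2 * var(x) / (var(x) + var(u)) = 1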
A different situation is when we have measurement error in the dependent variable. In this
case there is no bias, but we experience a loss of precision. Our cloud is vertically stretched, which
results in larger standard errors. The figure below provides a simple example. The solid points
show the situation without measurement error. The line represents the regression line in that
case. The hollow points constitute the cloud with measurement error in y. Some are shifted up
and some are shifted down relative to where they should be. Their average vertical position is
unaltered. Thus, the regression line is the same as without measurement error. However, it
should be borne in mind that, if we took many samples, we could now get different slopes, and
therefore the standard error is larger.
The fact that measurement error in the explanatory variable attenuates the estimates is
important. It means that, in the absence of measurement error, the estimates would have a greater
magnitude and smaller p-values. In other words, if you get significant coefficients in a regression
and someone has a hard time believing your results, arguing that there is measurement error
(perhaps not with those words), then you can reply that, if there is measurement error, getting rid
of it would only make your coefficients larger and more significant.
5.2 Predictive use
Before we get started, let's make a distinction between forecast and prediction. When we
forecast, we determine what the future will bring, conditional only on time passing. When we
make a prediction, we come up with what we expect y to be, assuming we know x₁, x₂, …, x_k.
Most businesses rarely make forecasts. They routinely make predictions—some good and some
bad. Let's imagine attendance at a chain of gyms can be modeled as:
y_i = β₀ + β₁x_1i + β₂x_2i + ⋯ + β_k x_ki + ε_i
Assume the dependent variable is defined as the number of days attended over the course of the 12
months after joining the gym. Assume the regressors x₁, x₂, …, x_k are observed or reported at the
moment of signing up—i.e. before attendance occurs. They include age, body mass index, gender,
marital status, educational attainment, etc. We can estimate β₁, β₂, …, β_k using all members
who signed up in January 2019. We would have 12 months of data for each of them (up to
December 2019).
Suppose a new member j signs up. We observe x_1j, x_2j, …, x_kj for her. Our prediction of
attendance given her age, body mass index, gender, and so on, is:
ŷ_j = β̂₀ + β̂₁x_1j + β̂₂x_2j + ⋯ + β̂_k x_kj
In other words, our predicted value or prediction is a fitted value for some values of the regressors.
When we try to predict, we pay little or no attention to each regression coefficient or to their
significance. We only care about the fit. How can we know if our predictions are good? A higher
R-square (our measure of goodness of fit) means a better prediction. Also, a narrower confidence
interval around the prediction means more accuracy. The graphs below show examples with
different R-squares and different confidence intervals. By definition, greater residuals mean
worse predictions. The R-square captures those larger residuals.
In practice, we usually start with two data sets. The first of them is retrospective. It consists
of the x₁, x₂, …, x_k observed before the fact we care about took place (e.g. gym member
characteristics at sign-up in January 2019, before attendance occurs), and the y observed after that
fact (e.g. gym attendance over the course of 2019). The second data set is prospective. We only
observe the x₁, x₂, …, x_k before the fact (e.g. characteristics of gym members signing up in January
2020, for whom we haven't observed attendance because it hasn't occurred yet). We pool the
two data sets together. Notice that the dependent variable y is missing for the prospective
observations. Then we run our regression, which will only include the retrospective data. Lastly, we
use the fitted model (i.e. β̂₁, β̂₂, …, β̂_k) to compute the gym attendance we expect for each new
member, given her values of x₁, x₂, …, x_k. In practice, this is very easy—it can be done in three
lines of code. The important part is understanding the logic.
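Here is one way that logic might look in Python, with hypothetical variable names and simulated data standing in for the gym example; the regression is fit on the retrospective data and the fitted model is then applied to the prospective rows.

    # Sketch: fit on retrospective data, predict for prospective members (hypothetical data)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(8)
    retro = pd.DataFrame({"age": rng.normal(35, 10, 500), "bmi": rng.normal(26, 4, 500)})
    retro["attendance"] = 80 - 0.5 * retro["age"] - 1.0 * retro["bmi"] + rng.normal(0, 10, 500)
    prosp = pd.DataFrame({"age": rng.normal(35, 10, 100), "bmi": rng.normal(26, 4, 100)})  # attendance not observed yet

    fit = smf.ols("attendance ~ age + bmi", data=retro).fit()   # estimate on retrospective data only
    prosp["predicted_attendance"] = fit.predict(prosp)          # fitted values for the new members
    print(prosp["predicted_attendance"].mean())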
We can compute statistics using the predicted values for the prospective observations (mean,
variance, proportion greater than a threshold, etc.). Some examples where regressions are used
to predict important variables are: consumer lifetime contribution of newly acquired customers,
performance of potential new hires, credit scores, admissions, fraud detection, and consumer
behavior in platforms like Netflix, Amazon or Spotify. Can you imagine how?
When we are predicting, it's important to distinguish between two types of predictions. One
of them is within-sample predictions, which is when the values of the prospective x's fall inside
the range of the retrospective x's. The other type is out-of-sample predictions, which is when the
values of the prospective x's fall outside of the range of the retrospective x's. There isn't much cause
for concern when we make within-sample predictions. However, when we make out-of-sample
predictions, our model could be flat-out incorrect. To see it, imagine extrapolating any behavior
(drinking, dating, working) based on customer age when your retrospective data only includes
people between the ages of 15 and 25. What would happen if you tried to predict the same behavior
for five-year-olds? How about sixty-year-olds? Intuitively, predictions for prospective x's closer
to the average value of the retrospective x's are more accurate—they have narrower confidence
intervals. Keep in mind that confidence intervals look like bow ties. Can you say why?
We can use decoys to verify that our predictions make sense. For instance, we can use one
subset of the retrospective data to predict another subset of the same data. This type of exercise
is what is used in machine learning and artificial intelligence. The idea behind the notion of
"training an algorithm" is simply finding the β̂'s that produce predictions with higher R-squares
for whatever it is that we care about. For instance, think of speech or face recognition.
To construct the confidence interval for our predictions, we think of y as a parameter given our
β's and the x's:
y = β₀ + β₁x₁ + ⋯ + β_k x_k
What would be our estimate of the "parameter" y? The fitted value ŷ = β̂₀ + β̂₁x₁ + ⋯ + β̂_k x_k. In
this context, x₁, x₂, …, x_k are fixed numbers. Because of sampling, we think of β̂₁, β̂₂, …, β̂_k as
random variables. Thus, ŷ is also a random variable with a distribution centered at y and some
standard error derived from the standard errors of the β̂'s. For each possible x₁, x₂, …, x_k, statistical
software can give us ŷ and its standard error. To build the confidence interval we use both. For
instance, assume we choose a confidence level of 95%. Given some values of x₁, x₂, …, x_k, the 95%
confidence interval of our prediction is defined as:
95% C.I. = (ŷ − 1.96 × S.E., ŷ + 1.96 × S.E.)
We can look at a graphical example in two dimensions. Given a value of x, we build the
confidence interval of our prediction using the prediction itself (ŷ) plus/minus the critical value
corresponding to the confidence level we want (the value 1.96 corresponds to 95% confidence)
multiplied by the standard error of ŷ. By construction, the confidence interval is centered at the
prediction.
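A sketch of how software typically returns that interval is below; the data are simulated and the get_prediction call is just one convenient implementation.

    # Sketch: confidence interval around a prediction (hypothetical data)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(9)
    df = pd.DataFrame({"x": rng.normal(size=300)})
    df["y"] = 1 + 2 * df["x"] + rng.normal(size=300)
    fit = smf.ols("y ~ x", data=df).fit()

    new_x = pd.DataFrame({"x": [0.5]})
    pred = fit.get_prediction(new_x)
    print(pred.predicted_mean)           # y-hat at x = 0.5
    print(pred.conf_int(alpha=0.05))     # 95% CI: roughly y-hat plus/minus 1.96 times S.E.(y-hat)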
For a given x, what is the interpretation of the 95% confidence interval around ŷ pictured
above? It's analogous to what we discussed before for the β̂'s. Make sure you can explain this
with your own words.
5.3 Prescriptive use
Whether in business, government or the not-for-profit sector, the ultimate goal of empirical
analysis is to produce recommendations. Based on evidence, we want to know what should be done
to improve the bottom line of a company or the results of a policy or program. Although the
descriptive and predictive uses of regression may shed some light on what could make sense to
do, they do not offer solid advice. For instance, think about the prices charged by a company. Are
the prices too high or too low relative to the profit maximizing level? You can imagine arguments
in favor of price increases, as well as arguments in favor of lower prices. In theory, it is unclear
whether the price is too high, too low or just right. In order to be able to make a recommendation
we need evidence. How would you determine whether a price increase would result in higher or
lower profits? This type of problem goes well beyond pricing decisions. It involves pretty much
every decision.
Imagine a retail company has 1,000 stores and 500 are upgraded with the intention of
improving customer satisfaction and boosting sales. The quarterly report is just in. Average sales
in stores without upgrade are 600 (to make the example more appealing, you can imagine sales
are expressed in thousands of dollars). Average sales in stores with upgrade are 550. Did the
upgrade cause a decrease in sales? What should the board of the company do? Expand the (costly)
upgrades to all remaining branches? There are several ways to answer this sort of questions.
That’s what we will learn in this section. Before discussing the empirical methods, we must
introduce a few concepts.
5.3.1 Causality and the Rubin model
We will start with the so-called Rubin Model. Let's focus on the store-upgrade example. Take
the case of store i. Without the upgrade, sales at that store would have been Y_0i. With the
upgrade, sales would have been Y_1i. The causal effect of the upgrade is the difference in sales
across the two situations, which is given by Y_1i − Y_0i. Notice that the causal effect is defined for
each store i. We are interested in the average causal effect across stores, which we call the Average
Treatment Effect or ATE, and it is defined in terms of expected values:
E[Y_1i − Y_0i] = E[Y_1i] − E[Y_0i]
However, for each store we only observe one situation (either it was upgraded or it wasn't),
not both. Let D_i denote a dummy indicating whether a store was upgraded. Thus, D_i = 0 means
the store wasn't upgraded and D_i = 1 means the store was upgraded. We only observe:
Y_i = Y_1i D_i + Y_0i (1 − D_i)
In words, for stores with D_i = 1 we observe Y_1i, and for stores with D_i = 0 we observe Y_0i. We
know the facts but not the counterfactuals. A counterfactual is what would have happened in an
alternative reality—e.g. think of where you would be now had you not enrolled at this university.
In the table below we explain the difference between what we observe and what we don't observe.
Stores are divided based on whether they were upgraded or not. The first column
corresponds to stores that weren't upgraded (D_i = 0). The second column corresponds to stores
that were upgraded (D_i = 1). The third column pools all stores, regardless of upgrade
status. We divide the world into two alternative realities for each store. The first row corresponds
to a reality without the upgrade (Y_0i), and the second row corresponds to a reality with the upgrade (Y_1i).
It should be clear that we only observe information for two cells of the table. We know the sales
without the upgrade for stores that weren't upgraded (600). We also know the sales with the upgrade for
stores that were upgraded (550). We don't observe the counterfactuals, that is, sales with the upgrade
for stores that weren't upgraded, and sales without the upgrade for stores that were upgraded.
Naturally, we don't know the average across all stores for each row. We don't know the difference
across alternative realities for each group of stores either. So, there is a lot we don't know.
๐ท๐‘– = 0
๐ท๐‘– = 1
All
600
?
?
Average sales with upgrade
?
550
?
Difference made by upgrade
?
?
?
Average sales without upgrade
However, at least conceptually, we can fill in the table with the correct notions even if we don't
observe them. That's what the formulas in the table below represent:

                                 D_i = 0                     D_i = 1                     All
Average sales without upgrade    E[Y_0i | D_i = 0]           E[Y_0i | D_i = 1]           E[Y_0i]
Average sales with upgrade       E[Y_1i | D_i = 0]           E[Y_1i | D_i = 1]           E[Y_1i]
Difference made by upgrade       E[Y_1i − Y_0i | D_i = 0]    E[Y_1i − Y_0i | D_i = 1]    E[Y_1i − Y_0i]
We can adopt some useful definitions for the formulas in the last row. Those formulas provide
causal effects. The Average Treatment on the Untreated (ATU) is the average difference the
upgrade would make among stores that weren't upgraded (i.e. the causal effect among untreated
stores):
ATU = E[Y_1i − Y_0i | D_i = 0]
The Average Treatment on the Treated (ATT) is the average difference the upgrade would
make among stores that were upgraded (i.e. the causal effect among treated stores):
ATT = E[Y_1i − Y_0i | D_i = 1]
Lastly, as we saw before, the ATE is the average difference the upgrade would make among
all stores (i.e. the causal effect among the whole group of stores):
ATE = E[Y_1i − Y_0i]
We can also express the ATE as the (weighted) average of the ATU and ATT. If we have the
same number of stores upgraded (500) and not upgraded (500), then ATE = (500/1000) ATU + (500/1000) ATT.
Going back to our problem, what do you think is more relevant to know for the company when
assessing the upgrades? The ATE, the ATT or the ATU? What is the economic relevance of each
of them? As an exercise, imagine situations in which each of them may matter most.
In a naïve comparison (i.e. a simple difference of observed average sales across the two groups
of stores) we get:
E[Y_1i | D_i = 1] − E[Y_0i | D_i = 0]
How different is that naïve comparison from the ATE, ATU or ATT? To find out, let's add and
subtract the counterfactual E[Y_0i | D_i = 1] (which is the average sales without the upgrade among
stores that were upgraded):
Naïve comparison = E[Y_1i | D_i = 1] − E[Y_0i | D_i = 0]
                 = E[Y_1i | D_i = 1] − E[Y_0i | D_i = 1] + E[Y_0i | D_i = 1] − E[Y_0i | D_i = 0]
                 = ATT + E[Y_0i | D_i = 1] − E[Y_0i | D_i = 0]
                 = ATT + Selection Bias in Y_0
The selection bias in Y_0 is defined as E[Y_0i | D_i = 1] − E[Y_0i | D_i = 0]. How would you explain it in
your own words? Please try until you come up with a simple version.
Instead, we could add and subtract E[Y_1i | D_i = 0] (i.e., the average sales with the upgrade among
stores that weren't upgraded):
Naïve comparison = E[Y_1i | D_i = 1] − E[Y_0i | D_i = 0]
                 = E[Y_1i | D_i = 1] − E[Y_1i | D_i = 0] + E[Y_1i | D_i = 0] − E[Y_0i | D_i = 0]
                 = E[Y_1i | D_i = 1] − E[Y_1i | D_i = 0] + ATU
                 = Selection Bias in Y_1 + ATU
The selection bias in Y_1 is defined as E[Y_1i | D_i = 1] − E[Y_1i | D_i = 0]. Explain in your own words
how it may be different from the selection bias in Y_0. It should be clear that the sign and the
magnitude of the selection biases depend on the determinants of the upgrade status. What
possible stories can you come up with for the biases to be positive or negative?
Notice that, if there are no selection biases, then the ATU and the ATT are equal to the naïve
comparison, and therefore the ATE is also equal to the naïve comparison. Make sure you can
express this powerful idea in terms of the formulas above.
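A toy numeric sketch may help fix ideas. The potential outcomes below are made up, and in real data we would never observe both columns for the same store; the point is that the naïve comparison reproduces ATT plus the selection bias in Y_0.

    # Toy sketch of the Rubin model (hypothetical potential outcomes for six stores)
    import numpy as np

    y0 = np.array([600.0, 610, 590, 700, 720, 710])   # sales without upgrade
    y1 = np.array([640.0, 650, 630, 745, 765, 755])   # sales with upgrade
    d = np.array([0, 0, 0, 1, 1, 1])                  # 1 = store was actually upgraded

    ate = (y1 - y0).mean()
    att = (y1 - y0)[d == 1].mean()
    atu = (y1 - y0)[d == 0].mean()
    naive = y1[d == 1].mean() - y0[d == 0].mean()
    selection_bias_y0 = y0[d == 1].mean() - y0[d == 0].mean()
    print(ate, att, atu)
    print(naive, att + selection_bias_y0)   # the naive comparison equals ATT + selection bias in Y_0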
There are several lessons stemming from the Rubin model. First, if used to infer causal effects,
naïve comparisons can be misleading when there are selection biases. Second, without
counterfactuals, we cannot know the ATE, the ATT or the ATU. Third, since people act with
purpose, selection is ubiquitous and, in general, correlation isn’t indicative of causation. This
framework distinguishes economists from most other professionals—think about news reports
on drinking wine as a healthy habit or the effects of doing yoga on productivity at work.
Keep in mind that selection may occur in unobservable traits, such as motivation, perceptions,
grit, opinions, etc. As a general rule, we cannot be sure we control for selection simply by adding
more regressors in our analysis.
Let’s look at an example about educational attainment and earnings. In this case, the
treatment is “getting a college degree.” When we compare earnings of college graduates versus
earnings of people without a college degree (i.e. non-college graduates), we only observe two of the
figures in the table below: the 45,000 earned by non-college graduates without a degree and the
99,000 earned by college graduates with a degree. We don't know the counterfactuals, e.g. how much
non-college graduates would earn had they gotten a college degree. Obviously, there may be selection
into college attendance. Imagine we were given believable estimates of the counterfactuals (the
remaining figures in the table). Based on those estimates, we can compute the causal effect of
attending college on earnings.
Given the figures in the table, what could you say about selection biases? Can you compute them?
You surely can. Selection bias in Y_0 is 30,000, whereas selection bias in Y_1 is 39,000. Make sure you
know how to interpret the entire table.
Annual earnings at age 40 …        Non-college graduates (2/3)    College graduates (1/3)    All
Without a college degree           45,000                         75,000                     55,000
With a college degree              60,000                         99,000                     73,000
Difference (with minus without)    15,000                         24,000                     18,000
We don’t have to restrict ourselves to binary comparisons. Let’s look at another example with
three alternatives. In this case, we compare earnings of college graduates who attended different
higher education institutions. One could argue those institutions may have different effects on
the earnings of their graduates. However, students purposely seek admission only to some
institutions, and admissions officers purposely reject some applicants and admit others. Based on
those purposive behaviors, we expect some selection biases. Instead of presenting amounts of
money, the table below simply presents placeholders. The diagonal ones (A, E and I) are observed.
The rest aren't. A comparison of I vs E, or I vs A, may not be informative of the difference
it makes to attend one university instead of the other. Appropriate comparisons require estimating
the counterfactuals (the off-diagonal letters).
                              Institution attended
Earnings if attended…         Chicago State    U. of I. at Chicago    U. of Chicago
Chicago State                 A                B                      C
U. of Illinois at Chicago     D                E                      F
U. of Chicago                 G                H                      I
Before we proceed, let's think of a last example, also related to educational attainment. In this
case, we look at years of schooling, which we can think of as a continuous variable. People who
attain more years of schooling on average have higher earnings. The graph below shows an
example of a cloud of points representing observed annual earnings for people with different
levels of schooling. We could run a regression of earnings (y) on years of schooling (x) in a model
such as y_i = α + βx_i + ε_i. Our regression would produce a slope of $12,000. We may be tempted
to conclude that, on average, attending college increases annual earnings by $48,000 (4 × $12,000).
In light of what we've discussed, do you agree with that conclusion? Is $12,000 a valid estimate
of the causal effect of a year of schooling?
Our estimate β̂ is naïve. In general, it shouldn't be interpreted as the causal effect of years of
schooling on earnings because there may be selection into the years of schooling attained. Perhaps
people who attain more years of schooling would make more money than those who attained
fewer years of schooling even if everyone had the same educational attainment. Think of the
appropriate counterfactuals. The graph below shows two examples. The red triangles show
counterfactual earnings across different levels of schooling for people who in fact attained twelve
years of schooling. Obviously, factual and counterfactual earnings coincide at twelve years of
schooling. The green diamonds show counterfactual earnings for people who attained sixteen
years of schooling. In this case, factual and counterfactual earnings coincide for sixteen years of
schooling. In this example, given equal schooling, people who attained sixteen years of schooling
(green markers) on average would make more money than people who attained twelve years (red
markers). That means there is a positive selection bias. Those who would earn more ceteris
paribus are also the ones who end up with more schooling.
If we could observe the counterfactuals in the graph above, we would run a regression and
adequately estimate the causal effect of one year of schooling on earnings. In that example, the
slope would be $6,000, which is half the naïve estimate (the other half is the selection bias). This
is just one example in which I assumed positive selection. As you can imagine, there are many
theoretical possibilities for the selection biases. Try to go over a few examples with different signs.
The example above illustrates that, as a general rule, we shouldn’t interpret regression
coefficients as estimates of causal effects. If we want to do that, we need good reasons why we
should believe there are no selection biases.
The group of techniques used to estimate counterfactuals and gauge causal effects is referred
to as impact evaluation. Next, we will learn three ways to estimate causal effects that avoid
selection and other biases.
5.3.2 Randomized-Control Trials
Organizations want to find what works best for them and do it. For instance, they may be
interested in contrasting the status quo of a program or policy versus a new idea or a set of
different new ideas. When trying to prescribe what to do, it’s helpful to think in terms of a medical
analogy. There is a diagnosis, i.e. the situation we believe is wrong or that could be improved.
There is also a treatment, i.e. the action that will cause the correction or improvement. It must be
something we can manipulate—academics may be interested in non-manipulated causes but
that’s not the case among practitioners. Lastly, there is an outcome or metric of interest, i.e. the
variable where success would be observed. In a nutshell, we attempt to estimate the causal effect
of the treatment on the outcome of interest, and decide whether the treatment should be
introduced, stopped or continued.
The key concept here is causality. There is no statistical test for causality. It is a logical—not
statistical—concept. The methods used for estimating causal effects are usually referred to as
impact evaluation. We will study three of the most common approaches used in impact
evaluation. Please look at the World Bank’s Impact Evaluation in Practice, which is freely available
online here.
Ideally, we would like to conduct an experiment or randomized-control trial (RCT), the gold
standard in impact evaluation. To see it in practice, imagine a company that delivers a newsletter
by email to a list of one million subscribers. The newsletter contains offers and is used as a sales
tool. A metric of success is email opening—it leads to sales. Someone in the company detects an
area of opportunity. The subject line in those emails is “impersonal and unappealing.” One idea
is to include the recipient’s given name in the subject line (e.g. Pablo, great deals just for you!). Other
people think such a strategy would become ineffective after a while, when the recipients get used
to seeing their name in the subject line. Someone suggests alternating subjects at different
frequencies. We can imagine ๐‘˜ different treatments in addition to the regular subject line (the
control group that represents the status quo). Treatment 1 would include the recipient’s name in
every newsletter. Treatment 2 would include the recipient’s name every other newsletter.
Treatment 3 would include the recipient’s name every three newsletters, and so on. In terms of Y_i
in the Rubin Model, we would have 1, 2, …, k alternative treatments and therefore k + 1 possible
outcomes Y_0i, Y_1i, Y_2i, …, Y_ki. What steps should we take?
Start with the list of all email recipients of the newsletter. Choose a random sample for the
experiment. Within that sample, randomize who gets which of the treatments and who doesn’t
get any. In other words, using a lottery we create the k treatment groups and a control group.
Different treatments are called arms of treatment. Apply the treatment and measure what
happens after time elapses (say, three months later). To determine what worked best, compare
the metrics of interest (the email opening rate). You could do this with a regression model:
y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ⋯ + β_k x_ik + ε_i
where x_i1 is a dummy indicating whether recipient i received treatment 1, x_i2 is a dummy
indicating whether recipient i received treatment 2, and so on. Since each treated recipient
receives one treatment only, then x_i1 + x_i2 + ⋯ + x_ik = 1 among treated recipients. Among
non-treated recipients we have x_i1 + x_i2 + ⋯ + x_ik = 0. Notice that:
E[y_i | x_i1 = x_i2 = ⋯ = x_ik = 0] = β̂_0
E[y_i | x_i1 = 1] = β̂_0 + β̂_1
⋮
E[y_i | x_ik = 1] = β̂_0 + β̂_k
In words, β̂_0 is the average opening rate in the control group, and β̂_0 + β̂_j is the average opening
rate in the arm of treatment j, where j = 1, 2, …, k. Our estimate of the causal effect of receiving
treatment 1 is given by the difference with respect to the control group:
E[y_i | x_i1 = 1] − E[y_i | x_i1 = x_i2 = ⋯ = x_ik = 0] = (β̂_0 + β̂_1) − β̂_0 = β̂_1
In a similar way, we get estimates of the causal effect of the other treatments. We can also compare
causal effects across treatments. For instance, we may be interested in whether the causal effect
of treatment 3 is greater than the causal effect of treatment 2. Our estimate of the difference
between the causal effects of those two treatments is:
E[y_i | x_i3 = 1] − E[y_i | x_i2 = 1] = (β̂_0 + β̂_3) − (β̂_0 + β̂_2) = β̂_3 − β̂_2
We already know that our estimates β̂_0, β̂_1, …, β̂_k have standard errors associated with them.
Therefore, we can measure significance, build confidence intervals, and test (joint) hypotheses.
The crucial part is that we can causally attribute differences in the metric of interest (the opening
rate) to differences in the treatment received (subject lines). Once we determine which treatment
works best, we can implement it in the whole mailing list.
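As a sketch of how this looks in code (Python here for concreteness; the same regression can be run in Stata or any statistical package, and the opening probabilities below are invented), we can simulate a control group and three treatment arms and regress an opening indicator on the arm dummies. The intercept recovers the control group’s average opening rate and each coefficient recovers the difference with respect to the control.

# Sketch of an RCT analysis: opening indicator regressed on treatment-arm dummies.
# The opening probabilities below are invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_per_group = 5_000
open_prob = {0: 0.20, 1: 0.26, 2: 0.24, 3: 0.22}   # arm 0 is the control (status quo subject line)

rows = []
for arm, p in open_prob.items():
    rows.append(pd.DataFrame({"opened": rng.binomial(1, p, n_per_group), "arm": arm}))
df = pd.concat(rows, ignore_index=True)

for j in (1, 2, 3):                                # one dummy per treatment arm; control is omitted
    df[f"x{j}"] = (df["arm"] == j).astype(int)

fit = smf.ols("opened ~ x1 + x2 + x3", data=df).fit()
print(fit.params)      # intercept near 0.20; x1, x2, x3 near the true effects 0.06, 0.04, 0.02
print(fit.conf_int())  # confidence intervals for each estimated effect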
To make perfectly clear why an RCT is the ideal way to measure causal effects, let’s revisit
our store upgrade example. Assume an RCT was performed (i.e. selection of stores for the
upgrade was random). Thus, since “coins were tossed” to decide which stores were upgraded,
we have that:
E[Y_0i | D_i = 1] = E[Y_0i | D_i = 0]
E[Y_1i | D_i = 1] = E[Y_1i | D_i = 0]
In words, we expect no selection bias of any kind (there may still be some difference due to the
luck of the draw, but the chances are tiny). Thus, the naïve comparison in an RCT produces the
ATE, which in turn is equal to the ATU and ATT:
ATE = ATT = ATU = E[Y_1i | D_i = 1] − E[Y_0i | D_i = 0]
Why are experiments so useful? The source of identification (the variation that allows us to
make causal claims) is the randomization of allocation into the treatment or control groups. That
means the only systematic difference across treatment and control groups (or between different
arms) is the treatment. Obviously, there is no selection bias. But it’s just as important to make
clear that there is no omitted variable bias either. That’s because the treatment status is not related
to any other determinant of the metric of interest. Last but not least, in experiments we usually
control sample size and therefore we have a handle on the precision of our estimates (remember
that sample size mechanically affects standard errors, which in turn determine significance and
confidence intervals).
At this point it’s crucial to establish the difference between two concepts: statistical significance
and economic relevance. In colloquial terms, statistical significance means our estimate is
different from zero, and it doesn’t look like that’s the result of chance—we have a small p-value,
regardless of its magnitude. In contrast, economic relevance means the magnitude of our
estimate indicates the causal effect makes an important difference—it has a considerable
magnitude, regardless of the p-value. They may or may not come together. Keep in mind they
are separate concepts. To see the difference, think of the following statements. Ceteris paribus,
sample size can always change the statistical significance of any (non-zero) estimate, even if it’s
small and therefore economically irrelevant. At the same time, ceteris paribus, a very high
opportunity cost can always make any estimate economically irrelevant.
When designing an experiment, we must consider both concepts to determine our sample
size. A sample size that is too large may result in a partial waste, but a sample that is too small
could result in a total waste. Let me explain both cases.
Having a sample that is too large is a concern because of costs. Think of out-of-pocket expenses
(e.g. data gathering, conducting surveys in the field, paying contractors to clean data, acquiring
the right hardware and software) as well as non-monetary costs (some organizations don’t like
the idea of fiddling with their operation at a large scale or with many of their customers). Given
those costs, having an unnecessarily large sample is a partial waste. We could do just as well with
a smaller, less costly sample. However, a sample size that is too small is worse. We may not be
able to tell whether whatever estimate we get is the result of luck or not (our standard errors
would be large). That would be a total waste.
How do we decide ex ante on the right sample size for an RCT? We need information on how
much the metric of interest varies. Intuitively, if the outcome of interest varies very little, then a
causal effect of a given magnitude is easier to identify than when the metric of interest varies a
lot. To illustrate this, let’s revisit the email opening example. Assume we know the standard
deviation of the number of newsletters opened is σ_Y. Assume also that the CEO of the company
considers appropriate a confidence level of 95%. Let’s focus on β̂_1, which is the causal effect of
treatment 1 relative to the control. If it’s close to zero, it means that treatment doesn’t work—it
doesn’t improve the opening rate. If it’s negative, it means it performs worse than the status quo.
By assuming β_1 = 0, we have a distribution of β̂_1 centered at zero with a standard deviation that
is an increasing function of the ratio σ_Y/√N, where N is the sample size. A larger N means a
narrower distribution of β̂_1. Thus, given the same estimate, larger samples would place that
estimate in the rejection region of the hypothesis β_1 = 0, and smaller samples would place it in
the no-rejection region. The graph below illustrates this idea.
The hypothesis tested is the same (β_1 = 0) and the estimate is also the same (β̂_1 = 1.5). The
difference in the distributions comes from different hypothetical sample sizes calculated a priori.
The orange distribution comes from a sample four times the size of the sample of the blue
distribution. In one case, the estimate would fall in the rejection region (orange distribution, with
larger sample size) and in the other it would fall in the no-rejection region (blue distribution, with
smaller sample size).
We know that a greater sample size results in a larger rejection region. At the same time, not
all magnitudes of causal effects are economically relevant. Why waste resources detecting the
magnitudes that are irrelevant? Based on economic criteria, we can define a priori the minimum
detectable effect or MDE we are interested in. Using σ_Y, we can compute the sample size
consistent with such MDE. For instance, imagine treatments are costly. We are only interested in
effects that surpass their costs, which means they are above a threshold B > 0. We can use that
threshold to reverse engineer the appropriate sample size. We start with the number B, which is
the minimum magnitude that is relevant to us. We calculate the “right” sample size given σ_Y and
a confidence level, so that the MDE is equal to B.
The table below shows different sample sizes as a function of the desired MDE given two
possible values of the standard deviation of the metric of interest (σ_Y = 1 and σ_Y = 2). We use
95% confidence, but we could pick any level we want. We assume a control group and just one
treatment group (of equal size). The sample size shown includes both groups (control and
treatment). The principle is simple. We find the sample size that (a priori) would lead us to reject
the hypothesis β = 0 if we had β̂ = B. Keep in mind that the sample size enters the calculation
through the standard error of β̂. As an example, assume we want an MDE of 0.4. With a standard
deviation of 1, the sample size should be 200. A larger sample size would allow us to detect
smaller causal effects, but those effects would be economically irrelevant. If the standard
deviation is 2, then we need a sample size of 788 to achieve the same MDE. Statistical software
does this for us very easily.
Minimum detectable effect      Sample size        Sample size
(at 95% confidence)            (σ_Y = 1.00)       (σ_Y = 2.00)
0.1                            3,142              12,562
0.2                              788               3,142
0.3                              352               1,398
0.4                              200                 788
0.5                              128                 506
0.6                               90                 352
0.7                               68                 260
0.8                               52                 200
0.9                               42                 158
1.0                               34                 128
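The calculation behind a table like this can be sketched in a few lines of Python. The sketch below assumes equal-sized control and treatment groups and the conventional 80% power; the text only mentions the 95% confidence level, so treat the power choice as an assumption, although with it the formula reproduces numbers very close to the table.

# Sketch: total sample size (control plus one equal-sized treatment arm) needed to detect
# a given MDE at 95% confidence. Assumes 80% power, which is not stated in the text.
from scipy.stats import norm

def total_sample_size(mde, sigma, confidence=0.95, power=0.80):
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)   # 1.96 for 95% confidence
    z_beta = norm.ppf(power)                       # 0.84 for 80% power
    n_per_group = 2 * ((z_alpha + z_beta) * sigma / mde) ** 2
    return 2 * n_per_group                         # both groups combined

for sigma in (1.0, 2.0):
    for mde in (0.1, 0.4, 1.0):
        print(f"sigma = {sigma}, MDE = {mde}: N of about {total_sample_size(mde, sigma):,.0f}")
# For example, sigma = 1 and MDE = 0.4 gives roughly 200 in total, in line with the table.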
Experiments may seem like a silver bullet, but they face important challenges. Some of those
challenges are technical (e.g. subjects knowing they are part of an experiment), ethical (e.g. when
the control group is excluded from a potentially beneficial treatment) or legal (e.g. price
discrimination). Sometimes it’s logistically impossible to run an experiment (e.g. when there is
no way to exclude the control group from the treatment). In those circumstances we must rely on
quasi-experiments, which are situations that, to some extent, resemble an experiment. The most
common quasi-experimental approaches are Regression Discontinuity Design and Difference-in-Differences.
5.3.3 Regression Discontinuity Design
In some situations, a treatment is assigned based on whether a variable (known as the
running variable or the assignment variable) passes a threshold (a cutoff point). If the treatment
has a causal effect, then we expect a discontinuity (a “jump”) in the outcome of interest at the
cutoff. As an example, think of the problem of determining whether attending a selective
enrollment school makes a difference in earnings in adulthood. Suppose all applicants must take
an admissions test. We denote the score on that test by the variable x. Only applicants with a
score equal to or above C are admitted. In other words, applicant i is admitted if and only if x_i ≥ C.
All applicants who don’t make the cutoff are rejected and attend a non-selective school (which
is believed to be worse). A few years later, when those subjects are thirty years old, we observe
their annual earnings, which we denote by y. If attending the selective school has an effect on
earnings down the road, we expect greater average earnings among its graduates in comparison
to the graduates of the non-selective option. However, a simple comparison of averages could be
misleading. After all, the fact that students are screened for admission into the selective option
creates a selection bias. In other words, even if the causal effect we are interested in is zero, we
may find a difference between the earnings of graduates of the two schools.
To avoid the selection bias, we can look at applicants with test scores in the vicinity of the cutoff
C. We could argue that, if we look close to the cutoff, whether an applicant was admitted or not
is a matter of luck. In fact, when we zoom in, it looks like a small experiment where, solely by
chance, some students got a few more points than others in the admissions test, but on average
they are similar in any other respect. Thus, any difference in earnings between applicants right
below the cutoff and applicants right above the cutoff must be the result of attending different schools.
The following graph illustrates this point. The cloud of points represents earnings at age thirty
and scores in the admissions test. The points to the left of the cutoff C correspond to applicants
who attended the non-selective school, whereas the points to the right correspond to applicants
who attended the selective school. Since earnings in adulthood and academic performance are
usually related, we expect the cloud to show a positive trend. But that trend should be smooth—
without jumps. If there is a jump at the cutoff, then we can attribute the difference in earnings to
the difference in schools. We can fit a model with a jump at C. Let d be a dummy such that d_i = 0
if x_i < C, and d_i = 1 if x_i ≥ C. In words, d represents the treatment (defined as attending the
selective school instead of the non-selective school). Our regression model is y_i = α + βx_i + γd_i + ε_i.
The coefficient γ̂ is our estimate of the causal effect at the discontinuity created by the cutoff C.
This setup is called Regression Discontinuity Design or RDD.
RDDs may come in different forms. For instance, we may want to use a polynomial or even
different polynomials at each side of the cutoff. If we truly have a discontinuity, we expect our
regression to catch it. To show this, let’s look at another example. Suppose a company is deploying
a training program for its workforce. They can only afford to train 600 of their 1500 workers. To
decide who is trained and who isn’t, the company gives priority to the youngest workers. All 1500
workers are sorted according to their age in months. The youngest 600 are sent to training. One
year later, the company measures the productivity of all 1500 workers. They want to determine
whether the training program made a difference or not in terms of worker performance. In this
case, our running variable x is age in months and our metric of interest y is worker performance
a year after the training program took place. The treatment was given only to the 600 youngest
workers. Let’s suppose that the oldest treated worker was A months old at the moment of the
selection. Then, we define d as a dummy such that d_i = 1 if x_i ≤ A, and d_i = 0 if x_i > A. The graph
below shows the cloud of points in terms of performance and age of the workers.
Since it doesn’t look linear, we introduce in our regression quadratic polynomials at each side
of the cutoff. Our model would be:
y_i = β_0 + β_1 x_i + β_2 x_i² + d_i (γ_0 + γ_1 x_i + γ_2 x_i²) + ε_i
    = β_0 + β_1 x_i + β_2 x_i² + γ_0 d_i + γ_1 d_i x_i + γ_2 d_i x_i² + ε_i
    = (β_0 + γ_0 d_i) + (β_1 + γ_1 d_i) x_i + (β_2 + γ_2 d_i) x_i² + ε_i
The estimate of the causal effect is the jump at the discontinuity A, which in this model equals
γ̂_0 + γ̂_1 A + γ̂_2 A² (it reduces to γ̂_0 if we re-center the running variable at the cutoff), and
it’s positive in the graph above. Notice that the slope of our fitted model could be different
around the discontinuity. At the cutoff, the slope approaching A from the right would be
β̂_1 + 2β̂_2 A, whereas approaching it from the left it would be β̂_1 + γ̂_1 + 2(β̂_2 + γ̂_2)A. If γ̂_1 ≠ 0 or γ̂_2 ≠ 0
then the slopes at the discontinuity would differ.
By now it should be clear that we can use polynomials in our RDD. We can also add covariates
and interactions. However, we must always verify compliance with the cutoff. We must make sure
there are no signs of manipulation or cheating of the running variable. The interpretation of the
magnitude of our estimate of the causal effect is valid close to the discontinuity, not far. Please
explain in your own words why.
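Here is a sketch of the worker-training RDD in Python. The data are simulated and every number in it (the cutoff of 420 months, the curvature, the true jump of 5 points) is invented just to show the mechanics. The running variable is re-centered at the cutoff, so the coefficient on the dummy is the estimated jump, and the interaction terms let the quadratic differ on each side.

# Sketch of the training-program RDD with quadratic trends on each side of the cutoff.
# All numbers (cutoff age, curvature, true jump of 5 points) are invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 1_500
age = rng.uniform(240, 660, n)                 # running variable: age in months
cutoff = 420                                   # assumed age of the oldest treated worker
d = (age <= cutoff).astype(int)                # treated: the youngest workers

# Smooth quadratic relation between age and performance, plus a jump of 5 for the treated.
performance = 60 + 0.08 * age - 0.0001 * age**2 + 5 * d + rng.normal(0, 3, n)

df = pd.DataFrame({"perf": performance, "xc": age - cutoff, "d": d})   # center x at the cutoff
fit = smf.ols("perf ~ xc + I(xc**2) + d + d:xc + d:I(xc**2)", data=df).fit()
print(fit.params["d"])   # estimated jump at the cutoff, close to the true 5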
When using an RDD, it’s inevitable to run into questions related to whether we should limit
our analysis to a small vicinity around the cutoff and whether we should use linear, quadratic,
cubic, or higher order polynomials to control for the smooth trend. There is no rule of thumb to
determine what should be done. Instead of choosing one particular model, it’s convenient to think
about trying different combinations of vicinities and trend controls as robustness checks.
Always remember that, when we use an RDD, our estimates are informative of causal effects
close to the cutoff, and their validity hinges on compliance with the cutoff.
5.3.4 Difference-in-Differences
In many instances, we observe the outcome of interest across periods for a control group and
a treatment group. If we assume that in absence of the treatment the trend would be the same across
the two groups, we can estimate the causal effect of the treatment. This assumption can be
referred to as parallelism, because it means we would observe parallel trends in the control and
treatment groups in the outcome of interest if there wasn’t a treatment. In its simplest form, what
we observe can be described by the following table. The value inside each cell represents the
observed average of the outcome we care about. In other words, the table depicts facts.
Observed averages     Before treatment     After treatment     Difference
Not treated           A                    C                   C − A
Treated               B                    D                   D − B
To determine the effect of the treatment, we must create a counterfactual, which in this case
is the average we would observe in absence of the treatment in the treated group. When we apply
a Difference in Differences or Diff-in-Diff approach, we assume a parallel trend in time across
groups. If the average grew from A to C among the non-treated and we assume a parallel trend
among the treated, then without the treatment the average among the treated would have grown
by C − A. Since the starting point is B, the ending point would be B + (C − A). That’s the
counterfactual we are looking for. The estimate of the causal effect of the treatment is the
difference between the observed average, D, and the counterfactual, B + (C − A).
Treatment effect = D − [B + (C − A)]
If we rearrange the expression, we get a more intuitive expression—a double difference:
Treatment effect = (D − B) − (C − A)
The first term at the right-hand side is the observed difference across periods among the treated
units. The second term is the observed difference across periods among the untreated units. Our
estimate of the treatment effect is the difference between the two differences—hence the name of the
method. Notice that the idea is general and doesn’t depend on the order of the differences.
Treatment effect = (D − B) − (C − A) = (D − C) − (B − A) = D − [C + (B − A)]
Implicitly, we are assuming the counterfactual is C + (B − A). In words, we could also say that
the counterfactual is built as the observed average after the treatment among the non-treated,
plus the difference across groups before the treatment took place. That’s just another way to
interpret the same assumption of parallel trends.
Let’s look at a graphic version of Diff-in-Diff. The horizontal axis represents the two periods.
The two solid dots denote observed averages. If we assume parallel trends absent the treatment,
then the change we would expect among the treated in absence of the treatment is C − A. To
preserve parallelism, we add that amount to the starting point for the treated, which is B. The
difference between D and the counterfactual average B + (C − A) is our estimate of the treatment
effect.
The Diff-in-Diff approach is very intuitive. How do we implement it in the regression context?
We start with two dummies. The first dummy, d^T, denotes whether an observation belongs to the
treated group (d^T = 1) or not (d^T = 0). The second dummy, d^A, indicates whether an observation
corresponds to the period before (d^A = 0) or after (d^A = 1) the treatment. Their interaction, d^T d^A,
indicates the situation where the treated group has been treated. The outcome of interest can be
expressed as:
y_i = α + β d_i^T + γ d_i^A + δ d_i^T d_i^A + ε_i
By substituting the four possible combinations of d_i^T and d_i^A we can clearly see the
correspondence between the coefficients in the regression and the observed averages.
Facts: observed averages          Not treated (d_i^T = 0)     Treated (d_i^T = 1)
Before treatment (d_i^A = 0)      A = α                       B = α + β
After treatment (d_i^A = 1)       C = α + γ                   D = α + β + γ + δ
Let’s make sense of the above equalities. The coefficient α is catching the pre-treatment level
among the non-treated. The coefficient β is catching the difference between treated and non-treated
in the absence of the treatment—bear in mind that pre-treatment averages don’t need to be
the same. The coefficient γ is catching the trend in absence of treatment among the non-treated.
Lastly, the coefficient δ is catching the trend among the treated that is above (or below) the trend
among the untreated, and that’s our estimate of the treatment effect. In other words, our Diff-in-Diff
estimate of the treatment effect is the coefficient on the interaction between the dummies:
Treatment effect = (D − B) − (C − A)
= ([α + β + γ + δ] − [α + β]) − ([α + γ] − [α])
= (γ + δ) − (γ)
= δ
If we estimate the regression y_i = α + β d_i^T + γ d_i^A + δ d_i^T d_i^A + ε_i, the equalities between
regression coefficients and averages displayed in the table above would necessarily hold. However,
if we include other regressors as controls, the equalities will not hold in general because our
regression coefficients would reflect averages adjusted for other factors. The interpretation is
similar, but the values wouldn’t be identical.
It’s important to note that we run regressions and not just create a table with observed
averages because of two reasons. First, a regression allows us to control for other variables—we
already discussed the benefits of this. Second, a regression allows us to make statements about
significance and test hypotheses. Anyone can compute a table like the one above. But it takes a good
understanding of econometrics to interpret it in a way that is helpful to make decisions.
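A minimal sketch of the regression version, with simulated observations (the dollar figures and the true effect of 30 are invented), looks like this. The coefficient on the interaction term is the Diff-in-Diff estimate of δ.

# Sketch of the two-dummy Diff-in-Diff regression on simulated data. The true
# treatment effect is set to 30 so we can check the interaction coefficient.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 400
treated = rng.integers(0, 2, n)        # d_T: 1 = treated group
after = rng.integers(0, 2, n)          # d_A: 1 = post-treatment period

# Parallel trends by construction: both groups grow by 20, the treated start 50 higher,
# and the treatment adds another 30 only for treated units in the post period.
sales = 500 + 50 * treated + 20 * after + 30 * treated * after + rng.normal(0, 10, n)

df = pd.DataFrame({"sales": sales, "treated": treated, "after": after})
fit = smf.ols("sales ~ treated + after + treated:after", data=df).fit()
print(fit.params["treated:after"])     # the Diff-in-Diff estimate, close to 30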
Let’s consider a practical example. A company that owns a chain of retail stores has upgraded
a few of them that are in the urban area of Chicago. The rest, which are located in suburban areas,
weren’t upgraded. With data from two years and the two types of locations you can estimate a
causal effect. Assume the treatment took place at the end of 2018. Thus, you can consider 2018 the
pre-treatment period and 2019 the post-treatment period.
Year        Chicago stores
            Suburban     Urban
2018        A            B
2019        C            D
We could apply the regression model above and estimate δ. It should be obvious that our
assumption of parallel behavior may be off. Do we really expect the same trends in urban and
suburban locations? It’s a valid criticism but at this point there isn’t much you can do to address
it.
Now, imagine the company also has stores in the metropolitan area of Milwaukee. The good
news is, Milwaukee hasn’t been reached by this upgrade program. Could we use that additional
information? Yes. Milwaukee may provide information on whether sales trends in urban stores
are parallel to sales trends in suburban stores.
Year        Milwaukee stores             Chicago stores
            Suburban     Urban           Suburban     Urban
2018        E            F               A            B
2019        G            H               C            D
In Milwaukee, the trend in average urban sales is equal to H − F, whereas the trend in average
suburban sales is G − E. None of those stores has been upgraded. Thus, we can measure the
difference in trends in absence of treatment (something we cannot do for Chicago stores). The
difference in trends across urban and suburban stores absent the treatment is (H − F) − (G − E).
This is a crucial piece of information. If this difference in trends is nonzero, then our Diff-in-Diff
estimates for Chicago based on the formula (D − B) − (C − A) may be off. A natural idea is to
subtract the trend in Milwaukee from the Diff-in-Diff estimate for Chicago:
Treatment effect = [(D − B) − (C − A)] − [(H − F) − (G − E)]
Intuitively, this is a triple difference. It is the difference between two Diff-in-Diff estimates. One
of them has the treatment, and the other doesn’t—that’s our decoy. The decoy allows us to account
for the trend. The assumption of parallelism was relaxed a little. It’s still there but in a subtler way.
Notice that, if the trends are truly parallel between urban and suburban stores in Milwaukee, then
(H − F) − (G − E) = 0 and the triple-difference estimate would be identical to the Diff-in-Diff estimate.
How do we implement the triple difference? We create a new dummy, denoted by d_i^C,
indicating Chicago stores (d_i^C = 1) or Milwaukee stores (d_i^C = 0). The role of the treated-group
dummy is now played by d_i^U, which indicates urban stores (d_i^U = 1) versus suburban stores
(d_i^U = 0). We interact the Chicago dummy with our previous model and add new coefficients
(knowing the Greek alphabet comes in handy):
y_i = η + θ d_i^U + λ d_i^A + μ d_i^U d_i^A + d_i^C × (ρ + τ d_i^U + φ d_i^A + ψ d_i^U d_i^A) + ε_i
If we rearrange the expression, we get:
y_i = η + θ d_i^U + λ d_i^A + μ d_i^U d_i^A + ρ d_i^C + τ d_i^U d_i^C + φ d_i^A d_i^C + ψ d_i^U d_i^A d_i^C + ε_i
If we go back to the table format and substitute all dummies by their respective values (0 or 1),
we obtain the expression for the average in each cell.
Year                  Milwaukee stores (d_i^C = 0)          Chicago stores (d_i^C = 1)
                      Suburban        Urban                 Suburban          Urban
                      (d_i^U = 0)     (d_i^U = 1)           (d_i^U = 0)       (d_i^U = 1)
2018 (d_i^A = 0)      η               η + θ                 η + ρ             η + θ + ρ + τ
2019 (d_i^A = 1)      η + λ           η + θ + λ + μ         η + λ + ρ + φ     η + θ + λ + μ + ρ + τ + φ + ψ
In this case, our triple-difference estimate of the treatment effect is:
Treatment effect = [(D − B) − (C − A)] − [(H − F) − (G − E)]
= [(λ + μ + φ + ψ) − (λ + φ)] − [(λ + μ) − λ]
= [μ + ψ] − [μ]
= ψ
In words, the triple-difference estimate of the causal effect is the coefficient on the interaction
of the three dummies, i.e. d_i^U d_i^A d_i^C.
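In code, the triple difference is just one more layer of dummies and interactions. This sketch uses simulated data with invented magnitudes; the coefficient on the three-way interaction recovers ψ even though the urban-versus-suburban trend gap is not zero.

# Sketch of the triple-difference regression: urban, after, and Chicago dummies, all
# pairwise interactions, and the three-way interaction whose coefficient plays the role of psi.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 2_000
urban = rng.integers(0, 2, n)      # d_U
after = rng.integers(0, 2, n)      # d_A
chicago = rng.integers(0, 2, n)    # d_C

# Invented components: group and period effects, a non-parallel urban trend (8) in both cities,
# and a true upgrade effect of 25 only for Chicago urban stores after the treatment.
sales = (500 + 40 * urban + 15 * after + 30 * chicago
         + 8 * urban * after + 25 * urban * after * chicago
         + rng.normal(0, 10, n))

df = pd.DataFrame({"sales": sales, "urban": urban, "after": after, "chicago": chicago})
fit = smf.ols("sales ~ urban * after * chicago", data=df).fit()
print(fit.params["urban:after:chicago"])   # the triple-difference estimate, close to 25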
Now, someone may question the validity of using Milwaukee as a comparison group for
Chicago. Some may argue that urban-vs-suburban trends in one city may not be parallel across
cities. Can we do anything else? It all depends on the availability of data. Imagine that we have
data from 2016 and 2017 for both cities.
Year        Milwaukee stores             Chicago stores
            Suburban     Urban           Suburban     Urban
2016        I            J               M            N
2017        K            L               O            P
The analogue of our triple-difference estimate for this decoy period, 2016-2017, is:
[(P − N) − (O − M)] − [(L − J) − (K − I)]
What does it mean? It’s the gap in trends in the urban-suburban differences between Chicago and
Milwaukee before the treatment took place. We can estimate the causal effect using a quadruple
difference, that is, the difference between triple differences:
Treatment effect = {[(D − B) − (C − A)] − [(H − F) − (G − E)]}
− {[(P − N) − (O − M)] − [(L − J) − (K − I)]}
Obviously, we can write the regression model equivalent to the expression above by simply
doubling the terms in our previous model. The logic is exactly the same as before. In cases like
this, writing the equation is more complicated than running the actual regression in Stata. As an
exercise, write the equation for the quadruple difference.
Every time we take an additional difference, we are purging out an additional trend and
making our estimates more believable. Think of having more and more sophisticated decoys with
every additional difference.
It’s important to say that the Diff-in-Diff approach doesn’t require us to have information
across periods. That is its most natural context, but we can think of the same type of estimation
using cross-sectional data—i.e. data for one period alone. Imagine we have the same problem as
before, but you only observe data for 2019. What could you do? Would you still be able to
compute differences? The answer is affirmative. You’d go back to the simpler double difference
model using Milwaukee and Chicago:
Location        Type of stores
                Suburban     Urban
Milwaukee       A            B
Chicago         C            D
As before, we can think of the difference between D and B + (C − A) as our estimate of the
treatment effect. How believable our estimates are hinges on the assumption of parallelism.
5.3.5 Other techniques
There are other quasi-experimental approaches that are less frequently used in practice, and
there is a good reason. Stakeholders (those who use the estimates) prefer to make decisions based
on what is more convincing and intuitive. Hence RDD and Diff-in-Diff are the more commonly
used quasi-experimental techniques. But there are other methods—mostly used in academia.
Here are three. If you want to learn more about them, please see the World Bank book Impact
Evaluation in Practice.
First, we have Instrumental Variables or Two-Stage Least Squares. This is a very specific
type of quasi-experiment in which we find a variable (the instrument) that induces a treatment in
an exogenous way and isn’t directly related to the outcome of interest. We exploit that exogenous
variation to measure the causal effect. In practice, it’s difficult to find an instrument that is both
exogenous and unrelated to the outcome of interest. And if you find one, it’s just as difficult to
convince people of its validity (i.e. that satisfies the two assumptions).
Second, we have matching. Basically, we find matches for the treated subjects among a group
of non-treated subjects. This is very intuitive but also very unconvincing. Think about the
following question. Why would there be apparently similar people some of whom took the
treatment while others didn’t? Unless we have an experiment, there are good reasons to be skeptical
about this method.
Third, we have propensity score matching. In a way, it’s similar to matching, but the
matching is done by groups in terms of their probability of being treated. This method is neither
intuitive nor convincing. That’s why it’s rarely used to make decisions in practice. Its use is mostly
confined to academic studies.
6 Additional topics
Even if you don’t run regressions for a living, you’re probably going to encounter them.
Perhaps someone will try to persuade you by showing you some regression results. Here are a
few concepts to keep in mind.
6.1 Robustness
We say a result is robust if it doesn’t change much when we consider reasonable variations
in our analysis. Those variations are called robustness checks, and here are some examples. First,
we can try different samples with similar information (e.g. other periods or regions, some
subgroups). We can include or exclude different sets of controls. We can try different functional
forms (e.g. linear vs quadratic, cubic, logarithmic, interactions).
There is no formal test for robustness. It’s a qualitative result based on common sense. Consider
the following examples of results that wouldn’t be robust. We use a different sample in the gym
attendance prediction, and the predicted values change when we use a previous cohort. We use
a different functional form in an RDD, and the discontinuity estimate changes when we fit a
quadratic polynomial in the running variable instead of a linear polynomial. In a Diff-in-Diff
model, the estimates of the causal effect change when we add a triple or a quadruple difference.
To show that a result is robust, practitioners should display (or at the very least mention)
results with different samples, sets of controls and functional forms. When they don’t, be
suspicious.
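One practical way to keep yourself honest is to run the same question through several reasonable specifications and display the estimates side by side. The sketch below is purely illustrative: the data and column names are invented, and the point is only the habit of looping over specifications (different controls, different samples) and reporting the coefficient of interest for each one.

# Sketch: a small robustness table, re-running a regression of interest under different
# sets of controls and on a different sample. Data and column names are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 1_000
df = pd.DataFrame({
    "x": rng.normal(10, 2, n),
    "age": rng.integers(20, 60, n),
    "female": rng.integers(0, 2, n),
})
df["y"] = 5 + 2 * df["x"] + 0.1 * df["age"] + rng.normal(0, 3, n)

specs = {
    "baseline":        ("y ~ x", df),
    "with controls":   ("y ~ x + age + female", df),
    "ages 20-39 only": ("y ~ x + age + female", df[df["age"] < 40]),
}
for name, (formula, data) in specs.items():
    fit = smf.ols(formula, data=data).fit()
    lo, hi = fit.conf_int().loc["x"]
    print(f"{name:15s}  coefficient on x = {fit.params['x']:6.3f}   95% CI [{lo:.3f}, {hi:.3f}]")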
6.2 Forensic analysis
Professional life involves reviewing the empirical analyses of other people. It could be
colleagues, intellectual adversaries, or academic researchers. To understand those results and
determine how much we believe them, we should proceed by deconstructing the cake. The first step
is to determine what the analysis is trying to do. We gain a lot of mileage from having the goal
clear. Is it trying to produce a description, a prediction or a prescription? Remember that the way
we judge the quality of a description is different from the way we judge the quality of a prediction
or a prescription.
The second step is to try to prove the analysis wrong. Given its goal, is the empirical strategy
credible? For instance, if we are looking at an RDD, ask yourself if there was good compliance
with the cutoff. If we are looking at a Diff-in-Diff analysis, ask yourself if the parallelism
assumption is reasonable. Do the results seem robust? Sometimes people cherry-pick models to
get small p-values that favor their hypothesis (something known as p-hacking). Make sure you
look at, or at least ask about, other model specifications (polynomials of higher order, interactions,
etc.). You should also ask whether confidence and significance are properly calculated, reported,
and interpreted. Remember that people have a difficult time interpreting confidence intervals.
The third step is to get under the hood and look at the regression equation. There is nothing
as frustrating as discussing the results of a regression without looking at the actual model used.
We must also know the exact method. There are multiple ways to fit a cloud with a linear
structure. Although they all resemble what we saw, they are not identical. There are methods like
Probit, Logit, or Maximum Likelihood. If you run into them and you have a chance, ask the
authors of the analysis how they think the results would differ if they used the standard method
of minimizing the square of the vertical distance, which is known as Ordinary Least Squares.
Ask how the data were treated or manipulated. For instance, how are categorical variables and
missing values treated?
You aren’t mathematicians, computer scientists or statisticians. You are economists—use
economic thinking to judge what makes sense.
6.3 Artificial intelligence, machine learning and big data
In recent times, concepts like big data, artificial intelligence, and machine learning have
captured the imagination of many journalists and laypeople. Shouldn’t we analyze those concepts
instead of the method introduced by Galton over a century ago? The short answer is no. Those
concepts are building blocks that help—but don’t replace—the concepts you’ve learned in this
course. Think of the classic example of artificial intelligence in which a computer must distinguish
images of Chihuahua dogs and blueberry muffins. Computers are better than humans at telling
differences, or more generally, figuring out patterns (once they’ve been trained). But in the end,
they are nothing but fancy models of numerical prediction.
Additionally, there is the issue of what question we want to answer and whether we are
interpreting its probabilistic aspects correctly. Put simply, how would we use the ability to
tell Chihuahuas from blueberry muffins? I’m not trivializing the technological advance. But the
same can be said about previous advances—personal computers or the Internet. A lot of data and
huge computing power doesn’t necessarily mean better empirical analysis. If the analyst doesn’t
know what he or she is doing, it may be a bad thing. The methods and the data are the vehicle.
The most important aspect is knowing where we are going so that we can drive that vehicle to
our destination. It doesn’t really matter how nice the vehicle is, if we don’t know where we are
going, we might as well not go anywhere.
Machine learning is another concept that has caught on. It simply means that we substitute
what an analyst would do with an algorithm. The nice property is that the results produced feed
the algorithm. Imagine we are trying to maximize the opening rate of email newsletters. We can
pick the time of day for each person. How can we do that? We can design a sequence of RCTs
and measure what works best. An analyst would run each RCT, look at the results, and decide
whether to try a different time in the next RCT, and whether it should be earlier or later. By laying
out these steps in an algorithm, we would be collecting data and accumulating actionable
information. That process would be autonomous. It could even be a permanent process,
continually trying new times—in case the schedule of preferences of subscribers changed. A lot
of common sense goes into the use of artificial intelligence and machine learning. They are
complements, not substitutes of your skills.
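A toy version of that autonomous loop is an epsilon-greedy rule: most of the time send at the time slot that has worked best so far, and occasionally try a random slot to keep learning. The sketch below is purely illustrative (the opening probabilities per slot are invented), and a real system would add proper statistical tests and guardrails on top.

# Toy sketch of an autonomous sequence of experiments for choosing a send time.
# Epsilon-greedy rule: usually exploit the best slot so far, sometimes explore at random.
# The true opening probabilities per slot are invented for illustration.
import numpy as np

rng = np.random.default_rng(6)
true_open_prob = {"8am": 0.18, "noon": 0.22, "6pm": 0.25, "9pm": 0.20}
slots = list(true_open_prob)
opens = {s: 0 for s in slots}
sends = {s: 0 for s in slots}
epsilon = 0.10                    # share of traffic reserved for experimentation

for email in range(50_000):
    if email < len(slots) or rng.random() < epsilon:
        slot = slots[rng.integers(len(slots))]                          # explore
    else:
        slot = max(slots, key=lambda s: opens[s] / max(sends[s], 1))    # exploit best so far
    sends[slot] += 1
    opens[slot] += rng.binomial(1, true_open_prob[slot])

print({s: round(opens[s] / max(sends[s], 1), 3) for s in slots})   # estimated opening rates
print("most sends went to:", max(sends, key=sends.get))            # should settle on the 6pm slot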
Another concept that has received attention is big data. We could say that it refers to the wealth
of information created in day-to-day transactions. You wake up in the morning. As you grab your
phone, there is a record of when you started looking at it. If you went to Instagram or the New
York Times app, you also leave records. What posts you saw and which ones you liked, what
music you were listening to, which emails you answered first, which ones you discarded without
opening. Then you take public transportation or a Divvy bike. That information is also stored.
When you scan your ID at your office, there is a record of when you arrived. The same for every
time you go in or out. At lunch, you go to a restaurant and use a loyalty card. You go home and order
from Amazon. You also ask Alexa to play some music. You ask Waze for directions to a friend’s
house or take an Uber. You go home and stream a movie, leaving a record of the shows you browsed.
And so on. That is without counting your performance measures at work, your grades at school,
your travel records, etc. In addition to that, you can take DNA tests.
DNA tests are particularly interesting because they pose the problem of false positives.
Imagine that we ask a large group of people which statement they agree with more: “I deeply
dislike Tom Brady” or “I am a fan of Tom Brady.” Let’s code it so that 1 means being a fan of the
New England Patriots quarterback. Assume we have binary information about 100 genes.
Remember that if we have 100 regressors, by sheer luck we can expect 5 of them to be significant
at 95% confidence. Similarly, we can expect one coefficient to be significant at 99% confidence.
We would call this the Brady-fan gene. However, that wouldn’t be solid evidence—to say the least.
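It’s easy to see this mechanically with a simulation. In the sketch below, the outcome and the 100 binary “genes” are all independent noise by construction, and still a handful of coefficients typically come out “significant” at 95% confidence.

# Sketch: false positives from many regressors. By construction the outcome is unrelated
# to all 100 binary "genes", yet some p-values fall below 0.05 by luck alone.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, k = 2_000, 100
genes = rng.integers(0, 2, size=(n, k))          # 100 unrelated binary regressors
brady_fan = rng.integers(0, 2, n)                # outcome: fan (1) or not (0), pure noise

fit = sm.OLS(brady_fan, sm.add_constant(genes)).fit()
significant = (fit.pvalues[1:] < 0.05).sum()     # skip the constant
print(f"{significant} of {k} coefficients look 'significant' at 95% confidence by chance")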
Having many regressors brings the potential problem of false positives. If you dig enough,
you are always going to find a pattern that looks highly unlikely. Think of sports broadcasters
when they say “this is the first time in major league baseball that a lefty rookie pitcher has struck
out three right-handed batters in a row after walking three left-handed batters in a post-season
game.” Did we just witness a highly unlikely, one-of-a-kind event? Or is it the case that, if we dig
enough, we’d find that in its own way everything is a first? Adding regressors to a model is like
dicing more finely the categories and therefore mechanically increasing the chances of finding
something “special.”
In sum, when you hear the terms artificial intelligence, machine learning or big data, keep in
mind that those concepts are not substitutes of the methods you learned in this course. Rather,
they are complements. If you have a good grasp of the concepts taught in this course, you will be
in a better position to make the most out of artificial intelligence, machine learning and big data.