
Lecture 4 - Univariate Linear Regression

Univariate Linear Regression
Selcen Cakir
November 16, 2021
In most of econometrics, we are interested in answering causal questions. What is the
causal effect of studying Economics at Bogazici on an individual’s lifetime income? What
is the causal effect of having a new international trade partner on a country’s GDP? What
is the effect of recruiting more economists on a firm’s profits? The examples are numerous.
We have seen that to answer a causal question, we must make ceteris paribus, i.e., apples
to apples comparisons. One way to do this is to conduct randomized control trials (RCTs),
where we construct two samples that are on average identical except for the treatment status.
However, depending on the question, conducting RCTs may be infeasible, too expensive, or
unethical. Therefore, we need to find ways to answer causal questions of interest with
observational data (rather than trying to get experimental data).
Sometimes we may want to ask whether we can use observational data to make ceteris paribus
comparisons. This could be possible, for example, if the selection bias between the treatment
and control groups can be captured by observable characteristics. For example, suppose
that we want to study the effect of studying economics at Bogazici on someone’s wages. We
know that a simple comparison of wages of people with and without a degree from Bogazici
economics will capture the treatment effect plus selection bias. Selection bias appears because
people who study at Bogazici economics are ranked very highly in the university entrance
exam, and comparing their outcomes to the outcomes of people who ranked lower is unlikely
to be an apples-to-apples comparison. To make an apples-to-apples comparison, we want to
compare individuals who have similar exam scores. In econometrics language, we say that
we want to control for characteristics that cause the selection bias. Regression is a valuable
tool for controlling for such characteristics.
Regression has another important advantage. If our treatment variable is continuous
rather than binary as in previous lectures, we can use regression analysis to find the treatment
effect. In addition to these, we will see that regression analysis has some really nice properties.
For example, we will call it BLUE, which means that it is the best linear unbiased estimator.
We will define what each of these words means.
Figure 1
This note describes the mechanics of regression analysis when we have a treatment variable and no control variables. We call this a univariate regression. In a few weeks, we will
start learning about how to add control variables to our regression.
1 An example
Suppose that you are hired as an econometrician by the Ministry of Education to find the
causal effect of reducing the class size on students’ university entrance exam performance.1
You asked the Ministry to provide data on each school’s student-teacher ratio and average
test score per class. Then, you plotted those variables to obtain Figure 1.
Figure 1 shows that there is a lot of dispersion in the data, even for schools with the same
student-teacher ratio. This tells us that the student-teacher ratio is not a perfect predictor
of average test scores. Why so?
How can we summarize the relationship in the plot? We start by finding the Conditional
Expectation Function, shortly the CEF. We find the CEF by computing the average value of
the outcome variable for each possible value of the explanatory variable, as in Figure 2.
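As a sketch of this computation: group the observations by each value of the regressor and average the outcome within each group. The data below are simulated stand-ins (the student-teacher ratios, coefficients, and noise level are illustrative, not the Ministry's actual numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the Ministry's data: 200 classes.
str_ratio = rng.integers(15, 31, size=200)                     # student-teacher ratio
test_score = 700 - 2.3 * str_ratio + rng.normal(0, 10, 200)    # average test score

# The CEF: the average outcome at each observed value of the regressor.
cef = {x: test_score[str_ratio == x].mean() for x in np.unique(str_ratio)}
```

Plotting `cef` against the scatter of the raw data reproduces the shape of Figures 2 and 3.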
Let’s add the CEF to the scatterplot of our raw data, as in Figure 3. The CEF is a
reasonable predictor of the relationship between the class size and the average test score,
¹ This note is based on SW Ch. 4 and 5 and some other notes. Throughout the note, many parts of the material are left incomplete to encourage the students to attend the class and read the book.
Figure 2
Figure 3
Figure 4
but it is difficult to express the CEF in a formal way. At this point, we want to know
the regression line, which is the best linear approximation of the nonlinear CEF. Check out
Figure 4, which replaces the CEF in Figure 3 with the regression line. See Figure 5 to have
a grasp of the relationship between the CEF and the regression line.
Unlike the CEF, the regression line is linear, so it expresses the relationship between the
class size and the test scores simply by 2 parameters, an intercept and a slope. Check out
Figure 6.
Our goal is to express these two parameters as a function of data, that is, to compute
the regression coefficients.
You want to quantify:

β_ClassSize = ΔTestScore / ΔClassSize = (change in TestScore) / (change in ClassSize)

⟹ ΔTestScore = β ΔClassSize.    (1.1)
Suppose that β = −2.28. What would be the predicted change in TestScore when you reduce
the class size by 2, 3, ..., n students?
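For instance, plugging a few class-size reductions into equation 1.1 (a direct computation with the β above):

```python
beta = -2.28   # slope from the example above

# Delta TestScore = beta * Delta ClassSize; a reduction makes Delta ClassSize negative.
for reduction in (2, 3, 4, 5):
    delta_test = beta * (-reduction)
    print(f"reduce class size by {reduction}: TestScore changes by {delta_test:+.2f}")
```

So reducing the class size by 2 students raises the predicted score by 4.56 points, and so on.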
Equation 1.1 defines the slope of a straight line relating test scores and class size.
TestScore = α + β · ClassSize    (1.2)
Figure 5
Figure 6
However, we should not interpret equation 1.2 as an identity. At best, we can interpret it
as a statement about a relationship that holds on average across the population of districts.
What is the relevant population of interest here?
TestScore = α + β · ClassSize + other factors,    (1.3)
where other factors include everything else that affects the test scores (examples?), including
luck.
We can generalize equation 1.3 to many different examples and use the following notation:
y_i = α + β x_i + u_i,  for i = 1, 2, ..., n.    (1.4)
• y : dependent variable.
• x : independent variable.
• α + βxi : population regression line.
• α and β: Parameters of this regression line (intercept and slope).
• u : error term which incorporates all of the factors responsible for the difference between
the ith observation’s average test score and the value predicted by the population
regression line.
Equation 1.4 is the linear regression model with a single regressor, in which Y is the
dependent variable and X is the independent variable or the regressor. If you knew the
value of X in this regression, you would predict that Y is α + βX. Explain with a picture.
Other examples:
• Traffic fatalities as a function of penalties.
• Consumption as a function of income.
1.1 Estimation
We must use data to estimate unknown α and β. A “hat” on a variable throughout this note
means that we are talking about the estimated value of that variable. What are (α̂, β̂, û, ŷ)?
1.1.1 The Ordinary Least Squares Estimator
Consider the solution to

(α̂, β̂) = argmin_{a,b} Σ_{i=1}^n [y_i − a − b x_i]²    (1.5)
We have 2 FOCs.
The FOC w.r.t. a is

0 = −2 Σ_{i=1}^n [y_i − a − b x_i]
  = (1/n) Σ_{i=1}^n [y_i − a − b x_i]
  = (1/n) Σ_{i=1}^n y_i − (1/n) Σ_{i=1}^n a − b (1/n) Σ_{i=1}^n x_i
  = ȳ − (1/n) n a − b x̄
  = ȳ − a − b x̄.

Solve for a to get

α̂ = ȳ − β̂ x̄    (1.6)

Notice that α̂ is the value of α that minimizes the sum of squared residuals.
The second FOC is found by taking the derivative of equation 1.5 w.r.t. b and setting this
to 0:
0 = −2 Σ_{i=1}^n [y_i − a − b x_i] x_i

⟹ 0 = Σ_{i=1}^n [y_i − (ȳ − b x̄) − b x_i] x_i
⟹ 0 = Σ_{i=1}^n [ỹ_i − b x̃_i] x_i
⟹ 0 = Σ_{i=1}^n [ỹ_i − b x̃_i] (x̃_i + x̄)
⟹ 0 = Σ_{i=1}^n [ỹ_i − b x̃_i] x̃_i + Σ_{i=1}^n [ỹ_i − b x̃_i] x̄
⟹ 0 = Σ_{i=1}^n [ỹ_i − b x̃_i] x̃_i + x̄ Σ_{i=1}^n ỹ_i − b x̄ Σ_{i=1}^n x̃_i
⟹ 0 = Σ_{i=1}^n [ỹ_i − b x̃_i] x̃_i
Note that, for any variable,

Σ_{i=1}^n z̃_i = Σ_{i=1}^n (z_i − z̄) = Σ_{i=1}^n z_i − n z̄ = n z̄ − n z̄ = 0.
Solving for β̂, one gets

β̂ = Σ_{i=1}^n x̃_i ỹ_i / Σ_{i=1}^n x̃_i²
  = [(1/(n−1)) Σ_{i=1}^n x̃_i ỹ_i] / [(1/(n−1)) Σ_{i=1}^n x̃_i²]
  = σ̂_xy / σ̂_xx
  = côv(X, Y) / v̂ar(X).
Note that one can take equation 1.4 and write it as

ỹ_i = β x̃_i + ũ_i,    (1.7)

set

β̂ = argmin_b Σ_{i=1}^n [ỹ_i − b x̃_i]²,

and get the same β̂.
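Equations 1.6 and 1.7 translate directly into code. Below is a minimal sketch on simulated data (the true α = 5 and β = 2 are arbitrary illustration values), cross-checked against a packaged least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(10, 3, n)
y = 5 + 2 * x + rng.normal(0, 1, n)                # true alpha = 5, beta = 2

x_t = x - x.mean()                                 # demeaned ("tilde") variables
y_t = y - y.mean()

beta_hat = (x_t * y_t).sum() / (x_t ** 2).sum()    # equation 1.7
alpha_hat = y.mean() - beta_hat * x.mean()         # equation 1.6

# The same numbers from a library least-squares fit:
b_check, a_check = np.polyfit(x, y, deg=1)
```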
The OLS predicted values Ŷ_i and residuals û_i are:

Ŷ_i = α̂ + β̂ X_i,  i = 1, ..., n
û_i = Y_i − Ŷ_i,  i = 1, ..., n

The OLS estimators are the sample counterparts of the population parameters. Let

TestScore-hat = 698.9 − 2.28 · ClassSize.    (1.8)
We say that an increase in ClassSize by 1 student is, on average, associated with a 2.28-point
decrease in TestScore. Suppose that in our sample, ClassSize ranges from 30 to 150. Can
you use these results to predict the TestScore when ClassSize is 5? Our sample does not
contain any observations with a ClassSize of 5, so it would not be reliable to predict the
TestScore of a student in such a small class based on our sample regression.
1.1.2 Properties of Residuals
1. u_i is the error, and

û_i = y_i − ŷ_i = y_i − α̂ − β̂ x_i ≠ u_i

is the residual.
2. Show that the sum of residuals is zero:

Σ_{i=1}^n û_i = Σ_{i=1}^n (y_i − ŷ_i)
= n · (1/n) Σ_{i=1}^n (y_i − ŷ_i)
= n (ȳ − ŷ̄)
= n [ȳ − (α̂ + β̂ x̄)]  (by definition of ŷ)
= n [ȳ − α̂ − β̂ x̄]  (because α̂ is a constant)
= n [ȳ − ȳ]  (by equation 1.6)
= 0

3. Show that ȳ = ŷ̄. Hint: it follows from the first F.O.C.
4. Show that residuals are orthogonal to the x's, i.e.,

Σ_{i=1}^n û_i x̃_i = Σ_{i=1}^n (ỹ_i − β̂ x̃_i) x̃_i
= Σ_{i=1}^n ỹ_i x̃_i − β̂ Σ_{i=1}^n x̃_i²
= Σ_{i=1}^n ỹ_i x̃_i − [Σ_{i=1}^n ỹ_i x̃_i / Σ_{i=1}^n x̃_i²] Σ_{i=1}^n x̃_i²
= Σ_{i=1}^n ỹ_i x̃_i − Σ_{i=1}^n ỹ_i x̃_i
= 0,

where

y_i = α + β x_i + u_i
ȳ = α + β x̄ + ū
ỹ_i = β x̃_i + ũ_i
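All of these residual properties hold for any OLS fit and can be checked numerically; a sketch on simulated data (illustrative coefficients):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
y = 1 + 0.5 * x + rng.normal(0, 2, 300)    # illustrative true coefficients

x_t = x - x.mean()
beta_hat = (x_t * (y - y.mean())).sum() / (x_t ** 2).sum()
alpha_hat = y.mean() - beta_hat * x.mean()
y_hat = alpha_hat + beta_hat * x
u_hat = y - y_hat

sum_resid = u_hat.sum()               # property 2: numerically zero
mean_gap = y_hat.mean() - y.mean()    # property 3: numerically zero
ortho = (u_hat * x_t).sum()           # property 4: numerically zero
```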
1.1.3 Goodness of Fit
Does our regressor account for much or for little of the variation in our dependent variable?
R2 provides an answer.
Are the observations tightly clustered around the regression line, or are they spread out?
The standard error of the regression provides an answer.
Let

Y_i = Ŷ_i + û_i,    (1.9)

which implies that

Σ_{i=1}^n (Y_i − Ȳ)² = Σ_{i=1}^n (Ŷ_i + û_i − Ȳ)²    (1.10)
= Σ_{i=1}^n (Ŷ_i − Ȳ)² + Σ_{i=1}^n û_i²,    (1.11)

where the left-hand side is the total sum of squares (TSS), the first right-hand term in (1.11) is the explained sum of squares (ESS), and the second is the sum of squared residuals (SSR),
because the explained and unexplained parts of Yi are orthogonal to each other. In words,
total variation in Y is equal to the sum of the explained variance and the residual variance.
R² is the ratio of the sample variance of Ŷ_i to the sample variance of Y_i:

R² = Σ_{i=1}^n (Ŷ_i − Ȳ)² / Σ_{i=1}^n (Y_i − Ȳ)² = ESS / TSS = 1 − SSR / TSS,    (1.12)

where SSR = Σ_{i=1}^n û_i².
0 ≤ R2 ≤ 1.
When is R2 = 1? When is R2 = 0?
The standard error of the regression (SER) is an estimator of the standard deviation of
the regression error ui . It is a measure of the spread of the observations around the regression
line. Because the regression errors u1 , u2 , ..., un are unobserved, the SER is computed using
their sample counterparts, the OLS residuals û1 , ..., ûn :
SER² = (1/(n−2)) Σ_{i=1}^n û_i² = SSR / (n − 2),    (1.13)
where the formula for the SER uses the fact that the sample average of the OLS residuals
is zero. Note that we lose 2 degrees of freedom for estimating α and β. What is the
unit of SER?
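Both fit measures follow mechanically from the residuals. A sketch (simulated data again, with illustrative coefficients):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(0, 1, n)
y = 2 + 3 * x + rng.normal(0, 1.5, n)    # illustrative true coefficients

x_t = x - x.mean()
beta_hat = (x_t * (y - y.mean())).sum() / (x_t ** 2).sum()
alpha_hat = y.mean() - beta_hat * x.mean()
u_hat = y - alpha_hat - beta_hat * x

tss = ((y - y.mean()) ** 2).sum()
ssr = (u_hat ** 2).sum()
r2 = 1 - ssr / tss                       # equation 1.12
ser = np.sqrt(ssr / (n - 2))             # equation 1.13: same units as y
```

In the univariate case R² also equals the squared sample correlation between x and y, which gives an easy cross-check.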
1.2 OLS assumptions required for unbiasedness and consistency
1. Conditional independence
We assume that E[u|x] = 0, i.e., how u is distributed among observations is independent
of which x value each observation has. We can prove that E[u|x] = 0 =⇒ E[u] = 0
and cov(x, u) = 0 (homework).
• Other factors captured by ui sometimes lead to better test scores and sometimes
lead to worse test scores. But on average, ui is zero.
• We assume random assignment of Xi , i.e., X is distributed independently of all
other personal characteristics. Is this a good assumption for the observational
data?
2. Random sampling
(Xi , Yi ), i = 1, ..., n are independently and identically distributed across observations.
Cov(ui , uj ) = 0, ∀i 6= j. If the observations are drawn by simple random sampling from
a single larger population, then the i.i.d. assumption holds. Think of some example of
i.i.d. and non i.i.d samples of a population.
3. No outliers
Var(u_i) = σ² < ∞. Large outliers are unlikely. Explain Figure 7.
Figure 7: Outliers
1.3 Finite Sample Properties of β̂
From now on, assume that xi , ∀i = 1, 2, ..., n are deterministic (i.e., we observe xi in our data
and treat them as deterministic variables instead of random ones.)
β̂ = Σ_{i=1}^n x̃_i ỹ_i / Σ_{i=1}^n x̃_i²
  = Σ_{i=1}^n x̃_i (β x̃_i + ũ_i) / Σ_{i=1}^n x̃_i²
  = β Σ_{i=1}^n x̃_i² / Σ_{i=1}^n x̃_i² + Σ_{i=1}^n x̃_i ũ_i / Σ_{i=1}^n x̃_i²
  = β + Σ_{i=1}^n x̃_i ũ_i / Σ_{i=1}^n x̃_i².
So now we have obtained a new expression for the OLS estimator, β̂. We will use this
expression to derive the finite sample distribution, i.e., the mean and the variance, of β̂.
This way, we can prove whether the OLS estimator has some desired properties such as
unbiasedness and efficiency.
1.3.1 OLS estimator is unbiased
E β̂ = E β + E[Σ_{i=1}^n x̃_i ũ_i / Σ_{i=1}^n x̃_i²]
    = β + Σ_{i=1}^n x̃_i E[ũ_i] / Σ_{i=1}^n x̃_i²
    = β.
=⇒ β̂ is an unbiased estimator of β.
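Unbiasedness is easy to illustrate by Monte Carlo: hold the x_i fixed, redraw the errors many times, and the average of β̂ across replications settles at the true β. All numbers below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
beta, sigma = 1.5, 2.0
x = np.linspace(0, 10, 50)                 # fixed (deterministic) regressors
x_t = x - x.mean()

beta_hats = []
for _ in range(5000):
    u = rng.normal(0, sigma, x.size)       # fresh error draws each replication
    y = 0.5 + beta * x + u
    beta_hats.append((x_t * (y - y.mean())).sum() / (x_t ** 2).sum())

print(np.mean(beta_hats))                  # close to 1.5, the true beta
```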
1.3.2 OLS estimator is efficient
Var(β̂) = E[β̂ − E β̂]²
= E[β̂ − β]²
= E[β + Σ_{i=1}^n x̃_i ũ_i / Σ_{i=1}^n x̃_i² − β]²
= E[Σ_{i=1}^n x̃_i ũ_i / Σ_{i=1}^n x̃_i²]²
= E[(Σ_{i=1}^n x̃_i ũ_i)²] / [Σ_{i=1}^n x̃_i²]²
= E[Σ_{i=1}^n Σ_{j=1}^n x̃_i x̃_j ũ_i ũ_j] / [Σ_{i=1}^n x̃_i²]²
= Σ_{i=1}^n Σ_{j=1}^n x̃_i x̃_j E[ũ_i ũ_j] / [Σ_{i=1}^n x̃_i²]²
= Σ_{i=1}^n x̃_i² σ² / [Σ_{i=1}^n x̃_i²]²
= σ² [Σ_{i=1}^n x̃_i²]^{−1}.    (1.14)
To see how we moved from the 6th line to the 8th line, suppose, for example, n = 3, and check
that
E[Σ_{i=1}^3 x̃_i ũ_i]² = E[(x̃_1 ũ_1 + x̃_2 ũ_2) + x̃_3 ũ_3]²
= E[(x̃_1 ũ_1 + x̃_2 ũ_2)]² + 2E[(x̃_1 ũ_1 + x̃_2 ũ_2) x̃_3 ũ_3] + E[x̃_3 ũ_3]²
= E[x̃_1 ũ_1]² + 2E[x̃_1 ũ_1 x̃_2 ũ_2] + E[x̃_2 ũ_2]² + 2E[(x̃_1 ũ_1 + x̃_2 ũ_2) x̃_3 ũ_3] + E[x̃_3 ũ_3]²
= x̃_1² E[ũ_1²] + 2 x̃_1 x̃_2 E[ũ_1 ũ_2] + x̃_2² E[ũ_2²] + 2E[x̃_1 ũ_1 x̃_3 ũ_3] + 2E[x̃_2 ũ_2 x̃_3 ũ_3] + E[x̃_3 ũ_3]²
= x̃_1² E[ũ_1²] + 2 x̃_1 x̃_2 E[ũ_1 ũ_2] + x̃_2² E[ũ_2²] + 2 x̃_1 x̃_3 E[ũ_1 ũ_3] + 2 x̃_2 x̃_3 E[ũ_2 ũ_3] + x̃_3² E[ũ_3²]
= x̃_1² σ_u² + 0 + x̃_2² σ_u² + 0 + 0 + x̃_3² σ_u²
= Σ_{i=1}^3 x̃_i² σ_u²,

since E[ũ_i²] = σ_u² and E[ũ_i ũ_j] = 0 for i ≠ j.
Note that

1. Var(β̂) declines as n increases.

2. Var(β̂) declines as variation in x increases.

3. Var(β̂) increases as σ² increases.
1.3.3 Small sample distribution of β̂

We have derived the mean and variance of β̂. If we also assume that u_i ∼ iid N(0, σ²), then, since

β̂ = β + Σ_{i=1}^n x̃_i ũ_i / Σ_{i=1}^n x̃_i²

is a linear combination of normal random variables, β̂ is normal:

⟹ β̂ ∼ N(β, σ² [Σ_{i=1}^n x̃_i²]^{−1}).
1.4 Large Sample (Asymptotic) Properties of β̂
Instead of assuming u_i ∼ iid N(0, σ²), assume n → ∞. Then, since β̂ − β is a weighted
average of u_i, i = 1, 2, ..., n, where each u_i has E u_i = 0,

E[β̂ − β] = 0,
Var[β̂ − β] = σ² [Σ_{i=1}^n x̃_i²]^{−1} → 0,

E[√n (β̂ − β)] = 0,
Var[√n (β̂ − β)] = n σ² [Σ_{i=1}^n x̃_i²]^{−1} = σ² [(1/n) Σ_{i=1}^n x̃_i²]^{−1},

(CLT) ⟹ √n (β̂ − β) ∼ N(0, σ² [(1/n) Σ_{i=1}^n x̃_i²]^{−1}).

In either case, since √n (β̂ − β) has a known distribution, one can construct hypothesis tests
involving β.
2 Hypothesis Tests and Confidence Intervals
Consider

H_0: β = β_0  vs.  H_A: β ≠ β_0.

Under the null hypothesis,

√n (β̂ − β_0) / (√n σ_β) ∼ N(0, 1),  where  σ_β = σ / √(Σ_{i=1}^n x̃_i²),

or

√n (β̂ − β_0) / σ̂_β ∼ t_{n−2},

where

σ̂_β² = s² / [(1/n) Σ_{i=1}^n x̃_i²],  s² = (1/(n−2)) Σ_{i=1}^n û_i².
Why n − 2?
We say that β is (statistically) significant if β̂ is significantly different from β0 = 0.
We can also construct confidence intervals for β which give us the interval that contains
the true value of β with some probability. A 95% confidence interval contains the true value
of β in 95% of all samples.
95% confidence interval for β = [β̂ − 1.96 σ̂_β̂ , β̂ + 1.96 σ̂_β̂]    (2.1)
What is the 95% confidence interval for β∆x?
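A sketch of the t statistic and the 95% interval on simulated data, testing H_0: β = 0 (the true β = 0.8 here is an arbitrary illustration value):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.normal(5, 2, n)
y = 1 + 0.8 * x + rng.normal(0, 1, n)          # true beta = 0.8

x_t = x - x.mean()
beta_hat = (x_t * (y - y.mean())).sum() / (x_t ** 2).sum()
alpha_hat = y.mean() - beta_hat * x.mean()
u_hat = y - alpha_hat - beta_hat * x

s2 = (u_hat ** 2).sum() / (n - 2)              # error-variance estimate
se_beta = np.sqrt(s2 / (x_t ** 2).sum())       # standard error of beta-hat

t_stat = (beta_hat - 0) / se_beta              # H0: beta = 0
ci = (beta_hat - 1.96 * se_beta, beta_hat + 1.96 * se_beta)   # equation 2.1
```

With |t| far above 1.96 and 0 outside the interval, H_0 is rejected at the 5% level.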
2.1 Regression when X is a binary variable
Example:

D_i = 1 if student i is at BoUn;  D_i = 0 if student i is not at BoUn.

Other examples: Female, Urban, CollegeGraduate, etc.
A binary variable is also called an indicator variable or a dummy variable.

Y_i = α + β D_i + u_i    (2.2)

We cannot think of β as a slope anymore (why?). Notice that

E[Y_i | D_i = 1] = α + β    (2.3)
E[Y_i | D_i = 0] = α    (2.4)
Figure 8: What the data look like with a binary regressor
So, we have

β = E[Y_i | D_i = 1] − E[Y_i | D_i = 0].    (2.5)
β is the difference between the sample averages of Y_i in the two groups. See Figure 8.
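Equation 2.5 is easy to verify numerically: the OLS slope on a dummy regressor reproduces the difference in sample means, and the intercept equals the D_i = 0 group mean. The simulated numbers below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
d = rng.integers(0, 2, n)                    # dummy: 1 if "treated", else 0
y = 10 + 3 * d + rng.normal(0, 2, n)         # true group gap of 3

d_t = d - d.mean()
beta_hat = (d_t * (y - y.mean())).sum() / (d_t ** 2).sum()
alpha_hat = y.mean() - beta_hat * d.mean()

diff_in_means = y[d == 1].mean() - y[d == 0].mean()
# beta_hat equals diff_in_means; alpha_hat equals the D=0 group mean.
```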
2.2 Heteroskedasticity and Homoskedasticity, Gauss-Markov Theorem, and the WLS
So far, what we have assumed about the error terms is that E[u|x] = 0. If we also assume
that V ar(u|x) = σ 2 , i.e., the variance of this conditional distribution does not depend on X,
then the errors are called homoskedastic. Otherwise, the error term is heteroskedastic.
Wage_i = α + β Female_i + u_i    (2.6)
Assuming homoskedasticity in this example amounts to assuming that “the variance of wages
is the same for men as it is for women.” Another example:
Wage_i = α + β Teacher_i + u_i    (2.7)
Figure 9: What the data look like with a binary regressor and heteroskedastic error terms
where Teacher_i = 1 if person i is a high school teacher in Turkey and Teacher_i = 0 for
all other occupations. Is assuming homoskedasticity plausible here? When the
homoskedasticity assumption fails, the OLS estimator remains unbiased, consistent, and
asymptotically normal. However, failure of the homoskedasticity assumption causes the
variance of the OLS estimator to go up. Thus, heteroskedasticity makes the OLS estimator
less efficient.
Indeed, the Gauss-Markov theorem states that, when the previously stated three
assumptions hold and if the errors are homoskedastic, then the OLS estimator β̂ is the Best
(most efficient) Linear Conditionally Unbiased Estimator ( BLUE).
Econometricians have developed methods for dealing with heteroskedastic standard errors. For example, if the nature of heteroskedasticity is known, i.e., if var(ui |Xi ) is known up
to a constant factor of proportionality, we can do weighted least squares (WLS) estimation, where the ith observation is weighted by the inverse of the square root of var(ui |Xi ).
The errors in this weighted regression become homoskedastic (show how), so the OLS procedure applied to the weighted data continues to be BLUE.
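A sketch of the reweighting idea, under the hypothetical assumption that var(u_i | X_i) = σ² X_i is known: dividing the whole equation by √X_i makes the transformed error homoskedastic, and OLS on the transformed data is the WLS estimator:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 2000
x = rng.uniform(1, 10, n)
u = rng.normal(0, 1, n) * np.sqrt(x)       # var(u | x) = x: heteroskedastic
y = 2 + 1.5 * x + u                        # true alpha = 2, beta = 1.5

# Weight each observation by 1 / sqrt(var(u|x)) = 1 / sqrt(x).
w = 1 / np.sqrt(x)
X_w = np.column_stack([w, x * w])          # transformed constant and regressor
y_w = y * w

# OLS on the transformed equation; its error u*w has constant variance 1.
alpha_wls, beta_wls = np.linalg.lstsq(X_w, y_w, rcond=None)[0]
```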