Econometrics - I
Manini Ojha
JSGP
Fall, 2020
Introduction - Econometrics
The term econometrics is attributed to Frisch; 1969 Nobel prize
co-winner
Haavelmo (1944) states “The method of econometric research aims,
essentially, at a conjunction of economic theory and actual
measurements, using the theory and technique of statistical inference
as a bridge pier.”
[Photos: Frisch and Haavelmo]
“[E]conometrics is by no means the same as economic statistics. Nor is it
identical with what we call general economic theory, although a
considerable portion of this theory has a definitely quantitative character.
Nor should econometrics be taken as synonymous with the application of
mathematics to economics. Experience has shown that each of these three
view-points, that of statistics, economic theory, and mathematics, is a
necessary, but not by itself a sufficient, condition for a real understanding
of the quantitative relations in modern economic life. It is the unification of
all three that is powerful. And it is this unification that constitutes
econometrics.”
Frisch, Econometrica, 1933, volume 1, pgs. 1-2.
Econometric methods
Useful in estimating economic relationships between variables
Theory usually suggests the direction of change in one variable when
another variable changes
Simplification of reality
Relationships b/w variables in practice are
not exact and one must incorporate stochastic elements into the model
Incorporating stochastic elements transforms the theory from one of
an exact statement to a probabilistic description about expected
outcomes
Probability or stochastic approach to econometrics dates to Haavelmo
(1944); 1989 Nobel Prize winner
1. Used to test simple economic theories (demand, supply, business cycles, individual decisions etc.)
2. Used to evaluate and implement govt and business policy
Effectiveness of government programs (job-training, mid-day meals, employment creation, skill-acquisition).
Determine effect on hourly wages, school performance, attendance etc.
Empirical analysis helps us make empirical judgements as opposed to
moral judgements
Empirical analysis uses data to test a theory or estimate a relationship
Data used can be experimental or non-experimental data
Experimental data is collected in a lab or field
Non-experimental is observational data. Researcher is a passive
collector
Steps in empirical/econometric analysis
1. Formulate an interesting question
2. Construct a formal economic model (especially for testing theories, e.g. a demand equation) OR rely on intuition/reasoning to construct it
3. Construct an econometric model
4. State hypotheses of interest in terms of unknown parameters
5. Obtain relevant data for analysis
6. Use econometric methods to estimate the parameters in the model and test hypotheses
Empirical/econometric analysis
Econometric analysis begins where economic theory concludes
An econometric model starts with a statement of a theoretical
proposition
Goal is to test this proposition through the statistical analysis of data
Econometrics requires: theory =⇒ data =⇒ methods
Example
You are a labor economist examining the effects of a job training
program on worker’s productivity. Simple intuition/reasoning suggests
that factors like education, experience, training would affect
productivity.
wage here is the dependent variable (y)
educ, exper, training the control/explanatory/independent variables (x's).
Workers are paid wages according to their productivity
From economic analysis to econometric analysis
Simple reasoning such as this leads to a model:
wage = f(educ, exper, training)    (1)
wage = β0 + β1 educ + β2 exper + β3 training + u    (2)
where
wage = hourly wage
educ = years of formal education
exper = years of workforce experience
training = weeks spent in job training
A specific functional form must be specified for econometric analysis
From Eqn. (1) to Eqn. (2)
There may be several unobservables affecting wages of a worker that
we cannot measure
Incorporate stochastic elements/unobserved factors: u
Add other variables to Eqn. (2) like family income, parent’s education,
age, etc. to reduce u but can never eliminate entirely
Other factors like "innate ability", "quality of education", "family background" etc. that influence a person's wage are included in u
Dealing with this error term or disturbance term u is the most important component of econometric analysis
In Eqn. (2):
β0, β1, β2, β3 are parameters of the econometric model
Describe the directions and strengths of relationship between wage and
the factors used to determine wage/productivity
If the question of interest is how training affects wage, we are
particularly interested in β3
β3 is then referred to as the parameter of interest
Hypotheses of interest can then be stated in terms of unknown
parameter:
β3 = 0
Hypothesizing that job training has no effect on wages
Now, we proceed to obtain relevant data on education, experience,
training, wages
Then estimate the parameters in the model specified in Eqn. (2)
Formally test our hypothesis of interest
Types of data
Cross-sectional data
Time series data
Pooled-cross-sections
Panel/longitudinal data
Cross-section data
Cross-sectional data consists of data on multiple agents or a sample
of individuals, households, cities, states, countries, firms or a variety of
other units at a single point in time t = t0 .
{xi, yi} for i = 1, ..., N
x = years of schooling, y = wage
In pure cross-section, ignore any minor differences in timing of survey
(e.g. households interviewed on different days or weeks)
Assume that data have been obtained by a random sampling from the
underlying population
e.g. information on wages, education, experience etc. has been obtained from a random draw of 500 people from the working population
Obs   wage   educ   exper   female   married
1     190    7      2       1        0
2     75     11     7       0        1
3     100    8      41      0        0
...   ...    ...    ...     ...      ...
555   80     16     15      0        1
The table contains data on 555 individuals and their wages, educ
(number of years of education), exper (number of years of labour
force experience), female (binary indicator for gender), married (binary
indicator for marital status).
The matrix of cross-section data can be represented as
x11  x12  x13  ......  x1N
x21  x22  x23  ......  x2N
x31   .    .   ......   .
 .    .    .   ......   .
xN1  xN2  xN3  ......  xNN
Time series data
Time series data is data on a single agent at multiple points in time
{xt, yt} for t = 1, ..., T
x = real interest rate, y = investment
Example: stock prices, GDP, inflation, unemployment, homicide rates,
etc
Here, past events can influence future events
Thus, time series data is rarely ever assumed to be independent across
time
Special attention given to data frequency here
GDP recorded quarterly
Inflation recorded monthly
Stock prices recorded daily
Unemployment recorded monthly
Monthly, daily, quarterly, etc show strong seasonal patterns that are
important considerations in time series data
The matrix of time series data would look like
x11  x12  x13  ......  x1N
x21  x22  x23  ......  x2N
x31   .    .   ......   .
 .    .    .   ......   .
xt1  xt2  xt3  ......  xtN
 .    .    .   ......   .
xT1  xT2  xT3  ......  xTN
Table below shows a time series data on minimum wages and GNP (in
millions of 1954 dollars) in Puerto Rico [Castillo-Freeman and
Freeman (1992)]
Obs   year   avgmin   prunemp   prgnp
1     1950   0.20     15.4      878.7
2     1951   0.21     16.0      925.0
3     1952   0.23     14.8      1015.9
...   ...    ...      ...       ...
40    1987   3.35     16.8      4496.7
Panel/longitudinal data
Panel data is data on multiple agents at multiple points in time
{xit, yit} for i = 1, 2, ..., N; t = 1, 2, ..., T
x = income inequality in country i at time t; y = growth rate
A panel consists of a time series for each cross-sectional member in
the data set.
For example
wage, educ, exper for a same set of individuals over a period of 10 years
investment or financial data for a same set of firms over a period of 5
years
Note: panel follows the same cross-sectional units over time
i.e. follows the same set of households over time
The table below shows a two year panel on crime and related statistics
for 150 cities in the US
Pooled cross-section
The difference between pooled cross-section and panel data is that the
agents in each cross sections can differ
Also referred to as repeated cross-sections
e.g. 2 cross-sectional data of household surveys in India (NSS)
not the same households
Let us look at data on effect of property taxes on house prices
a random sample of house prices in 1993
a new random sample of house prices in 1995
Pooled cross-section
Observations 1 to 250 correspond to houses sold in 1993 and 251 to
520 correspond to houses sold in 1995
Pooled cross-section versus panel data
Panel data requires replication of the same units/agents/individuals/households over time whereas pooled/repeated cross-sections do not require the same agents
Thus, panel data, especially on households, individuals, firms, etc., are more difficult to obtain
e.g. IHDS for India
Observing the same units over time leads to advantages over
cross-sectional or pooled cross-sectional data
multiple observations on the same units allows us to control for
unobserved characteristics of individuals, firms or households etc
aids in causal inferences
allows us to study the importance of lags in behaviour (many economic
policies have an impact only after some time has passed)
Causality and Ceteris Paribus
Most econometric analysis concerns itself with tests of economic
theory or evaluation of policy
goal is usually to infer the effect of one variable on another
does education have a causal effect on wages?
finding an association between variables is only suggestive, but establishing causality is compelling
Recall the notion of ceteris paribus: effect of one variable on another
holding everything else constant/‘other things equal’
effect of changing the price of a good on its quantity demanded while
holding other factors like income, prices of other goods, tastes etc fixed
Notion of causality is similar
critical for policy analysis
does one week of job training, while all other factors are held constant,
improve worker’s productivity and in-turn wages?
if we succeed in holding all other relevant factors fixed, we can find a
causal impact of job training on wages
This is a difficult task
Key question in most econometric/empirical research is thus
Have enough other factors been held fixed to make a case for
causality?
Example
Measuring returns to education
Economists are often interested in answering the question
If a person is chosen from a population and given one more year of
education, by how much will his/her wage increase?
Implicit assumption here is: holding everything else (family background,
intelligence etc.) constant
Problem
Experiment:
Choose a group of people, randomly assign different amounts of
education to them (infeasible!), compare wages
If levels of education are assigned independently of other factors, then an experiment that ignores these other factors will still yield useful results
Non-experimental data
Problem in non-experimental data for a large sample is that people
choose their levels of education
This means education levels are not determined independently of other factors that affect wages
eg: people with higher innate ability choose higher levels of education.
Higher innate ability leads to higher wages, thus there is a correlation
between education levels and a critical factor that affects wages.
Another problem is omitted variables
Difficult to measure a lot of factors that may affect wages, like innate ability. Thus ceteris paribus in the true sense is difficult to justify.
Classical Linear Regression Model (CLRM)
Interested in the relationship between two variables, Y and X
Scatterplots are typically a good way of examining the distribution
(dbn), F(X,Y)
Definition: A scatterplot is a plot with each data point appearing once. In
econometrics, the variable on the y-axis is the dependent variable and the variable
on the x-axis is the independent variable.
Scatterplots are informative, but messy to present.
An alternative is to summarize the information contained in the
scatterplot
Can be done by finding the ‘best’ line in terms of fitting the data, and
then reporting the intercept and the slope of the line
Most common means of accomplishing this is known as regression
analysis
Terminology: simple regression
Y: Dependent variable, Explained variable, Response variable, Predicted variable, Regressand
X: Independent variable, Explanatory variable, Control variable, Predictor variable, Regressor, Covariate
An Aside:
The term regression was originated by Galton (1886)
Sir Francis Galton also introduced the concepts of correlation,
standard deviation and bivariate normal dbn
Simple Linear Regression
We are interested in the relationship between x and y
Explaining y in terms of x or
Studying how y varies with changes in x
Questions:
Since there is never an exact relationship between two variables, how
do we allow for other factors to affect y ?
What is the functional relationship between y and x?
How can we be sure we are capturing a ceteris paribus relationship
between y and x (if that is the goal)?
Resolve the ambiguities by writing down an equation relating y to x:
y = β0 + β1 x + u    (3)
This equation is assumed to hold in the population of interest and defines the simple linear regression (SLR) model
u is the error/disturbance term reflecting factors other than x that
affect y
treated effectively as unobserved
SLR model is a population model
When it comes to estimating β0 and β1 ,
we use a random sample of data and
we must restrict how u and x are related
Example of simple linear regression
The following is a model relating a person’s wage to observed
education level and other unobserved factors
wage = β0 + β1 educ + u
wage is measured in dollars per hour
educ is measured in years of education
β1 measures the change in hourly wage given another year of
education holding all other factors fixed
Linearity of Eqn.3 =⇒ each unit increase in x has the same effect on
y regardless of where x starts from
may not be realistic (allow for increasing returns ... later)
Does the model in Eqn. 3 really lead to ceteris paribus conclusions
about how x affects y ?
Need to make certain assumptions to estimate a ceteris paribus effect
Assumptions of CLRM
Mean of the unobserved factors is zero in the population
E (u) = 0
For wage-education example, this means we are assuming that things
such as average ability are zero in the population of all working people
u and x are uncorrelated
E (u|x) = E (u)
Average value of u does not depend on the value of x and is equal to
average of u over the entire population
Or u is mean independent of x
For wage-education example, this implies that the average ability for a
group of people from the population with 8 years of education is the
same as the average ability for a group of people from the population
with 16 years of education
Combining both
E (u|x) = 0
Called zero conditional mean assumption
Implication:
y = β0 + β1 x + u
E(y|x) = β0 + β1 x + E(u|x)
E(y|x) = β0 + β1 x    (PRF)
which shows that E (y |x) is a linear function of x
=⇒ 1 unit increase in x changes the expected value of y by the
amount β1
Tells us how the average value of y changes with x, not how y changes with x for all units of the population!
Example
Suppose that x is high school GPA and y is college GPA
We are given that
E (colGPA|hsGPA) = 1.5 + 0.5hsGPA
Suppose hsGPA = 3.6, then
E (colGPA|hsGPA) = 1.5 + 0.5(3.6) = 3.3
=⇒ Avg. colGPA for all high school graduates who attend college with a high school GPA of 3.6 is 3.3
Does not mean every student with hsGPA = 3.6 will have college GPA
of 3.3
Some will have 3.3, some will have more, some less
On average, colGPA = 3.3
Going back to Eqn. 3 (y = β0 + β1 x + u) with this information, we can divide Eqn. 3 into 2 parts
y = E(y|x) + u = β0 + β1 x + u    (4)
Systematic part: β0 + β1 x
represents E (y |x) which is the part of y explained by x
Stochastic/unsystematic part: u
part of y not explained by x
From PRF to SRF
Analogous to population regression function (PRF) is the concept of
sample regression function (SRF) which represents the sample
regression line
Let us take a random sample of size n from the population
Let the random sample be {(xi , yi ) : i = 1, ......, n}, then we can
express the sample counterpart of E (y |x) in Eqn. 4 as:
ŷi = β̂0 + β̂1 xi    (5)
ŷi is the estimator of E (y |x); β̂0 and β̂1 are estimators of β0 and β1
respectively
Similarly: ûi can be regarded as an estimate of ui
Our primary objective in regression analysis is to estimate the SRF
because more often than not analysis is based on a sample from the
population
because of sampling fluctuations, this estimate is at best an
approximate one
For x = xi, in terms of the sample regression, the observed yi can be expressed as
yi = ŷi + ûi    (6)
In terms of the population, it can be expressed as
yi = E(y|xi) + ui
From Eqn. 6
ûi = yi − ŷi = yi − β̂0 − β̂1 xi    (7)
Residuals ûi are thus the difference between the actual and estimated y values
To achieve our goal of choosing the best estimates, let us say we choose the SRF such that the sum of the residuals Σ ûi = Σ (yi − ŷi) is as small as possible
Intuitively appealing but not a very good criterion
If we minimize Σ ûi, we give equal importance to all residuals no matter how close or widely scattered the individual observations are from the SRF
it is quite possible from the scatterplot above that the algebraic sum is zero
CLRM: Ordinary Least Squares Estimates (OLS)
OLS refers to a choice of parameters β0 and β1 in Eqn.3
Question: How does one estimate parameters β0 and β1 ?
Answer: Minimize the distance b/w the data points and the line
OLS
We use the least-squares criterion/ordinary least squares (OLS) and minimize the sum of squared residuals
Σ ûi² = Σ (yi − β̂0 − β̂1 xi)²
Define S = Σ ûi²
OLS implies
min over β̂0, β̂1 of S  ⇒  ∂S/∂β̂0 = ∂S/∂β̂1 = 0
Do it!
OLS
For β̂0*:
∂S/∂β̂0 = ∂/∂β̂0 [ Σ_{i=1}^n (yi − β̂0 − β̂1 xi)² ]
implies
−2 Σi (yi − β̂0 − β̂1 xi) = 0
Σi yi − Σi β̂0* − β̂1 Σi xi = 0
n ȳ − n β̂0* − β̂1 n x̄ = 0
implies
β̂0* = ȳ − β̂1 x̄    (8)
where ȳ is the sample average of yi and likewise for x̄.
OLS
For β̂1*:
∂S/∂β̂1 = ∂/∂β̂1 [ Σ_{i=1}^n (yi − β̂0 − β̂1* xi)² ]
        = ∂/∂β̂1 [ Σ_{i=1}^n (yi − ȳ + β̂1* x̄ − β̂1* xi)² ]
        = ∂/∂β̂1 [ Σ_{i=1}^n ((yi − ȳ) − β̂1*(xi − x̄))² ]
implies
−2 Σi [(yi − ȳ) − β̂1*(xi − x̄)](xi − x̄) = 0
known as the least squares normal equation
β̂1* = Σi (yi − ȳ)(xi − x̄) / Σi (xi − x̄)²    (9)
Eqn. 9 gives us the slope parameter and using the slope estimate, it is straightforward to obtain the intercept estimate.
Note that
β̂1* = Σi (yi − ȳ)(xi − x̄) / Σi (xi − x̄)²
    = sample covariance between x and y / sample variance of x
    = cov(xi, yi) / var(xi)
∴ If x and y are positively correlated in the sample, β̂1 > 0; if x and y are negatively correlated, then β̂1 < 0
Thus, Eqn. 8 and Eqn. 9 give us the OLS estimates
The name OLS (ordinary least squares) comes from the fact that we
obtain these estimates by minimizing the sum of squared residuals
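A minimal numerical sketch of Eqn. 8 and Eqn. 9 in code (the simulated sample and the "true" coefficients 1.0 and 0.8 below are illustrative assumptions, not lecture data):

    import numpy as np

    # Illustrative simulated sample: x = years of schooling, y = hourly wage
    rng = np.random.default_rng(0)
    n = 500
    x = rng.integers(8, 18, size=n).astype(float)
    u = rng.normal(0.0, 2.0, size=n)
    y = 1.0 + 0.8 * x + u                      # assumed population line for the simulation

    # Eqn. 9 (slope) and Eqn. 8 (intercept)
    beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0_hat = y.mean() - beta1_hat * x.mean()
    print(beta0_hat, beta1_hat)                # should land close to 1.0 and 0.8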
Once we have determined the OLS intercept and slope estimates, we form the OLS regression line
ŷ = β̂0 + β̂1 x    (10)
β̂0 is the predicted value of y when x = 0
(although in some cases, it will not make sense to set x = 0)
In most cases, we can write the slope estimate as
β̂1 = Δŷ / Δx
This is usually of primary interest and tells us how ŷ changes when x increases by a unit
Terminology - regression
When we run an OLS regression, without writing out the equation, we
say we
run the regression of y on x
regress y on x
Properties of OLS statistics
Sum and thus the sample average of OLS residuals is zero
Σ_{i=1}^n ûi = 0    (11)
Sample covariance between the regressor and the OLS residuals is zero
Σ_{i=1}^n xi ûi = 0    (12)
The point (x̄, ȳ) is always on the OLS regression line.
ȳ = β̂0 + β̂1 x̄
OLS Implication
OLS formula for β̂1∗ uses the optimal choice of β̂0∗
β̂0∗ = ȳ − β̂1∗ x̄
If xi = x̄ ∀i, implying that var(x) = 0, then β̂1* is undefined. Thus, var(x) ≠ 0 is an identification condition
Sample average of the fitted values ŷi is the same as the sample average of the true yi (follows from Eqn. 6)
ȳ = (1/n) Σi ŷi
Terminology
Total sum of squares: measure of total sample variation in yi
SST = Σ_{i=1}^n (yi − ȳ)²
Explained sum of squares: measure of sample variation in ŷi (given that ȳ is also the sample average of ŷi)
SSE = Σ_{i=1}^n (ŷi − ȳ)²
Residual sum of squares: measure of sample variation in ûi
SSR = Σ_{i=1}^n ûi²
Can write SST = SSE + SSR
Proof: decompose yi for each observation into 2 components (explained and unexplained)
yi − ȳ = (yi − ŷi) + (ŷi − ȳ)
Unexplained part is the residual ûi = yi − ŷi
With some algebra
Σ (yi − ȳ)² = Σ [(yi − ŷi) + (ŷi − ȳ)]²
            = Σ [ûi + (ŷi − ȳ)]²
            = Σ ûi² + Σ (ŷi − ȳ)² + 2 Σ ûi (ŷi − ȳ)
            = SSR + SSE + 2 Σ ûi (ŷi − ȳ)    (13)
SST = SSR + SSE    (14)
∵ Σ ûi (ŷi − ȳ) = 0
This also implies
1 = Σ (ŷi − ȳ)² / Σ (yi − ȳ)²  +  Σ ûi² / Σ (yi − ȳ)²
1 = SSE/SST + SSR/SST
Coefficient of determination, R² of the regression is defined as
R² = SSE/SST = 1 − SSR/SST    (15)
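A short numerical check of the SST = SSE + SSR decomposition and Eqn. 15 (the simulated data are an illustrative assumption):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(10, 2, size=200)
    y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)    # assumed population line

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x
    u_hat = y - y_hat

    SST = np.sum((y - y.mean()) ** 2)
    SSE = np.sum((y_hat - y.mean()) ** 2)
    SSR = np.sum(u_hat ** 2)
    print(np.isclose(SST, SSE + SSR))     # decomposition holds
    print(SSE / SST, 1 - SSR / SST)       # two equal ways to compute R-squared (Eqn. 15)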
This is the fraction of total variation that is explained by the model
i.e. fraction of variation in y that is explained by x
R 2 is always between 0 and 1 because SSE can be no greater than
SST
R 2 = 1 means all data points lie on the same line and OLS provides a
perfect fit to the data
R 2 = 0 means a poor fit of the line
Interpreted usually by multiplying by 100 to change into percentage
terms
Examples
CEO salary and return on equity
salary^ = 963.191 + 18.501 roe
n = 209, R² = 0.0132
Here, the regression only explains 1.3% of the total variation in CEO's salary
Voting outcomes and campaign expenditures
voteA^ = 26.81 + 0.464 shareA
n = 173, R² = 0.856
Here, the regression explains 85.6% of the total variation in election outcomes
Note: High R² does not necessarily mean the regression has a causal interpretation. Low R² is in fact quite common in social sciences, especially for cross-sectional analysis
Functional Form
Common specifications of the functional forms to incorporate non-linearities look like:
Level specification:     y = β0 + β1 x + u             Δy = β1 Δx
Log-level specification: log(y) = β0 + β1 x + u        %Δy = (100 β1) Δx
Log-log specification:   log(y) = β0 + β1 log(x) + u   %Δy = β1 %Δx
Log-level specification
Regression of log wages on years of education
log(wage) = β0 + β1 educ + u
Here, the interpretation of the regression coefficient is
β1 = ∂log(wage)/∂educ = (1/wage) · ∂wage/∂educ = (∂wage/wage)/∂educ
i.e. %Δ in wages from a 1 unit (year) increase in education (semi-elasticity)
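A small sketch of the log-level interpretation in code; the 0.08 semi-elasticity and the simulated data are assumptions for illustration, not estimates from any real dataset:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1000
    educ = rng.integers(8, 18, size=n).astype(float)
    log_wage = 0.5 + 0.08 * educ + rng.normal(0.0, 0.3, size=n)   # assumed semi-elasticity of 0.08

    b1 = np.sum((educ - educ.mean()) * (log_wage - log_wage.mean())) / np.sum((educ - educ.mean()) ** 2)
    print(100 * b1)   # approx. % change in wage per extra year of education (should be near 8)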
Log-log specification
Regression of CEO salary on firm sales
log(salary) = β0 + β1 log(sales) + u
Here, the interpretation of the regression coefficient is
β1 = ∂log(salary)/∂log(sales) = (∂salary/salary)/(∂sales/sales)
i.e. %Δ in salary from a 1% increase in sales (constant elasticity)
Gauss-Markov assumptions for SLR
(SLR.1) Linearity in parameters: in the population, y is related to x
and u as
y = β0 + β1 x + u
where β0 , β1 are the population intercept and population slope parameters
respectively
(SLR.2) Random sampling: we have a random sample of size
n,{(xi , yi ) : i = 1, 2, ..., n} following the population model above
In terms of random sample, above equation can be written as
yi = β0 + β1 xi + ui
(SLR. 3) Sample variation in the explanatory variable: the sample
outcomes on x, namely {xi , i = 1, 2, ...., n}, are not all the same value
This is a weak assumption - only says that if x varies in the population,
then x varies in the random sample as well unless population variation is
minimal or sample is small.
(SLR. 4) Zero conditional mean:E (u|x) = 0. This assumption
coupled with random sampling implies
E (ui |xi ) = 0 for all i
i.e. the value of the explanatory variable must contain no information
about the mean of the unobserved factors
Note:
SLR.4 in conjunction with SLR.3 allows for a technical simplification. In
particular, we can derive the statistical properties of the OLS estimators as
conditional on the values of xi in our sample.
Technically, in statistical derivations, conditioning on the sample values of the independent variables x's is the same as treating the xi as fixed in repeated samples.
Unbiasedness of OLS estimators
What qualities do OLS estimators possess?
Recall that we briefly touched upon this last semester in Stats- I
Using SLR. 1 - SLR. 4, we can show that the OLS estimators are
unbiased
E (β̂0 ) = β0 ; E (β̂1 ) = β1
Proof
β̂1 = Σi (yi − ȳ)(xi − x̄) / Σi (xi − x̄)² = Σi yi (xi − x̄) / Σi (xi − x̄)²
   = Σi (xi − x̄)(β0 + β1 xi + ui) / Σi (xi − x̄)²
Numerator can be written as
Σi (xi − x̄) β0 + Σi (xi − x̄) β1 xi + Σi (xi − x̄) ui
= β0 Σi (xi − x̄) + β1 Σi (xi − x̄) xi + Σi (xi − x̄) ui
= β1 Σi (xi − x̄)² + Σi (xi − x̄) ui
Combining the numerator and denominator:
β̂1 = β1 Σi (xi − x̄)² / Σi (xi − x̄)²  +  Σi (xi − x̄) ui / Σi (xi − x̄)²
   = β1 + Σi (xi − x̄) ui / Σi (xi − x̄)²
∴ E(β̂1) = β1 + E[ Σi (xi − x̄) ui / Σi (xi − x̄)² ]
E(β̂1) = β1 + [ Σi (xi − x̄) / Σi (xi − x̄)² ] E(ui)
E(β̂1) = β1    (16)
Proof for β0 is now straightforward
E(β̂0) = E[ȳ − β̂1 x̄]
      = E[β0 + β1 x̄ + ū − β̂1 x̄]
      = β0 + β1 x̄ − x̄ E[β̂1]    (since E(ū) = 0)
      = β0 + β1 x̄ − x̄ β1
E(β̂0) = β0    (17)
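A Monte Carlo sketch of what unbiasedness means: across many random samples drawn from an assumed population model, the OLS slope estimates average out to the true slope (the population values 1.0 and 0.5 below are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(42)
    beta0, beta1 = 1.0, 0.5            # assumed population parameters
    n, reps = 100, 5000

    b1_draws = np.empty(reps)
    for r in range(reps):
        x = rng.normal(5, 2, size=n)
        u = rng.normal(0, 1, size=n)   # E(u|x) = 0 by construction (SLR.4)
        y = beta0 + beta1 * x + u
        b1_draws[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

    print(b1_draws.mean())             # close to 0.5: the sampling distribution is centered at beta1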
Gauss-Markov assumptions for SLR
So we now know that sampling distribution of β̂1 is centered around
β1 i.e. β̂1 is unbiased
What can we say about how far we can expect β̂1 to be from β1 on
average?
This helps us choose the best estimator among a broad class of
unbiased estimators
We will work with the measure of spread of the estimators - variance
(SLR. 5) Homoskedasticity: variance of the unobservable u conditional on x is constant. Also known as the constant variance assumption
var(u|x) = σ²    (18)
i.e. the error has the same variance given any value of the explanatory variable. It is often called the error variance or disturbance variance
Standard deviation of the error = √σ² = σ
For a random sample we say
var(ui) = σ² ∀i
If the error variance is dependent on x, then we say the error term exhibits heteroskedasticity
Var(u|x) = σx²
Read JW for more on homoskedasticity. Section 2-5b, JW 6th ed. HW!
Homoskedasticity assumption plays no role in showing that the
estimators β̂0 and β̂1 are unbiased
This assumption tells us about efficiency of the estimators
Var(u|x) = E(u²|x) − [E(u|x)]²
∵ E(u|x) = 0
Var(u|x) = E(u²|x) = σ²
Larger σ means that the distribution of the unobservables affecting y
is more spread out
Also, given homoskedasticity: V (y |x) = V (u|x) = σ 2 (Convince
yourself of this!)
Sampling variance of OLS estimators
Sampling variance of β̂1 is:
Var(β̂1) = σ² / Σi (xi − x̄)² = σ² / SSTx    (19)
And,
Var(β̂0) = σ² n⁻¹ Σi xi² / Σi (xi − x̄)² = σ² Σi xi² / (n · SSTx)    (20)
Proof of Eqn. 19:
Recall that Var(R) = E[R²] − (E[R])² = E[(R − E[R])²]
Var(β̂1) = E[(β̂1 − β1)²] since β̂1 is unbiased
Start with (β̂1 − β1)² = ?
β̂1 = Σi yi (xi − x̄) / Σi (xi − x̄)² = Σi (xi − x̄)(β0 + β1 xi + ui) / Σi (xi − x̄)²
   = [ β1 Σi (xi − x̄)² + Σi (xi − x̄) ui ] / Σi (xi − x̄)²
   = β1 + Σi (xi − x̄) ui / Σi (xi − x̄)²
(β̂1 − β1)² = [ Σi (xi − x̄) ui / Σi (xi − x̄)² ]²
Now, take expectations
(β̂1 − β1)² = [ Σi (xi − x̄) ui / Σi (xi − x̄)² ]²
E[(β̂1 − β1)²] = E[ ( Σi (xi − x̄) ui )² / ( Σi (xi − x̄)² )² ]
Var(β̂1) = 1/[ Σi (xi − x̄)² ]² · E[ Σi (xi − x̄)² ui² ]    (cross terms drop out since the errors are uncorrelated across observations)
Var(β̂1) = [ Σi (xi − x̄)² / ( Σi (xi − x̄)² )² ] · E(ui²)
Var(β̂1) = σ² / Σi (xi − x̄)² = σ² / SSTx
Implication of variance of estimators:
If σ 2 ↑ =⇒ Var (β̂1 ) ↑
larger the error variance, larger the Var (β̂1 )
more variation in the unobservables affecting y makes it more difficult
to precisely estimate β1
var (x) ↓=⇒ Var (β̂1 ) ↑
we can learn more about the relationship between y and x if x is more
dispersed and there is less ‘noise’ in the relationship
Theorem - Gauss-Markov
In the class of linear, unbiased estimators of β, β̂OLS has the smallest variance. In other words, if there exists an alternative linear, unbiased estimator, say β̃, then
Var(β̂OLS) ≤ Var(β̃)
∴ OLS is BLUE (Best Linear Unbiased Estimator)
Errors and residuals
Emphasizing the difference between errors and residuals as it is crucial
for estimating σ 2
Population model in terms of a random sample
yi = β0 + β1 xi + ui
where ui is the error for observation i
Expressing yi in terms of fitted value and residuals
yi = β̂0 + β̂1 xi + ûi
Comparing these 2 equations
errors show up in the equation containing population parameters
residuals show up in the estimated equation
errors are never observed while residuals are computed from the data
We do not observe the errors ui, but we have estimates of ui, namely the residuals ûi.
Unbiased estimator of σ² is denoted σ̂²:
E(σ̂²) = σ²
where
σ̂² = (1/(n−2)) Σ_{i=1}^n ûi² = SSR/(n−2)    (21)
where the degrees of freedom is n − 2
degrees of freedom: total number of observations in the sample less the number of independent restrictions put on them
Standard error of regression (SER)/root mean squared error
σ̂ = √σ̂²
σ̂ is also used to estimate the standard deviations of β̂0 and β̂1
Var(β̂1) = σ²/SSTx    (22)
sd(β̂1) = σ/√SSTx
∴ se(β̂1) = σ̂/√SSTx    (23)
Eqn. 23 is the standard error of β̂1. This gives us an idea of how precise β̂1 is.
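A sketch of Eqn. 21-23 in code, again on simulated data (all numbers and the population line are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 300
    x = rng.normal(0, 3, size=n)
    y = 4.0 - 1.5 * x + rng.normal(0, 2, size=n)   # assumed population line, sd(u) = 2

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    u_hat = y - b0 - b1 * x

    SSR = np.sum(u_hat ** 2)
    sigma2_hat = SSR / (n - 2)                     # Eqn. 21, df = n - 2
    SST_x = np.sum((x - x.mean()) ** 2)
    se_b1 = np.sqrt(sigma2_hat) / np.sqrt(SST_x)   # Eqn. 23
    print(sigma2_hat, se_b1)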
More assumptions
(SLR. 6.) Each error is normally distributed
ui |xi ∼ N(0, σ 2 )
This assumption is needed when deriving small sample sampling
distributions of OLS estimators β̂0 and β̂1 and that of the t-statistics
used in hypothesis tests of β0 and β1
Random errors ui and uj for 2 different observations i and j are uncorrelated with each other. That is
E[ui uj] = 0 for all i and j, i ≠ j
Also called spherical errors/disturbances.
Regression through the origin
Special case: model without an intercept
ỹi = β̃1 xi    (24)
Eqn. 24 is also called regression through the origin because the line passes through x = 0, ỹ = 0
Obtain the slope estimate using OLS
min over β̃1 of Σ_{i=1}^n (yi − β̃1 xi)²
∂/∂β̃1 [ Σ_{i=1}^n (yi − β̃1 xi)² ] = −2 Σ_{i=1}^n xi (yi − β̃1 xi)
=⇒ Σ_{i=1}^n xi (yi − β̃1 xi) = 0
β̃1 = Σ_{i=1}^n xi yi / Σ_{i=1}^n xi²
Comparing β̂1 and β̃1: the two estimates are the same when x̄ = 0.
Prove! [Hint: substitute x̄ = 0 in Eqn. 9]
Covariance and correlation
Recall, sign of slope estimate
β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² = Cov(xi, yi) / Var(xi)
Furthermore,
ρxy = correlation coefficient between x and y
    = Cov(x, y) / (√Var(x) √Var(y))
    = β̂1 √Var(x) / √Var(y)
∴ β̂1 = ρxy √Var(y) / √Var(x)
sgn{β̂1} = sgn{ρxy}
β̂1 ∝ ρxy
Coefficient of determination and correlation coefficient
In CLRM, R² = ρ²xy
Proof:
SSE = Σ (ŷi − ȳ)² = Σ (β̂0 + β̂1 xi − (β̂0 + β̂1 x̄))²
    = β̂1² Σ (xi − x̄)²
R² = SSE/SST = β̂1² Σ (xi − x̄)² / Σ (yi − ȳ)²
   = [Cov(x, y)/Var(x)]² · Var(x)/Var(y)
   = Cov(x, y)² / (Var(x) Var(y))
   = [ Cov(x, y) / (√Var(x) √Var(y)) ]²
   = ρ²xy
Wrap-up
OLS Estimation
Given a random sample {yi, xi} for i = 1, ..., n, OLS minimizes the sum of squared residuals
argmin over β̂0, β̂1 of Σ_{i=1}^n ûi² = argmin over β̂0, β̂1 of Σ_{i=1}^n (yi − β̂0 − β̂1 xi)²
Solution implies
β̂1_OLS = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)² = Σ_{i=1}^n yi (xi − x̄) / Σ_{i=1}^n xi (xi − x̄) = Cov(xi, yi) / Var(xi)
β̂0_OLS = ȳ − β̂1_OLS x̄
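As a cross-check of these wrap-up formulas, the hand-computed estimates can be compared against a library routine; np.polyfit is used here only as an independent check and the data are simulated assumptions:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, size=100)
    y = 2.0 + 0.7 * x + rng.normal(0, 1, size=100)   # assumed population line

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    b1_lib, b0_lib = np.polyfit(x, y, deg=1)         # returns [slope, intercept]
    print(np.allclose([b0, b1], [b0_lib, b1_lib]))   # True: both routes give the same OLS line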
Two goals:
1. Judge the quality of OLS estimates under CLRM assumptions (SLR.1 to SLR.5)
   SLR.1 to SLR.4 =⇒ unbiasedness
   SLR.1 to SLR.5 =⇒ BLUE
2. Alter OLS estimates when CLRM assumptions are violated
Multiple Linear Regression Model
Multiple regression analysis is more amenable to ceteris paribus
analysis
allows us to explicitly control for many other factors that
simultaneously affect dependent variable
If we add more factors to our model that are useful for explaining y
=⇒ more of the variation in y can be explained
Thus, MLR
used to build better models for predicting y
allows for much more flexibility
Example 1: MLR
Simple variation of our wage education example:
wage = β0 + β1 educ + β2 exper + u
where
educ : no.of years of education
exper : no.of years of experience in the labor market
Compared with a SLR relating wage to educ, here in MLR
we effectively take exper out of the error term
put it explicitly in the equation
Similar to SLR, in MLR we make assumptions about u
Since, exper has been accounted for explicitly
can measure the effect of educ on wages holding experience fixed or
can measure the effect of exper on wages holding education fixed
In SLR, because we didn’t take into account exper explicitly
we assumed that experience (which was part of the error term) is
uncorrelated with educ
Example 2: MLR
avgscore = β0 + β1 expend + β2 avginc + u
If the question is how per-student spending (expend) affects average
standardized test scores, then we are interested in β1
i.e. the ceteris paribus effect of expend on avgscore
By including avginc, we are controlling for the effect of average family
income on average score
Likely important as average family income tends to be correlated with
per-student expenditure
Example 3: MLR
MLR also useful in generalizing functional forms
Suppose family consumption is a quadratic function of family income
then
cons = β0 + β1 inc + β2 inc² + u
Here u contains all other factors affecting consumption apart from
income
Model depends on only one observed factor (even though more than
one explanatory variable)
Here β1 does not measure the effect of income on consumption holding inc² fixed, as
if inc changes, inc² also changes!
Here, change in consumption w.r.t. change in income is given by the marginal propensity to consume
∂cons/∂inc ≈ β1 + 2 β2 inc
Marginal effect of income on consumption depends on β1 and β2 and inc (level of income)
more on this later...
Model with 2 - independent variables
Model
y = β0 + β1 x1 + β2 x2 + u
β0 is the intercept
β1 measures the change in y w.r.t change in x1 , holding other factors
fixed
β2 measures the change in y w.r.t change in x2 , holding other factors
fixed
Model with k - independent variables
General form
y = β0 + β1 x1 + β2 x2 + .... + βk xk + u
β0 is the intercept
β1 measures the change in y w.r.t change in x1 , holding other factors
fixed
β2 measures the change in y w.r.t change in x2 , holding other factors
fixed and so on
Here, there are k-independent variables and an intercept
thus, equation consists of k + 1 unknown population parameters
terminology: 1- intercept parameter and k - slope parameters
Note: no matter how many variables we include in our model, there
will always be factors we cannot include =⇒ collectively contained in u
Example 4: MLR
Regressing CEO's salary (salary) on firm sales (sales) and CEO tenure (ceoten)
log(salary) = β0 + β1 log(sales) + β2 ceoten + β3 ceoten² + u
k = 3
x1 = log(sales); x2 = ceoten; x3 = ceoten²
β1: interpreted as the elasticity of salary w.r.t. sales
If β3 = 0, then β2 is the effect of one more year of ceoten on salary
If β3 ≠ 0, then the effect of ceoten on salary is
∂log(salary)/∂ceoten = β2 + 2 β3 ceoten
also called the marginal effect (more later...)
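A tiny sketch of how such a marginal effect varies with tenure; the coefficient values below are made up purely for illustration, not estimates from any dataset:

    import numpy as np

    beta2, beta3 = 0.02, -0.0005          # hypothetical coefficients on ceoten and ceoten^2
    ceoten = np.array([0, 5, 10, 20])     # tenure levels at which to evaluate the effect

    marginal_effect = beta2 + 2 * beta3 * ceoten   # d log(salary) / d ceoten
    print(marginal_effect)                # effect of one more year of tenure, at each tenure level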
Regressing CEO’s salary (salary ) on firm sales (sales) and CEO tenure
(ceoten)
log(salary) = β0 + β1 log(sales) + β2 ceoten + β3 ceoten² + u
Reminder: This equation is an example of multiple linear regression
linear in parameters βj
non-linear in the relationship between y and the x's
At minimum, what MLR requires is that all factors in the unobserved error term be uncorrelated with the explanatory variables
OLS estimates - 2 independent vars
Model with 2 independent variables: estimated equation is given by
ŷ = β̂0 + β̂1 x1 + β̂2 x2
β̂0 is the estimate of β0
β̂1 is the estimate of β1
β̂2 is the estimate of β2
OLS (like before) chooses estimates s.t. SSR is minimized
min Σ_{i=1}^n ûi² = Σ_{i=1}^n (yi − ŷi)² = Σ_{i=1}^n (yi − β̂0 − β̂1 xi1 − β̂2 xi2)²
Indexing
Important to master the meaning of indexing of
independent/explanatory vars
Independent var is followed by 2 subscripts
First subscript i refers to the observation number (i = 1, .....n)
Second subscript refers to a way of distinguishing between the
explanatory/independent variables (j = 1, ......, k)
Example: Wage-Educ-Exper
xi1 = educi is education for person i in the sample
xi2 = experi is experience for person i in the sample
SSR is Σ_{i=1}^n (wagei − β̂0 − β̂1 educi − β̂2 experi)²
Indexing
Thus, xij is the i th observation on the j th independent variable
OLS estimates - k-independent vars
General form: OLS minimizes the SSR
Σ_{i=1}^n (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − ... − β̂k xik)²
k + 1 estimates are chosen to minimize the SSR
Solved by multi-variable calculus
You get k + 1 equations in k + 1 unknowns β̂0, β̂1, ..., β̂k
Σ_{i=1}^n (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − ... − β̂k xik) = 0
Σ_{i=1}^n (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − ... − β̂k xik) xi1 = 0
Σ_{i=1}^n (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − ... − β̂k xik) xi2 = 0
.
.
Σ_{i=1}^n (yi − β̂0 − β̂1 xi1 − β̂2 xi2 − ... − β̂k xik) xik = 0
Called the OLS FOCs
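In matrix form these k + 1 first-order conditions are the normal equations X'Xβ̂ = X'y, which is how software solves them; a minimal sketch with simulated data (two regressors plus a constant; the matrix formulation is used here only as an illustration, it is not developed in these slides):

    import numpy as np

    rng = np.random.default_rng(5)
    n = 200
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)   # assumed population model

    X = np.column_stack([np.ones(n), x1, x2])            # n x (k+1) design matrix
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)         # solves the k+1 normal equations
    print(beta_hat)                                      # [b0, b1, b2]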
Terminology
OLS regression line
ŷ = β̂0 + β̂1 x1 + β̂2 x2 + ... + β̂k xk    (25)
We say "we ran an OLS regression of y on x1, x2, x3, ..., xk" or
We regressed y on x1, x2, x3, ..., xk
Interpreting OLS regression function
Back to: model with 2 independent variables
ŷ = β̂0 + β̂1 x1 + β̂2 x2    (26)
Intercept β̂0 is the predicted value of y when x1 = 0 and x2 = 0
Sometimes this is an interesting case, other times, this will not make
sense (later..)
Estimate β̂1 and estimate β̂2 have partial effect interpretation
From Eqn. 26
Δŷ = β̂1 Δx1 + β̂2 Δx2
When x2 is held constant, Δx2 = 0, then
Δŷ = β̂1 Δx1
β̂1 = Δŷ / Δx1
When x1 is held constant, Δx1 = 0, then
Δŷ = β̂2 Δx2
β̂2 = Δŷ / Δx2
Example 5: MLR
Determinants of College GPA
colGPA^ = 1.29 + 0.453 hsGPA + 0.0094 ACT
n = 141
Slope coefficient on hsGPA:
comment on magnitude
comment on direction: positive partial relationship b/w colGPA and
hsGPA
interpret: holding ACT fixed, one more point on hsGPA −→ almost
half a point rise in colGPA
If we choose 2 students A and B with the same exact ACT scores, but
A has 1 point higher hsGPA than B, then we predict A’s colGPA to be
0.453 higher than B
Slope coefficient on ACT :
Positive partial relationship between colGPA and ACT
Holding hsGPA fixed, 1 more point on ACT increases colGPA by less
than 1/10th of a point
Example 6: MLR
Model
log(wage)^ = 0.284 + 0.092 educ + 0.0041 exper + 0.022 tenure
What is the estimated effect on wages of an individual staying one more year at the same firm?
Δlog(wage)^ = 0.0041 Δexper + 0.022 Δtenure
both experience and tenure would increase by one year
resulting effect (holding education fixed) is about a 2.6% increase in wages
Δlog(wage)^ = 0.0041 + 0.022 = 0.0261
OLS fitted values and residuals
Once we obtain the OLS regression line
ŷ = β̂0 + β̂1 x1 + β̂2 x2 + .... + β̂k xk
We can obtain the fitted/predicted value for each observation, say for
observation i:
ŷi = β̂0 + β̂1 xi1 + β̂2 xi2 + .... + β̂k xik
it is the predicted value we obtain by plugging in values of x1 to xk for
observation i
Normally the yi will not equal the predicted value ŷi
But OLS minimizes the average squared prediction error
OLS doesn’t say anything about prediction error for any particular
observation though
As before, residual for each observation can be computed:
ûi = yi − ŷi
Residual:
ûi = yi − ŷi
If ûi > 0 =⇒ ŷi < yi : yi is underpredicted
If ûi < 0 =⇒ ŷi > yi : yi is overpredicted
Properties of OLS residuals
As before
Sample average of residuals is zero, and hence ȳ equals the sample average of the ŷi
Sample covariance between each independent var and the OLS residuals is zero
=⇒ sample covariance between OLS fitted values and OLS residuals is zero
Point (x̄1, x̄2, x̄3, ..., x̄k, ȳ) is always on the OLS regression line.
ȳ = β̂0 + β̂1 x̄1 + β̂2 x̄2 + ... + β̂k x̄k
“Partialling out” interpretation of slope parameter
In a model with k = 2: ŷ = β̂0 + β̂1 x1 + β̂2 x2
Here, the slope parameter
β̂1 = Σ_{i=1}^n r̂i1 yi / Σ_{i=1}^n r̂i1²    (27)
where
r̂i1 are the OLS residuals from the simple regression of x1 on x2 using the same sample (no proof required!)
We follow a two-step process to get to the slope parameter:
1. We first regress x1 on x2 −→ obtain residuals r̂1 (y has no role)
2. We then regress y on r̂1 −→ obtain β̂1
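A numerical sketch of this two-step "partialling out" procedure, checking that it reproduces the multiple-regression β̂1 (the data are simulated, with correlation between x1 and x2 built in deliberately):

    import numpy as np

    rng = np.random.default_rng(11)
    n = 500
    x2 = rng.normal(size=n)
    x1 = 0.6 * x2 + rng.normal(size=n)                   # x1 correlated with x2
    y = 1.0 + 1.5 * x1 + 2.0 * x2 + rng.normal(size=n)

    # Full multiple regression of y on (1, x1, x2)
    X = np.column_stack([np.ones(n), x1, x2])
    b_full = np.linalg.solve(X.T @ X, X.T @ y)

    # Step 1: regress x1 on x2, keep residuals r1
    Z = np.column_stack([np.ones(n), x2])
    g = np.linalg.solve(Z.T @ Z, Z.T @ x1)
    r1 = x1 - Z @ g

    # Step 2: regress y on r1 (Eqn. 27)
    b1_partial = np.sum(r1 * y) / np.sum(r1 ** 2)
    print(b_full[1], b1_partial)                         # the two estimates coincide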
“Partialling out” Interpretation of MLR
The way to interpret this is:
residuals r̂i1 are the part of xi1 that is uncorrelated with xi2, or
r̂i1 is xi1 after the effects of xi2 have been partialled out/netted out
β̂1 measures the sample relationship between y and x1 after x2 has been partialled out
“Partialling out” Interpretation of MLR
In the model with k explanatory vars
β̂1 can still be written as in Eqn. 27, but r̂i1 will be the residuals from the regression of x1 on x2, ..., xk
So β̂1 measures the effect of x1 on y after partialling out the effects of x2, ..., xk
Comparison of SLR and MLR slope parameter
If
SLR: ỹ = β̃0 + β̃1 x1
MLR: ŷ = β̂0 + β̂1 x1 + β̂2 x2
Then,
β̃1 = β̂1 + β̂2 δ̃1
where
δ̃1 is the slope parameter from simple regression of x2 on x1 :
x̃2 = δ̃0 + δ̃1 x1
Thus, in general β̃1 ≠ β̂1
Special case (2 independent vars): β̃1 = β̂1 only if
1. The partial effect of x2 on ŷ is zero in the sample i.e. β̂2 = 0, or
2. x1 and x2 are uncorrelated in the sample i.e. δ̃1 = 0
Special case (k-independent vars): β̃1 = β̂1 only if
1. The OLS coefficients on x2 through xk are all zero, or
2. x1 is uncorrelated with each of x2, ..., xk
Neither is likely in practice, but it is possible that the correlations are small, in which case the slope parameters will be similar
Example
An econometrician wrongly regresses height of individuals on nutrition of individuals as follows:
ht~ = β̃0 + β̃1 nutrn
Now, say she realizes the correct model and regresses height on nutrition as well as HH income:
ht^ = β̂0 + β̂1 nutrn + β̂2 inc
Nutrition is likely positively correlated with income s.t.
inc~ = δ̃0 + δ̃1 nutrn
Effect of individual's nutrition on height in the SLR would actually be made up of the partial effect of own nutrition on height (after partialling out the effect of income on height) + the effect of income on height × the effect of nutrition on income
β̃1 = β̂1 + β̂2 δ̃1
Goodness-of-Fit
Total variation in {yi}: total sum of squares (SST)
SST = Σ_{i=1}^n (yi − ȳ)²
Total variation in {ŷi}: explained sum of squares (SSE)
SSE = Σ_{i=1}^n (ŷi − ȳ)²
Total variation in {ûi}: residual sum of squares (SSR)
SSR = Σ_{i=1}^n ûi²
SST = SSE + SSR
R² = SSE/SST = 1 − SSR/SST
defined as the proportion of sample variation in yi that is explained by the OLS regression line. Lies between 0 and 1
Goodness-of-Fit
Recall R 2 = (Correlation coefficient)2 = ρ2
R 2 usually increases when we add more and more explanatory
variables to the regression
Mathematically this happens as SSR never increases if more
explanatory vars are added
Poor tool for deciding if another x should be added or not
Should ideally depend on whether x has a non-zero partial effect on y
MLR - assumptions
MLR.1 : Linearity in parameters
MLR.2 : Random sampling
MLR.3 : No perfect collinearity
none of the explanatory variables is constant
no exact linear relationships among the explanatory variables
Bivariate model: x1 and x2 are linearly independent
We allow for some correlation but no perfect correlation (otherwise
meaningless econometric analysis)
MLR.4 : Zero conditional mean
MLR.5 : Homoskedasticity
Collectively MLR.1 through MLR. 5 called the Gauss-Markov
assumptions
MLR. 3 - violation: case 1
Example: 2 candidates and an election
regress percentage of vote for candidate A on campaign expenditures
voteA = β0 + β1 expendA + β2 expendB + β3 totexpend + u
Model violates MLR.3. as perfect collinearity : x3 = x1 + x2
=⇒ totexpend = expendA + expendB
Try interpreting β1
measures the effect _____ on ____ keeping ___ fixed.
Nonsense!
Solution to perfect collinearity is to simply drop one of the explanatory
vars from the regression
MLR. 3 - violation - case 2
MLR.3. is violated if the same explanatory variable is measured in
different units in the same regression
income measured both in rupees and thousands of rupees
MLR. 3 - violation - case 3
MLR.3 also fails if the sample size n is too small in relation to k
In general k−independent var model, there are
k + 1 parameters
MLR.3. fails if n < k + 1
Why?
need k + 1 observations at least to estimate k + 1 parameters
MLR.4. - violation
Omitted variables
If important explanatory variables are omitted from the regression: u may be correlated with the x's
Reasons: data limitation or ignorance in case of actual application
Endogeneity
u is correlated with explanatory variables
If xj is uncorrelated with u =⇒ "exogenous explanatory vars"
If xj is correlated with u =⇒ "endogenous explanatory vars"
Caution!
Before we proceed, a word of caution:
Do not confuse MLR. 3 and MLR.4
MLR. 3. rules out certain relationships between the x's (easier to deal with)
MLR. 4. rules out relationships between u and x (difficult to identify and deal with)
Violation −→ bias in OLS estimators
If bivariate −→ bias in all 3 OLS estimators
If k-independent variable model −→ bias in all k + 1 OLS estimators
Unbiasedness
Under the assumptions MLR. 1 through MLR. 4
E (β̂j ) = βj , j = 0, 1, ...., k
OLS estimators are unbiased estimators of the population parameters
Unbiasedness
Useful to remember the meaning of unbiasedness:
Run a regression of log(wages) on educ
Find the estimate of the coefficient attached to educ to be 9.2%.
Tempting to say something like "9.2% is an unbiased estimate of the return to education."
But, when we say that OLS is unbiased under MLR.1 through MLR.4, we mean
the procedure by which the OLS estimates are obtained is unbiased
we hope that we've obtained a sample that gives us an estimate close to the population value, but, unfortunately, this cannot be assured. What is assured is that we have no reason to believe our estimate is more likely to be too big or too small.
Issues in MLR
1. Inclusion of irrelevant variables/over-specifying the model
2. Excluding a relevant variable/under-specifying the model
Including irrelevant vars/Overspecifying
Means one (or more) of explanatory vars is included in the regression
that have no partial effect on dependent variable
y = β0 + β1 x1 + β2 x2 + β3 x3 + u
Lets assume that model satisfies MLR.1 through MLR. 4
However, x3 has no effect on y after x1 and x2 have been controlled for
i.e. β3 = 0
x3 may or may not be correlated with x1 or x2
No need to include x3 in the model when its coefficient in the
population is zero.
Has no effect on unbiasedness of OLS estimators
Has undesirable effects on efficiency of OLS estimators (affects
variances)
Omitted variable/under-specifying
Means we omit a variable that actually belongs to the true population
model
y = β0 + β1 x1 + β2 x2 + u
This would ideally give the OLS regression
ŷ = β̂0 + β̂1 x1 + β̂2 x2
Lets assume that this model satisfies MLR.1 through MLR. 4
Suppose, primary interest is in β1 (partial effect of x1 on y )
But due to some reason (say unavailability of data), we run the
following regression instead
ỹ = β̃0 + β̃1 x1
Omitted variable bias (OVB)
Comparing β̃1 and β̂1, we know
β̃1 = β̂1 + β̂2 δ̃1    (28)
δ̃1 depends only on the independent variables, so we take it as non-random
Bias:
E(β̃1) = E(β̂1 + β̂2 δ̃1)
       = E(β̂1) + δ̃1 E(β̂2)
       = β1 + β2 δ̃1
       = β1 + β2 Cov(x1, x2)/Var(x1)
E(β̃1) − β1 = β2 δ̃1 = OVB    (29)
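A simulation sketch of Eqn. 28: dropping x2 from the regression shifts the estimated slope on x1 by exactly β̂2·δ̃1 in the sample (all parameter values below are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(21)
    n = 2000
    x2 = rng.normal(size=n)
    x1 = 0.8 * x2 + rng.normal(size=n)                   # x1 and x2 positively correlated
    y = 1.0 + 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)   # assumed beta1 = 1, beta2 = 2

    # Long regression: y on (1, x1, x2)
    X = np.column_stack([np.ones(n), x1, x2])
    b = np.linalg.solve(X.T @ X, X.T @ y)                # b[1] = beta1_hat, b[2] = beta2_hat

    # Short regression: y on (1, x1) only  -> beta1_tilde
    S = np.column_stack([np.ones(n), x1])
    bt = np.linalg.solve(S.T @ S, S.T @ y)

    # delta1_tilde from regressing x2 on x1
    d = np.linalg.solve(S.T @ S, S.T @ x2)

    print(bt[1], b[1] + b[2] * d[1])                     # Eqn. 28: the two numbers match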
OVB = 0 if
β2 = 0 i.e. x2 does not appear in the true model, or
δ̃1 = 0 i.e. x1 and x2 are uncorrelated
Sign of OVB:
           corr(x1, x2) > 0    corr(x1, x2) < 0
β2 > 0     +                   −
β2 < 0     −                   +
Upward bias or downward bias?
If E (β̃1 ) > β1 , then β̃1 has an upward bias
If E (β̃1 ) < β1 , then β̃1 has a downward bias
Example:
wage = β0 + β1 educ + β2 ability + u
If however regress (due to data issues)
wage = β0 + β1 educ + v
Do you think there is an OVB here? What is the likely sign of OVB
here?
corr (x1 , x2 ) =? ; sgn{β2 } =?
sgn{E (β̃1 ) − β1 } =?
Example:
avgscore = β0 + β1 expend + β2 povrate + u
If however regress
avgscore = β0 + β1 expend + v
Is there an OVB here? What is the likely sign of OVB here?
corr (x1 , x2 ) =? ; sgn{β2 } =?
sgn{E (β̃1 ) − β1 } =?
OVB contd..
Deriving the sgn{OVB} is more difficult in a general case (with more
explanatory variables)
Note: If corr(u, xj ) ≠ 0, then all OLS estimators are biased
Which assumption is violated here?
Example: If population model:
y = β0 + β1 x1 + β2 x2 + β3 x3 + u
Satisfies MLR.1 through MLR. 4
But we omit x3 and end up estimating:
ỹ = β̃0 + β̃1 x1 + β̃2 x2
Manini Ojha (JSGP)
EC-14
Fall, 2020
140 / 237
Suppose
corr(x2 , x3 ) = 0 but corr(x1 , x3 ) ≠ 0
Tempting to think β̃1 is probably biased but β̃2 will not be.
But both β̃1 and β̃2 will be biased!
Unless!
corr (x1 , x2 ) = 0
Rough guide to figure out the sign of OVB:
if corr(x1 , x2 ) = 0 is actually true, then what would the sign of the bias be,
given E(β̃1 ) = β1 + β3 · Cov(x1 , x3 )/Var(x1 )
Manini Ojha (JSGP)
EC-14
Fall, 2020
141 / 237
Variance of OLS estimators
MLR.5 : homoskedasticity
Example:
wage = β0 + β1 educ + β2 exper + β3 tenure + u
Homoskedasticity requires that the variance of unobserved error does
not depend on levels of education, experience or tenure i.e.
Var (u|educ, exper , tenure) = σ 2
If variance of error changes with any of the 3 −→ heteroskedasticity
MLR. 1 through MLR. 5 are needed to get to variance of β̂j
Var(β̂j ) = σ² / [SSTj (1 − Rj²)]    (30)
where Rj2 is the R-squared from the regression of xj on all other
explanatory variables
Manini Ojha (JSGP)
EC-14
Fall, 2020
142 / 237
Variance of OLS estimator
Var(β̂j ) = σ² / [SSTj (1 − Rj²)]
Size of Var (β̂j ) is important. Why?
The larger the variance:
the less precise the estimator
the wider the confidence intervals
the less accurate the hypothesis tests
Manini Ojha (JSGP)
EC-14
Fall, 2020
143 / 237
Components of OLS variance
Var (β̂j ) depends on 3 factors
1. σ²
2. SSTj
3. Rj²
Manini Ojha (JSGP)
EC-14
Fall, 2020
144 / 237
Component 1 - Error variance
σ2:
Larger the error variance −→larger the sampling variance for OLS
estimator
more “noise” in the equation −→ more difficult to estimate the precise
partial effect of any xj on y
For any given dependent variable y , the only way to reduce error
variance:
adding more x's (that can explain more of y, leaving “little” in the error u)
problem: unfortunately, not always possible to find additional legitimate x's that affect y
Manini Ojha (JSGP)
EC-14
Fall, 2020
145 / 237
Component 2 - Total sample variance in xj
SSTj :
Larger the SSTj −→smaller the sampling variance for OLS estimator
as SSTj −→ 0 Var (β̂j ) −→ ∞
SSTj = 0 not allowed
Everything else equal, we prefer to have as much variation in the x's as possible
A nice way of increasing the sample variation in each of the x's:
increase the sample size itself, i.e. n
Manini Ojha (JSGP)
EC-14
Fall, 2020
146 / 237
Component 3 - Linear association between the x's
Rj² :
Distinct from the R-squared from the regression of y on x
Rj² appears in the regression of one explanatory var on the others
Proportion of total variation in xj that is explained by the other explanatory vars
Suppose k = 2: y = β0 + β1 x1 + β2 x2 + u
Then
Var(β̂1 ) = σ² / [SST1 (1 − R1²)]
R1² is the R-squared from the regression of x1 on x2
Higher R1² means x2 explains much of the variation in x1 or
x1 and x2 are highly correlated
For the general k-independent-variable model:
If Rj² ↑ −→ (1 − Rj²) ↓ −→ Var(β̂j ) ↑
Manini Ojha (JSGP)
EC-14
Fall, 2020
147 / 237
Extreme cases:
If Rj2 = 0 (i.e. xj has zero correlation with every other x. Rare!)
then we get smallest Var (β̂j ) for given SSTj and σ 2
If Rj² = 1
Which assumption is violated?
MLR.3. : perfect collinearity
Relevant case:
If Rj2 −→ 1 : “close” to 1
Var (β̂j ) −→ ∞
This case is called “multicollinearity”: high but not perfect correlation
between 2 or more independent vars
Not a violation of MLR. 3.
Manini Ojha (JSGP)
EC-14
Fall, 2020
148 / 237
Multicollinearity: how much is too much?
Read [JW 5th ed] Chapter 3, p.95-98, section on: “Linear
relationship among the independent variables” (homework!)
Manini Ojha (JSGP)
EC-14
Fall, 2020
149 / 237
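As a complement to the assigned reading, here is a rough sketch of how one might compute each Rj² (and the corresponding variance inflation factor 1/(1 − Rj²)) by regressing xj on the other regressors. Only numpy is assumed; the data and variable names are illustrative, not from the text:

```python
# Sketch: computing R_j^2 and the variance inflation factor VIF_j = 1/(1 - R_j^2)
# for each regressor by regressing x_j on the remaining x's. Illustrative data only.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def r_squared_j(X, j):
    """R^2 from regressing column j of X on the remaining columns (with intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return 1 - resid.var() / y.var()

for j in range(X.shape[1]):
    r2j = r_squared_j(X, j)
    print(f"x{j+1}: R_j^2 = {r2j:.3f}, VIF = {1/(1 - r2j):.1f}")
```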
Trade-off between variance and bias
Choice between including a particular var or not in the model
analyze the trade-off
Recall:
Omitting relevant variable −→ bias
Including many vars −→ loss in efficiency
Manini Ojha (JSGP)
EC-14
Fall, 2020
150 / 237
Suppose the true population model is
y = β0 + β1 x1 + β2 x2 + u    (31)
Consider 2 estimates of β1 :
β̂1 from the MLR:
ŷ = β̂0 + β̂1 x1 + β̂2 x2    (32)
β̃1 from the SLR:
ỹ = β̃0 + β̃1 x1    (33)
Manini Ojha (JSGP)
EC-14
Fall, 2020
151 / 237
When β2 ≠ 0 , then
Eqn. 33 excludes a relevant variable −→ OVB unless corr (x1 , x2 ) = 0
β̃1 is biased
Therefore,
if bias is the only criterion used to decide which estimator is better,
then
β̂1 is preferred to β̃1
Manini Ojha (JSGP)
EC-14
Fall, 2020
152 / 237
But!
if variance is also brought into picture, then things change
We know, conditioning on values of x1 and x2 in the sample,
Var(β̂1 ) = σ² / [SST1 (1 − R1²)]    (34)
We also know
Var(β̃1 ) = σ² / SST1    (35)
Comparing Eqn. 34 and Eqn. 35:
Var(β̃1 ) < Var(β̂1 ) : β̃1 is more efficient
unless corr(x1 , x2 ) = 0 =⇒ β̃1 = β̂1
Manini Ojha (JSGP)
EC-14
Fall, 2020
153 / 237
If corr(x1 , x2 ) ≠ 0, then the following conclusions hold:
1. When β2 ≠ 0 : β̃1 is biased, β̂1 is unbiased, Var(β̃1 ) < Var(β̂1 )
2. When β2 = 0 : β̃1 and β̂1 are both unbiased and Var(β̃1 ) < Var(β̂1 )
   including x2 in the model exacerbates the multicollinearity problem −→ less efficient estimator
Traditionally, econometricians have compared the likely size of the bias
with the reduction in variance to decide to include x2 or not
Manini Ojha (JSGP)
EC-14
Fall, 2020
154 / 237
Estimating σ 2 in MLR
Recall
In SLR, estimate of σ 2 is σ̂ 2
Similarly in MLR, estimate of σ 2 is σ̂ 2
Recall, in SLR
σ̂² = Σᵢ ûᵢ² / (n − 2)
In MLR,
σ̂² = Σᵢ ûᵢ² / (n − (k + 1))
where the degrees of freedom (df) for general OLS with n observations and k independent variables is n − k − 1
Manini Ojha (JSGP)
EC-14
Fall, 2020
155 / 237
Since
σ̂² = Σᵢ ûᵢ² / (n − (k + 1))
and
Var(β̂j ) = σ² / [SSTj (1 − Rj²)]
⇒ sd(β̂j ) = σ / √[SSTj (1 − Rj²)]
and σ̂ estimates σ
⇒ se(β̂j ) = σ̂ / √[SSTj (1 − Rj²)]
called the standard error of β̂j
utilized when we construct confidence intervals and conduct tests
Manini Ojha (JSGP)
EC-14
Fall, 2020
156 / 237
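A small sketch tying these formulas together: estimate σ̂² from the residuals and then build se(β̂1) from SST1 and R1². Only numpy is assumed; the data and names are simulated purely for illustration:

```python
# Sketch: computing sigma-hat^2 and se(beta_1-hat) from the formulas above,
# using simulated data (all names and parameter values are illustrative).
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 2
x1, x2 = rng.normal(size=n), rng.normal(size=n)
u = rng.normal(scale=2.0, size=n)
y = 1 + 0.5 * x1 - 0.3 * x2 + u

X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

sigma2_hat = (u_hat ** 2).sum() / (n - (k + 1))          # df = n - k - 1

# se(beta_1-hat) via SST_1 and R_1^2 (regression of x1 on x2 with intercept)
Z = np.column_stack([np.ones(n), x2])
g = np.linalg.solve(Z.T @ Z, Z.T @ x1)
r1 = x1 - Z @ g
R1_sq = 1 - (r1 ** 2).sum() / ((x1 - x1.mean()) ** 2).sum()
SST1 = ((x1 - x1.mean()) ** 2).sum()
se_beta1 = np.sqrt(sigma2_hat / (SST1 * (1 - R1_sq)))
print(beta_hat, se_beta1)
```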
Efficiency of OLS: Gauss-Markov Theorem
Why do we use OLS instead of the wide variety of estimation
methods?
We know under MLR.1 through MLR.4, OLS is unbiased
But there may be other unbiased estimators as well...
However, we also know that under MLR.1 through MLR.5, OLS has the smallest variance among linear unbiased estimators
Theorem
Gauss-Markov Theorem: Under assumptions MLR.1 through MLR.5,
β̂0 , β̂1 , ...., β̂k are the best linear unbiased estimators (BLUEs) of
β0 , β1 , ...., βk respectively.
Manini Ojha (JSGP)
EC-14
Fall, 2020
157 / 237
Language of MLR Analysis
Note: many econometricians report that they have “estimated an OLS
model”
Incorrect language
Correct usage “used the OLS estimation method” - OLS is not a
model
A model describes an underlying population and depends on unknown
parameters
Model:
y = β0 + β1 x1 + β2 x2 + ..... + βk xk + u
Can talk about interpretation of βj (any one of the unknown
parameters) without looking at the data, just by looking at the model
Of course, we learn much more about βj from data
Manini Ojha (JSGP)
EC-14
Fall, 2020
158 / 237
Language of MLR Analysis
Other ways of estimation exist:
weighted least squares, least absolute deviations, instrumental variables
etc. (Advanced Trics)
Finally,
Important to not use imprecise language
Leads to vagueness on important considerations like assumptions
Example of correct usage:
“I estimated the equation by ordinary least squares. Under the
assumption that no important variables have been omitted from the
equation, and assuming random sampling, the OLS estimator of β1 is
unbiased. If the error term u has constant variance, the OLS estimator
is actually best linear unbiased.”
Manini Ojha (JSGP)
EC-14
Fall, 2020
159 / 237
Inference
We know
Expected value of OLS estimators
Variance of OLS estimators
For statistical inference, need to know more than just the first 2
moments of β̂j
Need to know the full sampling distribution of β̂j
The sampling distributions of the OLS estimators depend entirely on the sampling distribution of the errors when we condition on the values of the control variables in the sample
Manini Ojha (JSGP)
EC-14
Fall, 2020
160 / 237
Normality Assumption
MLR. 6: Normality assumption
The population error u is independent of the explanatory variables
x1, x2,...., xk and is normally distributed with zero mean and constant
variance σ 2 such that
u ∼ N(0, σ 2 )
Stronger assumption than any of the previous assumptions
If we make assumption MLR.6., we are necessarily making the
assumptions MLR.4. and MLR. 5. Why?
Manini Ojha (JSGP)
EC-14
Fall, 2020
161 / 237
Language
For cross-section regression analysis, the full set of assumptions, MLR.1 through MLR.6, is collectively called the CLRM assumptions
Under these 6 assumptions, we will call the model the Classical
Linear Regression Model (CLRM)
Manini Ojha (JSGP)
EC-14
Fall, 2020
162 / 237
In real-world application, assumption of normality of u is really an
empirical matter
Often times this is not true
No theorem says that wage conditional on education and experience is
normally distributed
In fact, most probably this is not true as wage cannot be less than zero, so strictly speaking, it cannot have a normal distribution
Nevertheless, we consider this assumption and ask the question if the
distribution is “close” to being normal
Often, using a transformation like logs yields a distribution closer to
a normal.
Example: log (price) tends to have a distribution that looks more
normal than the distribution of price
Log is one of the most common transformations to get a skewed
distribution looking more like a normal
Manini Ojha (JSGP)
EC-14
Fall, 2020
163 / 237
Normal sampling distributions
Normality of errors assumptions =⇒ normal sampling distributions of
OLS estimators:
Theorem
Normal Sampling Distributions
Under the CLRM assumptions, MLR.1 through MLR.6, conditional on the
sample values of the independent variables,
β̂j ∼ N[βj , Var (β̂j )],
Therefore,
(β̂j − βj ) / sd(β̂j ) ∼ N(0, 1)    (36)
Standardized a normal r.v. by subtracting off its mean and dividing by
its standard deviation to get to a standard normal r.v.
Manini Ojha (JSGP)
EC-14
Fall, 2020
164 / 237
Normal sampling distributions
Theorem =⇒
1. Any linear combination of β̂0 , β̂1 , ...., β̂k is also normally distributed
2. Any subset of the β̂j has a joint normal distribution
Manini Ojha (JSGP)
EC-14
Fall, 2020
165 / 237
Hypothesis Testing
Recall, in general θ denotes unknown parameter and θ̂ is an estimate
If θ̂ is a unique value, then θ̂ is a point estimate
Hypothesis testing is a method of inference concerning the unknown
population parameters
Manini Ojha (JSGP)
EC-14
Fall, 2020
166 / 237
Begin with some definitions
Definition
A hypothesis is a statement about a population parameter
Definition
The two complementary hypotheses in a hypothesis testing problem are
denoted the null and alternative hypothesis. These are denoted H0 and Ha
(sometimes H1 ) respectively
Manini Ojha (JSGP)
EC-14
Fall, 2020
167 / 237
Testing hypothesis about a single population parameter:
t-test
Population model
y = β0 + β1 x1 + ........ + βk xk + u
can hypothesize about value of βj and use hypothesis testing to draw
inference
Theorem
t- distribution for standardized estimators
Under CLRM assumptions
(β̂j − βj ) / se(β̂j ) ∼ tn−k−1 = tdf    (37)
Theorem on t - distribution is important as it allows us to test the
hypothesis involving βj .
Manini Ojha (JSGP)
EC-14
Fall, 2020
168 / 237
Primary interest is typically in testing the null hypothesis
H0 :βj = 0
(38)
What does this mean?
Since βj measures the partial effect of xj on (the expected value of) y ,
after controlling for all other independent variables, Eqn. 38 means
that, once x1 , x2 , ....., xj−1 , xj+1 , xk have been accounted for, xj has no
effect on the expected value of y
Example: returns to education
log (wage) = β0 + β1 educ + β2 exper + β3 tenure + u
H0 :β2 = 0 means that once education and tenure have been
accounted for, # years of past experience in the workforce has no
effect on hourly wages.
Manini Ojha (JSGP)
EC-14
Fall, 2020
169 / 237
The statistic we use to test the null against any alternative is called
the t−statistic or the t- ratio of β̂j
tβ̂j is β̂j / se(β̂j )
We will say teduc is the t statistic for β̂educ
If β̂j > 0 then tβ̂j > 0
If β̂j < 0 then tβ̂j < 0
For given value of se(β̂j ), a larger value of β̂j −→ tβ̂j larger
Manini Ojha (JSGP)
EC-14
Fall, 2020
170 / 237
t- distribution derived by Gosset (1908)
Worked at Guinness brewery in Dublin which prohibited publishing due
to fear of revealing trade secrets
Gosset instead published under the name ‘Student’
Hence, also known as the Student-t dbn
Manini Ojha (JSGP)
EC-14
Fall, 2020
171 / 237
For null H0 : βj = 0,
Look at the unbiased estimator β̂j and ask how far is β̂j from zero?
Value of β̂j very far from zero provides evidence against the null
But sampling errors exist in our estimate which is accounted for by the
s.e.
Thus, tβ̂j measures how many estimated standard deviations β̂j is
away from zero
tβ̂j sufficiently far away from zero will result in rejection of the null
Precise rule depends on alternative hypothesis and chosen level of
significance
Manini Ojha (JSGP)
EC-14
Fall, 2020
172 / 237
Caution
Never write the Null as H0 :β̂j = 0 !!
We test hypotheses for population parameters NOT estimates from a
particular sample!
Manini Ojha (JSGP)
EC-14
Fall, 2020
173 / 237
Hypothesis testing procedure
Procedure typically entails
1. Construction of a test statistic which is a function of sample estimate(s)
2. Specification of the rejection region for this test statistic
3. Comparison of the sample value with the rejection region
Manini Ojha (JSGP)
EC-14
Fall, 2020
174 / 237
Choosing a rejection rule
How do we choose a rejection rule?
First choose a significance level
Definition
Significance level is the probability of rejecting the H0 when it is in fact
true.
Suppose we have decided on a 5% significance level
The critical value corresponding to the 5% significance level is denoted by c
By the choice of this critical value, rejection of the null will occur for 5% of all the random samples when the null is true
Manini Ojha (JSGP)
EC-14
Fall, 2020
175 / 237
Testing one - sided alternative
One-sided alternative
H0 : βj = 0
Ha : βj > 0
Here rejection rule: tβ̂j > c (right tailed)
Or
H0 : βj = 0
Ha : βj < 0
Here rejection rule: tβ̂j < −c (left-tailed)
Manini Ojha (JSGP)
EC-14
Fall, 2020
176 / 237
Rejection rule and computation of c
Rejection rule in one-tailed test is that H0 is rejected in favor of Ha
at 5% significance level if
tβ̂j > c    (39)
To compute c, we need significance level and df
Rough guide: If df>120, can use standard normal critical values
As the significance level ↓, the critical value ↑
Thus, we need larger and larger values of t-statistic to reject the null
Manini Ojha (JSGP)
EC-14
Fall, 2020
177 / 237
Right-tailed rejection region
Manini Ojha (JSGP)
EC-14
Fall, 2020
178 / 237
Left-tailed rejection region
Manini Ojha (JSGP)
EC-14
Fall, 2020
179 / 237
Example
Hourly wage model
log(wage)^ = 0.284 + 0.092 educ + 0.0041 exper + 0.022 tenure
             (0.104)   (0.007)      (0.0017)       (0.003)
n = 526, R² = 0.316
Standard errors are provided in parentheses below the estimated
coefficients
Use this equation to test whether the return to exper , controlling for educ and tenure, is zero in the population against the alternative that it is positive.
H0 : ?
Ha : ?
How will you compute the t-statistic if the chosen significance level is 5%?
tβ̂2 = texper = ? / ? = ?
Manini Ojha (JSGP)
EC-14
Fall, 2020
180 / 237
Note: Since df=522, can use standard normal critical values
At 5% significance level, critical value c = 1.645
At 1% significance level, critical value c = 2.326
Therefore, from the example equation, since texper ≈ 2.41, we can say
β̂exper or exper is statistically significant even at the 1% level OR
β̂exper is statistically greater than zero even at the 1% significance level
Manini Ojha (JSGP)
EC-14
Fall, 2020
181 / 237
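For the record, here is a hedged sketch of that calculation using scipy (assumed available); the numbers are the ones quoted on the slides, and the small differences from the normal critical values come from using df = 522:

```python
# Sketch of the one-sided test above (scipy assumed): t_exper compared with
# critical values from the t distribution with df = n - k - 1 = 526 - 3 - 1 = 522.
from scipy import stats

t_exper = 0.0041 / 0.0017             # about 2.41
df = 526 - 3 - 1
c_5pct = stats.t.ppf(0.95, df)        # about 1.648 (close to the normal 1.645)
c_1pct = stats.t.ppf(0.99, df)        # about 2.333 (close to the normal 2.326)
print(t_exper, c_5pct, c_1pct, t_exper > c_1pct)   # reject H0 even at the 1% level
```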
Testing - two-sided alternatives
Two sided alternative
H0 : βj = 0
Ha : βj ≠ 0
Rejection rule in a two-tailed test - look at the absolute value of the t-stat:
|tβ̂j | > c    (40)
To compute c , need significance level and df
When specific alternative not stated, usually considered to be
two-sided
Manini Ojha (JSGP)
EC-14
Fall, 2020
182 / 237
Manini Ojha (JSGP)
EC-14
Fall, 2020
183 / 237
Language
If H0 is rejected in favor of H1 at the 5% level, we usually say that
xj is statistically significant at 5% level or
xj is statistically different from zero at the 5% level.
If H0 is not rejected, we say that
xj is statistically not significant at the 5% level or
we fail to reject H0 at the 5% level (rather than, we accept the null)
Manini Ojha (JSGP)
EC-14
Fall, 2020
184 / 237
Testing other hypotheses about βj
Sometimes interested in whether βj is equal to another constant
Common examples
H0 : βj = a
where a is the hypothesized value of βj , then
t = (β̂j − a) / se(β̂j )    (41)
t measures how many estimated standard deviations β̂j is away from
the hypothesized value of βj
t = (estimate − hypothesized value) / standard error
Manini Ojha (JSGP)
EC-14
Fall, 2020
185 / 237
If
H0 : βj = 1
H1 : βj > 1
then find critical value for one-sided alternative exactly as before
We reject the H0 in favor of H1 if t > c =⇒β̂j is statistically greater
than one at appropriate significance level
If
H0 : βj = 1
H1 : βj ≠ 1
then find critical value for two-sided alternative exactly as before
We reject the H0 in favor of H1 if |t| > c=⇒β̂j is statistically different
than one at appropriate significance level
Manini Ojha (JSGP)
EC-14
Fall, 2020
186 / 237
If
H0 : βj = −1
H1 : βj ≠ −1
then find critical value for two-sided alternative exactly as before
t = (β̂j + 1)/se(β̂j )
We reject the H0 in favor of H1 if |t| > c=⇒β̂j is statistically different
than negative one at appropriate significance level
∴ Difference is in how we compute the t - stat, not how we obtain c
Manini Ojha (JSGP)
EC-14
Fall, 2020
187 / 237
Recap - hypothesis testing
To test hypothesis using classical approach:
State the alternative hypothesis
Choose a significance level
Then determine a critical value based on df and significance level
Compute the value of the t-statistic
Compare t-statistic with the critical value
the null is either rejected or not rejected at the given significance level
Manini Ojha (JSGP)
EC-14
Fall, 2020
188 / 237
Manini Ojha (JSGP)
EC-14
Fall, 2020
189 / 237
p-values
Rather than testing at different significance levels, can try to answer
the following:
Given the value of the t-stat, what is the smallest significance level at
which the null would be rejected?
This is known as the p-value for the test
It is a probability and always lies between 0 and 1
Manually computing requires detailed printed t-tables
Regression package will do it for you
The reported p-value is almost always for testing the null against the two-sided alternative
Manini Ojha (JSGP)
EC-14
Fall, 2020
190 / 237
The p-value summarizes the strength or weakness of the empirical evidence against the null
Small p-values are evidence against the null
Large p-values provide little evidence against the null
If α denotes the significance level of the test (in decimal), then
we reject the H0 if p-value < α
we fail to reject the H0 if p-value > α
Manini Ojha (JSGP)
EC-14
Fall, 2020
191 / 237
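A quick sketch of how a regression package gets the two-sided p-value from the t-statistic (scipy assumed available; the t-stat and df are taken from the earlier wage example):

```python
# Sketch (scipy assumed): the two-sided p-value implied by a t statistic.
from scipy import stats

t_stat, df = 2.41, 522
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)   # P(|T| > |t|) under H0
print(p_two_sided)    # about 0.016: reject at the 5% level, not at the 1% level
```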
Economic significance vs statistical significance
Statistical significance of a variable xj is determined entirely by the
size of tβ̂j
Economic significance of a variable is related to the size and sign of β̂j
Too much emphasis on statistical significance may lead to the false conclusion that an estimate is important even though the estimated effect is small
statistical significance is driven by both the size of β̂j and se(β̂j )
Manini Ojha (JSGP)
EC-14
Fall, 2020
192 / 237
Example:
log(wage)^ = 80.29 + 5.44 educ + 0.269 exper − 0.00013 tenure
             (0.78)   (0.52)      (0.045)       (0.00004)
n = 1534, R² = 0.100
Discuss the statistical (compute t-stat) vs economic significance of
tenure on predicted log (wage)
Small sample size −→ less precise estimators, higher standard errors
Large sample size −→ more precise estimators, smaller standard errors
in comparison to coefficient estimate
Sometimes large standard errors because of high multicollinearity Rj2
even if sample size is fairly large
Manini Ojha (JSGP)
EC-14
Fall, 2020
193 / 237
Confidence Intervals
Also called interval estimates
provide a range of likely values for the population parameter rather
than just a point
Given that
(β̂j − βj ) / se(β̂j ) ∼ tn−k−1
a confidence interval (CI) for the unknown βj is constructed as
β̂j ± c · se(β̂j )    (42)
where c is the critical value
Manini Ojha (JSGP)
EC-14
Fall, 2020
194 / 237
Confidence Intervals
Lower bound of CI: β̲j = β̂j − c· se(β̂j )    (43)
Upper bound of CI: β̄j = β̂j + c· se(β̂j )    (44)
Manini Ojha (JSGP)
EC-14
Fall, 2020
195 / 237
Example:
1. df=25, a 95% CI for any βj is given by [β̂j − 2.06· se(β̂j ), β̂j + 2.06· se(β̂j )]
2. df=25, a 90% CI for any βj is given by [β̂j − 1.71· se(β̂j ), β̂j + 1.71· se(β̂j )]
3. df=25, a 99% CI for any βj is given by [β̂j − 2.79· se(β̂j ), β̂j + 2.79· se(β̂j )]
Example: for df=120 (using the normal dbn), a 95% CI for any βj is given by [β̂j − 1.96· se(β̂j ), β̂j + 1.96· se(β̂j )]
Manini Ojha (JSGP)
EC-14
Fall, 2020
196 / 237
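A sketch of Eqn. 42 in code, using the educ coefficient and standard error from the earlier wage example (scipy assumed available; the exact critical value uses df = 522 rather than the rounded table values):

```python
# Sketch (scipy assumed): 95% CI of the form beta_hat ± c·se for the educ coefficient.
from scipy import stats

beta_hat, se, df = 0.092, 0.007, 522
c = stats.t.ppf(0.975, df)                     # two-sided 5% critical value
print(beta_hat - c * se, beta_hat + c * se)    # roughly (0.078, 0.106)
```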
Testing hypothesis : single linear combination of parameters
Till now we’ve seen how to test hypothesis about single unknown
parameter βj
In application, often required to test hypothesis about many
population parameters, sometimes a combination of them
Example: Kane & Rouse (1995) : population includes working people
with HS degree
log (wage) = β0 + β1 jc + β2 univ + β3 exper + u
(45)
jc: # years of attending 2 year college
univ : # years of attending 4 year college
exper : # months in the workforce
Manini Ojha (JSGP)
EC-14
Fall, 2020
197 / 237
If hypothesis of interest is whether 1 year at jc is worth 1 year at uni,
then
H0 : β1 = β2
i.e. 1 more year at jc and 1 more yr at uni lead to same ceteris paribus
increase in wage
Alternative of interest is one-sided: a year at a junior college is worth
less than a year at a university
H1 : β1 < β2
Here, rewrite the null and alternative as
H0 : β1 − β2 = 0
H1 : β1 − β2 < 0
Manini Ojha (JSGP)
EC-14
Fall, 2020
198 / 237
Construct t- stat as
t = (β̂1 − β̂2 ) / se(β̂1 − β̂2 )
t- stat is based on whether the estimated difference β̂1 − β̂2 is
sufficiently less than zero to warrant rejection of the null
Then choose significance level and based on df, compute critical value
and test
Difficulty lies in getting se(β̂1 − β̂2 )
Caution!
se(β̂1 − β̂2 ) ≠ se(β̂1 ) − se(β̂2 )
Recall
Var(β̂1 − β̂2 ) = Var(β̂1 ) + Var(β̂2 ) − 2Cov(β̂1 , β̂2 )
se(β̂1 − β̂2 ) = √[Var(β̂1 ) + Var(β̂2 ) − 2Cov(β̂1 , β̂2 )]
Manini Ojha (JSGP)
EC-14
Fall, 2020
199 / 237
Easier method to do this is as follows:
Define a new parameter
θ1 = β1 − β2    (46)
Null and alternative respectively become:
H0: θ1 = 0
H1: θ1 < 0
Manini Ojha (JSGP)
EC-14
Fall, 2020
200 / 237
Model becomes (subbing Eqn 46 in Eqn 45):
log (wage) = β0 + β1 jc + β2 univ + β3 exper + u
= β0 + (θ1 + β2 )jc + β2 univ + β3 exper + u
= β0 + θ1 jc + β2 (jc + univ ) + β3 exper + u
log (wage) = β0 + θ1 jc + β2 totcoll + β3 exper + u
(47)
If we want to directly estimate θ1 and obtain its se, then we must
construct the new variable jc + univ = totcoll and include it in the
regression
i.e. total years of college
Manini Ojha (JSGP)
EC-14
Fall, 2020
201 / 237
Eqn 47 is simply a way of reformulating the original model Eqn 45.
Can compare the coefficients and se and check if the reformulation is
correct
β1 disappears and θ1 appears explicitly
β0 remains same (in fact you will see, estimate and se of these will be
the same)
β3 remains same (in fact you will see, estimate and se of these will be
the same)
Coeff on new variable totcoll and its se will also be the same as before
log(wage)^ = 1.472 + 0.0667 jc + 0.0769 univ + 0.0049 exper
             (0.021)  (0.0068)     (0.0023)      (0.0002)
log(wage)^ = 1.472 − 0.0102 jc + 0.0769 totcoll + 0.0049 exper
             (0.021)  (0.0069)     (0.0023)        (0.0002)
Manini Ojha (JSGP)
EC-14
Fall, 2020
202 / 237
Reformulation is done so we can estimate θ1 directly and get its se
directly
Can compute CI at 95% confidence level for θ1 = β1 − β2 as
θ̂1 ± 1.96· se(θ̂1 )
Manini Ojha (JSGP)
EC-14
Fall, 2020
203 / 237
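A sketch of the reformulation trick in code: build totcoll = jc + univ and regress on jc, totcoll and exper, so the coefficient on jc is θ̂1 with its own standard error. statsmodels is assumed available and the data are simulated purely for illustration (the coefficients below are not the Kane & Rouse estimates):

```python
# Sketch of the theta_1 = beta_1 - beta_2 reformulation. statsmodels assumed;
# the data-generating numbers below are illustrative placeholders.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1000
jc = rng.poisson(1.0, n)
univ = rng.poisson(2.0, n)
exper = rng.normal(60, 20, n)
lwage = 1.5 + 0.065 * jc + 0.075 * univ + 0.005 * exper + rng.normal(0, 0.4, n)

totcoll = jc + univ                                   # total years of college
X = sm.add_constant(np.column_stack([jc, totcoll, exper]))
res = sm.OLS(lwage, X).fit()

theta1_hat, se_theta1 = res.params[1], res.bse[1]     # estimate and se of beta_1 - beta_2
print(theta1_hat, theta1_hat - 1.96 * se_theta1, theta1_hat + 1.96 * se_theta1)
```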
Testing hypothesis: multiple linear restrictions
What to do if interested in testing whether a set/group of independent
variables has no partial effect on a dependent variable?
Model
y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + u
Null:
H0 : β2 = 0, β3 = 0, β4 = 0
Null here =⇒ once x1 and x5 have been controlled for, x2, x3, x4 have no effect on y and can be excluded from the model
These are called multiple restrictions
Test of multiple restrictions is called multiple hypotheses test or a
joint hypotheses test
Manini Ojha (JSGP)
EC-14
Fall, 2020
204 / 237
What should be the alternative?
Ha :H0 is not true
This would hold if at least one of β2 , β3 , or β4 is different from zero
(any or all could be different from 0)
Manini Ojha (JSGP)
EC-14
Fall, 2020
205 / 237
Cannot test this joint null by using t-tests to check whether each variable is individually significant.
An individual t-test does not put any restrictions on the other parameters
Another way of testing joint hypotheses where
SSR and R 2 play a role
Recall, since OLS estimates are chosen to minimize SSR
SSR always ↑ when x's are dropped from the model : restricted model
Compare SSR in the model with all of the variables (unrestricted model) with SSR where the x's are dropped and check the rejection rule
Manini Ojha (JSGP)
EC-14
Fall, 2020
206 / 237
Unrestricted model: model with more parameters
y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + u
(48)
Restricted model: model with fewer parameters than unrestricted
model
y = β0 + β1 x1 + β5 x5 + v
(49)
Question:
1. Which model out of Eqn. 48 and Eqn. 49 will have the greater SSR?
2. Which model out of Eqn. 48 and Eqn. 49 will have the greater R²?
Manini Ojha (JSGP)
EC-14
Fall, 2020
207 / 237
SSRr > SSRur
SSEr < SSEur
=⇒ Rr² < Rur²    (50)
The way to test multiple restrictions / joint hypotheses is by using the F - statistic / F - ratio:
F = [(SSRr − SSRur )/q] / [SSRur /(n − k − 1)]
where q : number of restrictions (in our example q = 3)
Since SSRr is no smaller than SSRur , the F stat is always non-negative (strictly positive)
Manini Ojha (JSGP)
EC-14
Fall, 2020
208 / 237
Since the restricted model has fewer parameters—and each model is
estimated using the same n observations
dfr is always greater than dfur
F statistic distributed as r.v.
F ∼ Fq,n−k−1
Rejection rule: reject H0 in favor of H1 at the chosen significance
level if
F >c
Then we say β2 , β3 , β4 are jointly statistically significant or jointly
statistically different from zero.
Manini Ojha (JSGP)
EC-14
Fall, 2020
209 / 237
F - statistic can also be written as
F = [(Rur² − Rr²)/q] / [(1 − Rur²)/(n − k − 1)]
∵ SSRur = SST (1 − Rur²) and SSRr = SST (1 − Rr²)
Called the R² form of the F - statistic
easier to use this to compute the F stat since R² is always reported in all software packages, while SSR may not be
Note: Here in the numerator Rur² comes first as Rur² > Rr² (refer to Eqn. 50)
Manini Ojha (JSGP)
EC-14
Fall, 2020
210 / 237
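A sketch of the R² form of the F statistic together with its critical value (scipy assumed; the sample size, number of restrictions and R-squared values are illustrative placeholders):

```python
# Sketch (scipy assumed): the R^2 form of the F statistic with its 5% critical value.
# n, k, q and the two R-squareds below are illustrative placeholders.
from scipy import stats

n, k, q = 200, 5, 3
R2_ur, R2_r = 0.40, 0.35
F = ((R2_ur - R2_r) / q) / ((1 - R2_ur) / (n - k - 1))
c = stats.f.ppf(0.95, q, n - k - 1)     # critical value from F(q, n - k - 1)
print(F, c, F > c)                      # reject H0 if F > c
```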
F statistic for overall significance of the model
A special set of exclusion restrictions
These restrictions have the same interpretation, regardless of the
model.
In the model with k independent variables, we can write the null
hypothesis as
H0 : x1 , x2 , ......., xk do not help to explain y
This null hypothesis is very pessimistic
It states that none of the explanatory variables has an effect on y
H0 : β1 = β2 = ....... = βk = 0
Alternative is that at least one of the βj is different from zero
Manini Ojha (JSGP)
EC-14
Fall, 2020
211 / 237
How many restrictions are there?
k restrictions
q=k
Restricted model here looks like
y = β0 + u
What is the Rr2 ?
Zero (none of the variation in y is being explained because there are no
explanatory variables)
Thus, the F - stat:
F = [R²/q] / [(1 − R²)/(n − k − 1)] = [R²/k] / [(1 − R²)/(n − k − 1)]
Manini Ojha (JSGP)
EC-14
Fall, 2020
212 / 237
Testing general linear restrictions
Testing exclusion restrictions is by far the most important application
of F statistics
But sometimes, restrictions implied by a theory are more complicated
than just excluding some independent variables.
log (price) = β0 +β1 log (assess)+β2 log (lotsize)+β3 log (sqrft)+β4 bdrms+u
where
price = house price;
assess= the assessed housing value (before the house was sold);
lotsize = size of the lot, in square feet;
sqrft = square footage;
bdrms = number of bedrooms
Manini Ojha (JSGP)
EC-14
Fall, 2020
213 / 237
Testing general linear restrictions
Say want to test the following
H0 :β1 = 1, β2 = 0, β3 = 0, β4 = 0
What does this null mean?
How many restrictions are there?
How will you test this?
[Hint: write the restricted model, compare with unrestricted, get
F -stat]
Can you use the R 2 form of F stat here?
Do it!
Manini Ojha (JSGP)
EC-14
Fall, 2020
214 / 237
Nested & non-nested models
When you have two equations and neither equation is a special case of
the other,
called non-nested models
The F - statistic only allows us to test nested models
one model (the restricted model) is a special case of the other model
(the unrestricted model)
Manini Ojha (JSGP)
EC-14
Fall, 2020
215 / 237
Effects of data scaling on OLS Statistics
When variables are rescaled,
the coefficients,
standard errors,
confidence intervals,
t statistics, and
F statistics
change in ways that preserve all measured effects and testing outcomes
Manini Ojha (JSGP)
EC-14
Fall, 2020
216 / 237
Data scaling is often used for cosmetic purposes
to reduce the number of zeros after a decimal point in an estimated
coefficient
to improve the appearance of an estimated equation while changing
nothing that is essential
Manini Ojha (JSGP)
EC-14
Fall, 2020
217 / 237
Example
Consider the equation relating infant birth weight to cigarette smoking
and family income:
bwght^ = β̂0 + β̂1 cigs + β̂2 faminc
where
bwght is child birth weight, in ounces
cigs is number of cigarettes smoked by the mother while pregnant, per
day
faminc is annual family income, in thousands of dollars
Manini Ojha (JSGP)
EC-14
Fall, 2020
218 / 237
Manini Ojha (JSGP)
EC-14
Fall, 2020
219 / 237
Estimate on cigs says
if a woman smoked 5 more cigarettes per day, birth weight is predicted
to be about .4634(5) = 2.317 ounces less
t - stat on cigs is −5.06 (v. statistically significant)
Change unit of measurement for dependent var:
Suppose now we decide to measure birth weight in pounds instead of
ounces
Let bwghtlbs = bwght/16 be birth weight in pounds
What happens to OLS statistics?
Essentially dividing the entire equation by 16
Verify by looking at col (2)
Manini Ojha (JSGP)
EC-14
Fall, 2020
220 / 237
Estimates in col. (2) = col. (1) /16
coefficient on cigs = −.0289
if cigs were higher by five, birth weight would be .0289(5) = 0.1445
pounds lower
Convert to ounces −→.1445(16) = 2.312
slightly different from the 2.317 we obtained earlier (due to rounding
error)
Point being: once the effects are transformed into the same units, we
get exactly the same answer, regardless of how the dependent variable
is measured
Manini Ojha (JSGP)
EC-14
Fall, 2020
221 / 237
What happens to statistical significance?
Changing the dependent variable from ounces to pounds has no effect
on how statistically important the independent variables are
t-stat in col. (2) are identical to t-stat in col. (1)
end-points for the CIs in col (2) are the endpoints in col. (1) divided by
16
R-squareds from the two regressions are identical
SSRs differ (SSR in col. (2) = SSR in col. (1)/256)
Manini Ojha (JSGP)
EC-14
Fall, 2020
222 / 237
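The scaling claims above are easy to verify numerically. A sketch with simulated data (statsmodels assumed; the data-generating numbers are invented for illustration only):

```python
# Sketch: rescaling the dependent variable (ounces -> pounds) divides coefficients,
# standard errors and CI end-points by 16, while t statistics and R-squared are
# unchanged. statsmodels assumed; the data below are simulated for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 500
cigs = rng.poisson(2.0, n)
faminc = rng.normal(30, 10, n)
bwght = 115 - 0.5 * cigs + 0.1 * faminc + rng.normal(0, 20, n)   # in ounces

X = sm.add_constant(np.column_stack([cigs, faminc]))
res_oz = sm.OLS(bwght, X).fit()
res_lb = sm.OLS(bwght / 16, X).fit()       # same regression with bwght in pounds

print(res_oz.params / res_lb.params)       # every ratio equals 16
print(res_oz.tvalues - res_lb.tvalues)     # differences are (numerically) zero
print(res_oz.rsquared, res_lb.rsquared)    # identical R-squared
```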
Change the unit of measurement of one of the independent
variables, cigs.
Define packs to be the number of packs of cigarettes smoked per day
What happens to the coefficients and other OLS statistics now?
Look at col. (3) [ t - stat same, se differ]
Why have we not included both cigs and packs in the same equation?
Manini Ojha (JSGP)
EC-14
Fall, 2020
223 / 237
Note: changing the unit of measurement of the dependent variable,
when it appears in logarithmic form, does not affect any of the
slope estimates
only the intercept changes
Note: changing the unit of measurement of any xj , where log (xj )
appears in the regression does not affect slope estimates
only the intercept changes
Manini Ojha (JSGP)
EC-14
Fall, 2020
224 / 237
Standardized coefficients /Beta coefficients
Sometimes, in econometric applications, a key variable is measured on
a scale that is difficult to interpret
example effect of test scores on wages: usually interested in how a
particular individual’s scores compares to the population
difficult to visualize what would happen to wages if test score increased
by 10 points
makes more sense to think of it in terms of what happens to wages if
test score is one standard deviation higher
Thus, sometimes, it is useful to obtain regression results when all
variables (dependent and independent vars) have been standardized
Manini Ojha (JSGP)
EC-14
Fall, 2020
225 / 237
A variable is standardized in the sample by subtracting off its mean
and dividing by its standard deviation
i.e. compute the z-score for every variable in the sample
run the regression using the z-scores
Manini Ojha (JSGP)
EC-14
Fall, 2020
226 / 237
Start with original OLS equation:
yi = β̂0 + β̂1 xi1 + β̂2 xi2 + ..... + β̂k xik + ûi
Beta coefficients or standardized coefficients (b̂j ) are
simply the original coefficient β̂1 multiplied by the ratio of the standard deviation of x1 to the standard deviation of y, i.e. b̂1 = (σ̂1 /σ̂y )· β̂1
no need to know the proof: take the average across the original equation, subtract it from the original and divide by the standard deviations
Manini Ojha (JSGP)
EC-14
Fall, 2020
227 / 237
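A sketch of both routes to the beta coefficients: run OLS on z-scores, or rescale the original slopes by σ̂xj /σ̂y. statsmodels is assumed and the data are simulated purely for illustration:

```python
# Sketch: beta coefficients via a regression on z-scores, checked against
# (sigma_xj / sigma_y) * beta_j-hat. statsmodels assumed; illustrative data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 1000
x1 = rng.normal(0, 3, n)
x2 = rng.normal(0, 1, n)
y = 2 + 0.4 * x1 + 1.5 * x2 + rng.normal(0, 2, n)

def zscore(v):
    return (v - v.mean()) / v.std(ddof=1)

res = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
res_z = sm.OLS(zscore(y), sm.add_constant(np.column_stack([zscore(x1), zscore(x2)]))).fit()

manual = res.params[1:] * np.array([x1.std(ddof=1), x2.std(ddof=1)]) / y.std(ddof=1)
print(res_z.params[1:])   # beta coefficients from the standardized regression
print(manual)             # same numbers from rescaling the original slopes
```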
Interpretation of beta coefficients
If x1 ↑ by one s.d. , then ŷ changes by b̂1 s.d.
Measuring effects not in terms of the original units of y or xj , but in
standard deviation units
In a standard OLS equation, cannot look at the size of different
coefficients and conclude
explanatory variable with the largest coefficient is “the most important.”
But, when each xj has been standardized
compelling to compare magnitudes of resulting beta coefficients
Note: when the regression equation has only a single explanatory
variable, x1 :
its standardized coefficient is simply the sample correlation coefficient
between y and x1
must lie in the range −1 to 1.
Manini Ojha (JSGP)
EC-14
Fall, 2020
228 / 237
Models with Logs in functional forms
Interpret coefficients here:
log(price)^ = 9.23 − 0.718 log(nox) + 0.306 rooms
β1 is the elasticity of price w.r.t. nox (pollution)
β2 is the change in log(price) when Δrooms = 1
multiply this by 100 and you will get an approximate percentage change in price
recall: 100·β2 is also called the semi-elasticity of price w.r.t. rooms
here: 30.6%
Manini Ojha (JSGP)
EC-14
Fall, 2020
229 / 237
But till now, this was a simplistic interpretation
Exact interpretation in a log-level functional form:
%Δŷ = 100·[exp(β̂2 Δx2 ) − 1]
Thus, the exact interpretation of β̂2 in the housing price example is:
when rooms increase by 1, i.e. Δrooms = 1, the percentage change in price is
%Δprice^ = 100·[exp(0.306) − 1] = 35.8%
Manini Ojha (JSGP)
EC-14
Fall, 2020
230 / 237
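The two interpretations differ exactly as the arithmetic below shows (numpy assumed; the coefficient is the 0.306 from the housing example):

```python
# Sketch (numpy assumed): approximate vs exact percentage change for the rooms coefficient.
import numpy as np

beta2 = 0.306
approx = 100 * beta2                   # 30.6%, the simple approximation
exact = 100 * (np.exp(beta2) - 1)      # about 35.8%, the exact log-level interpretation
print(approx, exact)
```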
More on logs
Advantages
1. We can be ignorant about the units of measurement of variables appearing in log form
   slope coefficients are invariant to rescaling
2. When y > 0, models using log(y) as the dependent variable often satisfy the CLRM assumptions more closely than models using the level of y (since the distribution looks more like a normal)
3. Taking the log of a variable often narrows its range. Narrowing the range of the y and x's can make OLS estimates less sensitive to outliers
Manini Ojha (JSGP)
EC-14
Fall, 2020
231 / 237
More on logs
Disadvantages
1. Sometimes, the log transformation can actually create extreme values
   when a y is between 0 and 1 (such as a proportion) and takes on values close to zero, log(y) can be very large in magnitude
2. Cannot be used if a variable takes on zero or negative values
Manini Ojha (JSGP)
EC-14
Fall, 2020
232 / 237
Models with Quadratics
Quadratic functions are also used quite often in applied economics to
capture decreasing or increasing marginal effects
ŷ = β̂0 + β̂1 x + β̂2 x²
Here,
Δŷ/Δx ≈ β̂1 + 2β̂2 x
The relationship between x and y depends on the value of x
Manini Ojha (JSGP)
EC-14
Fall, 2020
233 / 237
Example:
wage^ = 3.73 + 0.298 exper − 0.0061 exper²
        (0.35)  (0.041)       (0.0009)
n = 526, R² = 0.093
Estimated equation implies that exper has a diminishing effect on wage
What is the shape of the quadratic in this case (coeff on x is positive and coeff on x² is negative)?
Find the maximum of the function
Manini Ojha (JSGP)
EC-14
Fall, 2020
234 / 237
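Finding the maximum of the fitted quadratic is a one-line calculation; a sketch using the estimates quoted above (the turning point formula β̂1/(2|β̂2|) follows from setting the derivative to zero):

```python
# Sketch: the turning point of the fitted quadratic, exper* = beta1_hat / (2·|beta2_hat|).
b1_hat, b2_hat = 0.298, -0.0061
turning_point = b1_hat / (2 * abs(b2_hat))
print(turning_point)   # about 24.4 years of experience, where the fitted wage effect peaks
```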
When models have quadratics
shape can be U-shaped (β1 negative and β2 positive) or
inverse U-shaped / hump-shaped (β1 positive and β2 negative)
Manini Ojha (JSGP)
EC-14
Fall, 2020
235 / 237
Models with interaction terms
Sometimes, partial effect, elasticity, or semi-elasticity of the y wrt an
explanatory variable may also depend on the magnitude of another
explanatory variable
price = β0 + β1 sqrft + β2 bdrms + β3 sqrft· bdrms + β4 bthrms + u
Here partial effect of bdrms on price is given by
Δprice/Δbdrms = β2 + β3 sqrft
If β3 > 0, then
an additional bedroom leads to higher housing price for larger houses
Manini Ojha (JSGP)
EC-14
Fall, 2020
236 / 237
Interaction term: leads to an interaction effect between square
footage and # bedrooms
For summarizing interaction effects, typically evaluate the effect at
mean value, upper quartile, lower quartile
i.e. evaluate the effect of bdrms on price at mean value, upper and
lower quartile, max and min, of sqrft
Interesting to look at the average partial effect (APE):
β2 + β3 · (sample average of sqrft)
Manini Ojha (JSGP)
EC-14
Fall, 2020
237 / 237
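To close, a sketch of how one might evaluate the partial effect β2 + β3·sqrft at a few values of sqrft and at its mean. The coefficients and sqrft values below are hypothetical placeholders, not estimates from the notes:

```python
# Sketch: evaluating the partial effect of bdrms, beta_2 + beta_3·sqrft, at chosen
# values of sqrft. The coefficients and sqrft values are hypothetical placeholders.
import numpy as np

beta2, beta3 = -20.0, 0.05
sqrft = np.array([1200.0, 1800.0, 2500.0])   # e.g. lower quartile, mean, upper quartile
partial_effect = beta2 + beta3 * sqrft
ape = beta2 + beta3 * sqrft.mean()           # average partial effect at the mean of sqrft
print(partial_effect, ape)
```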