MA Econometrics I
Dr. Arnaud Chevalier
Department of Economics
University College Dublin
September 2003
Table of Contents
INTRODUCTION
LECTURE 2: THE CLASSICAL LINEAR REGRESSION MODEL
LECTURE 3: MULTIVARIATE MODEL AND HYPOTHESIS TESTING
1 Introduction to Econometrics and the Classical Linear Regression Model
Wooldridge, Chapter 2,3
1.1 Introduction to Econometrics
What is econometrics and what do econometricians do? Basically, they try to answer questions as diverse as making forecasts, assessing the safety of nuclear power plants, testing theories to make extra profits on the stock exchange, or evaluating the efficiency of policies. So you may need econometrics not only to get your Master's but also in various jobs (industry, banking, consultancy, academia…).
Compared with mathematical statistics, the difference is that econometricians rely mostly on observational rather than experimental data. This creates specific problems, both theoretical and empirical. So for this course, we will put the emphasis on how to conduct an empirical econometric analysis, and then introduce the necessary theory.
Because we use data to answer quantitative questions, our answers will vary if we use a different set of data. Thus, not only should we provide an answer but also a measure of how precise this answer is. (Unlike in the Hitchhiker's Guide to the Galaxy, where the answer is always 42!). Generally, an econometric analysis is conducted to test some hypotheses, so after presenting the simple model, we will move on to the basics of testing.
Data specificity means that we will have to depart from the simple model in more than one way, which is what the various remaining chapters are all about. As Malinvaud (1966) put it: 'The art of the econometrician consists in finding the set of assumptions which are both sufficiently specific and sufficiently realistic to allow him to take the best possible advantage of the data available to him'. Hendry says not to confuse econometrics with economic-tricks or economystics!!!
Whilst experimental data would greatly simplify our task, such data are rarely available in social sciences (for ethical reasons). In this course, we will be concerned with two types of data: cross section, where typically a large (random) sample of the population of interest is surveyed at one point in time, and time series, where the evolution of a few variables over time is recorded. Cross sections can also be pooled across time, in order to increase the sample size and, more importantly, to assess how a key relationship changes over time (for example after a policy change). Cross section data is typically used in micro-economic analysis while time series are more associated
with macro-analysis and finance. The analysis of time series is complicated by issues of
seasonality, trend and persistence, but we will see how to deal with these issues.
Other types of data exist, such as panel (longitudinal) data, where observations on the same units are recorded repeatedly over time (for example, a yearly survey of the same households), and duration data, where the econometrician is interested in the lapse of time between two events (e.g. between a treatment and death, or the length of an unemployment spell). Each data type allows you to answer specific questions but has its own difficulties. So the first task in any project is to determine what the appropriate data are to answer your question of choice. And remember, bad data will always make the life of the econometrician much harder.
1.2 The Paradigm of Econometrics
We posit the existence of an underlying "structure", a "true" model of an economic phenomenon. We have long studied economic theories of optimization, models of labor supply, demand equations, etc. Moving on to econometrics allows a number of other issues to be formally assessed, such as understanding covariation, predicting outcomes based on knowledge and partial forecasts, and controlling future outcomes using knowledge of relationships.
The other debate, of course, concerns the overall merits of econometrics as a means of testing economic theory - a number of issues arise in the econometrician's approach (often as a means to simplify estimation and maintain statistical validity) that cause problems. Looking back as far as the 1930s, Keynes entered a debate with Tinbergen (the father of the econometric discipline and its first Nobel prize winner) in the Economic Journal which covers many of the issues we will deal with in this course.
(i) Omitted Variables
Keynes (p.559) quotes from Tinbergen's book that "The part which the statistician can play in the process of the analysis must not be misunderstood. The theories which he submits to examination are handed over to him by the economist; and with the economist the responsibility for them must remain; for no statistical test can prove a theory to be correct". Tinbergen does go on to state, again reported by Keynes (p.559), that "It can, indeed, prove that theory to be incorrect, or at least incomplete, by showing that it does not cover a particular set of facts."
Keynes claims that this implies that the statistician therefore requires “the economist having
furnished ... a complete list” to be able to assign statistical properties to the estimators.
(ii) Unobservable Variables
Tinbergen (quoted by Keynes (p.561)) states that “The inquiry is, by its nature, restricted to the
examination of the measurable phenomena. Non-measurable phenomena may, of course, at times
exercise an important influence on the course of events; and the result of the present analysis
must be supplemented by such information about the extent of that influence as can be obtained
from other sources". Keynes lists as unmeasurable and potentially important variables "political, social and psychological, including such things as government policy, the progress of invention and the state of expectation".
(iii) Linearity
Tinbergen (quoted by Keynes (p.563)) notes that “As a rule, curvilinear relations are considered
in the following studies only in so far as strong evidence exists. A rough way of introducing the
most important features of curvilinear relations is to use changing coefficients ... Another way ..
is to take squares of variates or still other functions, among the ‘explanatory series’ ”. However,
Keynes is sceptical arguing that (p. 564) “it is a very drastic and indeed usually improbable
postulate to suppose that all economic forces are of this character, producing independent changes
in the phenomenon under investigation which are directly proportional to the changes themselves;
indeed this is ridiculous”.
(iv) Specification
Keynes notes that (p.155): “It will be remembered that the seventy translators of the Septuagint
were shut up in seventy separate rooms with the Hebrew text and brought out with them, when
they emerged, seventy identical translations. Would the same miracle be vouchsafed if seventy
multiple correlators were shut up with the same statistical material?”
(v) Structural Change
Keynes believes that to draw any use from the statistical analysis it is important that the model be
stable (p.567) “The first step, therefore, is to break up the period under investigation into a series
of sub-periods , with a view to discovering whether the results of applying our method to the
various sub-periods taken separately are reasonably uniform.”
In fact he argues that economic problems are sufficiently difficult and unstable that (p.567) the main objection to "the application of the method of multiple correlation to complex economic problems lies in the apparent lack of any adequate degree of uniformity in the environment".
In summary, therefore Keynes believes a lot more thought is required and he says of Tinbergen
(and of statisticians, in general) (p.559) that “he is much more interested in getting on with the
job than in spending time in deciding whether the job is worth getting on with”.
In Hendry's 'Alchemy or Science' we get some definitions. Science deals with material phenomena and is mainly based on observation, experimentation and induction (induction: inferring a general law from particular instances); alchemy is the idea of being able to extract gold or silver from base metals via a chemical process. According to Hendry the definition of econometrics is (p.388) "An analysis of the relationship between economic variables ... by abstracting the main phenomena of interest and stating theories thereof in mathematical form."
Hendry adds to the list of Keynes points the pressing need to improve the quality of the data,
which has seriously lagged behind the increasingly complex and sophisticated techniques
available to the applied researcher for analysing such data. In fact, the techniques have only
become so sophisticated in an attempt to address the issue of data deficiencies.
However, econometrics has a poor reputation. Worswick - Econometricians are not "engaged in foraging for tools to arrange and gather facts, so much as making a marvellous array of pretend-tools" (1972); Brown - "running regressions between time series is only likely to deceive" (1972); Leontief - Econometrics is "an attempt to compensate for glaring weaknesses of the data available to us by the widest possible use of more and more sophisticated statistical techniques" (1971); Coase - "If you torture the data long enough, nature will confess" (p.37); Leamer - "Econometricians, like artists, tend to fall in love with their models".
Leamer draws a distinction between econometrics and science: science is where controlled experiments are undertaken; while there is clearly some uncertainty associated with these experiments, the error is small and the conclusions are therefore "tight". In economics no experimentation is practical, the possibilities/uncertainties are boundless and one is only constrained in the variables used by one's imagination; even here there may be "influential monsters lurking beyond our immediate field of vision", and consequently the errors are potentially enormous. This implies that in economics the researcher would either:
(i) look at the data first; however, to look at the data might bias ones judgement, as theories based
on the data are difficult to reject based on looking at the data. To illustrate this point he quotes the
applied researcher (p.40) who thinks “that a certain coefficient should be positive, and their
reaction to the anomalous result of a negative coefficient is to find another variable to include in
the equation so that the estimate is positive. Have they found evidence that the coefficient is
positive?”
This is particularly concerning given that the researcher’s art (p.36) “is practised at
the computer terminal (and) involves fitting many, perhaps thousands, of statistical models. One
or several that the researcher finds pleasing are selected for reporting purposes” - selective
reporting is clearly problematic.
or
(ii) require an infinitely wide field of vision in order to make discoveries, or as Leamer (p.40)
notes “the great human discoveries are made when the horizon is extended for reasons that cannot
be predicted in advance and cannot be computerised. If you wish to make such discoveries, you
will have to poke at the horizon, and poke again.”
To illustrate the problem of undertaking applied research with a strong prior view of the model, Leamer uses the example of the effects of execution (denoted PX) on murder rates (denoted M) in 44 states in the U.S. (of which 35 have executions and 9 are non-executing). He assumes there are 5 researchers, each with a strong prior described below:
No   Individual            Important (key) variables
1    Right winger          Punishment works (PC, PX, T)
2    Rational maximiser    Economic return to crime (PC, PX, T, W, X, U, LF)
3    Eye-for-an-eye        Probability of execution (PC, PX)
4    Bleeding Heart        Economic hardship (W, X, U, LF)
5    Crime of Passion      Punishment doubtful (W, X, U, LF, NW, AGE, URB, MALE, FAMHO, SOUTH)
For example, the right winger simply believes the main determinants of murder rates are the punishment variables (PC - probability of conviction, PX - probability of execution, and T - median sentence served for murder), while the Bleeding Heart is solely interested in the economic deprivation variables (W - median income, X - % of families below 0.5 of W, U - unemployment rate, LF - labour force participation). Each researcher views the listed variables as the important variables (denoted I) to be included in any model and is prepared to take any linear combination of the other, doubtful (denoted D) variables (see Table 2). Leamer obtains Table 3, which reports the sensitivity of the coefficient on PX under various alternative models. Only researchers 1 and 2 obtain a consistent coefficient on PX (<0). All others do not find a consistent coefficient and could therefore find any result they please.
McAleer argues that the problem with investigating the fragility of estimates can be seen in the following example. If the true model is

y_t = α + β₁x_t + β₂z_t + u_t      (T)

and you estimate

y_t = α* + β₁*x_t + v_t      (E)

then the OLS estimate b₁* is a biased estimate of β₁, with the bias depending on the sign of β₂ and cov(x_t, z_t) - and this bias could be substantial. Thus there is no reason to believe that if z_t is a doubtful variable (D in Leamer's terminology) the coefficient estimate on x_t should be insensitive to its exclusion. McAleer et al. suggest a 5-step methodology for careful analysis:
(i) Consistency with theory
(ii) Significance, both statistical and economic
(iii) Indexes of adequacy ("test, test, test" of Hendry)
(iv) Fragility or sensitivity (to new data rather than EBA)
(v) Encompassing (should dominate all competing models)
McAleer explains the similarity between applied econometric analysis and criminal deduction by
noting that “Both criminal investigation and econometric analysis involve determining the
importance of and collection of data, and a final explanation of the data after previous
explanations (if any) have been rejected against the available evidence”. However, the two
disciplines depart in that “Like econometricians, Sherlock Holmes is in search of the truth that
generated the data. Confronted with a crime or problem, Holmes assiduously gathers data which
are needed for a suitable explanation. Unlike econometrics, however, his searches will frequently
yield the truth and the culprit will be apprehended”.
The difference is that somebody committed the murder (crime), while in economics the truth, the process that generated the data, does not exist and at best you might get an adequate approximation of the process that is not obviously incorrect. However it is still useful to consider how Holmes undertakes a criminal investigation, and this is divided into 5 sections.
(i) Theory
“I have no data yet. It is a capital mistake to theorise before one has data. Insensibly one begins to
twist facts to suit theories, instead of theories to suit facts”. (Sherlock Holmes to Dr. Watson in A
Scandal in Bohemia). Holmes relied solely upon data in formulating his theories. He had no prior
beliefs before starting a case as this would necessarily limit the number of possible suspects. This
idea contrasts markedly with that of the classical procedure for conducting statistical inference,
whereby formulation of a theory always precedes examination of the data. However, one should,
in the final analysis, ensure that “Data … be given the last word in deciding the validity of a
theory” (p.322).
(ii) Quality of data
"It is of the highest importance in the art of detection to be able to recognise out of a number of facts which are incidental and which vital" (Sherlock Holmes to Colonel Hayter in The Adventure of the Reigate Squire). Holmes, like econometricians, did not have the possibility of conducting experiments, but would always be prepared to test his theories against new data. Irrelevant data would always be likely to be rejected. Holmes (like econometricians) was interested in true (not spurious) relationships that explained how crimes (the dependent variable) were perpetrated (explained).
(iii) Truth
Unlike in economics, the idea of truth exists for Holmes, in that somebody committed the crime: "We must fall back upon the old axiom that when all other contingencies fail, whatever remains, however improbable, must be the truth" (Sherlock Holmes to Dr. Watson in The Adventure of the Bruce-Partington Plans). However, the applied econometrician seeks to explain an unknown, and frequently unobservable, relation between numerous interdependent factors - economic puzzles are far more complex than criminal problems. Even if a relationship exists, there is no reason why it should remain constant over time.
(iv) Reconciliation with data
"What do you think of my theory? … When new facts come to our knowledge which cannot be covered by it, it will be time enough to reconsider it" (Sherlock Holmes to Dr. Watson in The Adventure of the Yellow Face). Holmes was frequently willing to change his position to examine the effects on his theory. This is known as statistical robustness in economics, which requires that the model be robust to new data and be reconciled with competing theories.
(v) Testing
"… it is well to test everything" (Sherlock Holmes to Dr. Watson in The Adventure of the Reigate Squire). For Holmes, like econometricians, it is important to test all parts of any theory for weak links. It is to the credit of Holmes (and some econometricians) that he is prepared to abandon a cherished theory in the light of new data which contradict it. However, Holmes makes these decisions in the face of virtual certainty, rather than the very hazy and uncertain world of economics. In fact this uncertainty has ensured some models have survived well beyond their use-by date.
Topic II: The Classical Linear Regression Model
2.1 The linear regression model
We believe there is a causal link between class size and achievement. However, there is a tradeoff: more teachers mean a higher wage bill. To justify her case, the local headteacher asks you to estimate the effect of a change in class size on test scores.
Basically, you are thinking of the following relationship:

β₁ = ΔTest / ΔSize      (2.1)

(2.1) can be seen as the definition of a slope coefficient; thus a straight line relating test score to class size can be written as:

Test = β₀ + β₁·Size      (2.2)

where β₀ is the intercept (the test score with a class size of 0!! In some applications, the intercept is not meaningful).
All smug, you go back to the headteacher, who tells you off for not including various other characteristics of a school that will affect test performance. She is right (and we will see in the next chapter how to deal with her criticism), but for the moment all we want to say is that (2.2) is true on average; all these factors (some of them unobservable) are grouped into an error term.
So in general, we believe there is a linear relationship between an independent variable (X) and a
dependent variable (Y). This relationship holds on average, thus for each observation (i) there
exists an error term (ui). Thus, we have the generic equation:
Y_i = β₀ + β₁X_i + u_i      (2.3)

[insert Slide 6, from Dougherty, CD1] (1)

β₀ + β₁X is called the population regression line. β₀ and β₁ are the coefficients (parameters) to be estimated using the available data.
2.2 Estimating the coefficients of the linear regression model

As in statistics, we do not know the value of β₁ in the population, but we estimate it from a sample of data. For example, looking at the data on test scores and the pupil teacher ratio, how do we get to estimate β₀ and β₁? An eyeball option is not the solution.
Figure: Distribution of test score and pupil teacher ratio
[Figure: scatter of test score (roughly 600-700) against student teacher ratio (roughly 15-25).]

Variable   Obs   Mean       Std. Dev.   Min      Max
str        420   19.64043   1.891812    14       25.8
testscr    420   654.1565   19.05335    605.55   706.75

correlation str/testscr = -0.23
So let’s first start with a simplistic example, where we only have three observation points.
Insert slide 2 from Dougherty CD2 (2)
The Ordinary Least Squares (OLS) estimator chooses the regression coefficients so that the estimated regression line is "as close as possible" to the observed data. How do we measure closeness?
To measure closeness, we rely on the residuals: The difference between the predicted and
observed value of the outcome. So let’s pretend that we have estimates of  0 and  1 , say b0 and
b1. We can define the fitted (predicted) value of Yi as:
Ŷ_i = b₀ + b₁X_i      (2.4)

Then the residuals are:

e_i = Y_i − Ŷ_i      (2.5)

(! do not confuse the error, u_i = Y_i − β₀ − β₁X_i, with the residual, e_i = Y_i − b₀ − b₁X_i …. Expand on this)
Insert slide 10 from Dougherty CD1 (3)
The OLS estimates of β₀ and β₁, b₀ and b₁, minimise the sum of squared residuals. Why minimise the sum of squared residuals?
- Minimising the sum of the residuals does not work: positive and negative residuals cancel out.
- Minimising the sum of the absolute values of the residuals leads to more complicated calculations.
So how do we calculate b₁ and b₂?
Insert Dougherty 4 CD2, (4)
So we want to minimise the Residual Sum of Squares (RSS).

RSS = e₁² + e₂² + e₃² = (3 − b₁ − b₂)² + (5 − b₁ − 2b₂)² + (6 − b₁ − 3b₂)²

    = 9 + b₁² + b₂² − 6b₁ − 6b₂ + 2b₁b₂
    + 25 + b₁² + 4b₂² − 10b₁ − 20b₂ + 4b₁b₂
    + 36 + b₁² + 9b₂² − 12b₁ − 36b₂ + 6b₁b₂

    = 70 + 3b₁² + 14b₂² − 28b₁ − 62b₂ + 12b₁b₂

To minimise RSS, the partial derivatives of RSS with respect to b₁ and b₂ should be equal to 0. (For a minimum, the second derivatives should also be positive.)

So we have:

∂RSS/∂b₁ = 0:  6b₁ + 12b₂ − 28 = 0
∂RSS/∂b₂ = 0:  12b₁ + 28b₂ − 62 = 0

=> b₁ = 1.67, b₂ = 1.50
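A minimal numerical check of this example (Python/numpy sketch; the data points (X, Y) = (1, 3), (2, 5), (3, 6) are those implied by the RSS expression above, not stated explicitly in the slide):

import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 5.0, 6.0])

# Solve the two first-order conditions 6*b1 + 12*b2 = 28 and 12*b1 + 28*b2 = 62
b1, b2 = np.linalg.solve(np.array([[6.0, 12.0], [12.0, 28.0]]),
                         np.array([28.0, 62.0]))
print(b1, b2)                     # 1.666..., 1.5

# Same answer from least squares on the design matrix [1, X]
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(3), X]), Y, rcond=None)
print(coef)                       # [1.666..., 1.5]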
In a more general case, we will have more than 3 observations, so RSS will be defined as:

RSS = e₁² + ... + e_n² = (Y₁ − b₁ − b₂X₁)² + ... + (Y_n − b₁ − b₂X_n)²

Expanding each square and summing over the n observations:

RSS = ΣY_i² + nb₁² + b₂²ΣX_i² − 2b₁ΣY_i − 2b₂ΣX_iY_i + 2b₁b₂ΣX_i

∂RSS/∂b₁ = 0:  2nb₁ − 2ΣY_i + 2b₂ΣX_i = 0
=>  nb₁ = ΣY_i − b₂ΣX_i
=>  b₁ = Ȳ − b₂X̄

∂RSS/∂b₂ = 0:  2b₂ΣX_i² − 2ΣX_iY_i + 2b₁ΣX_i = 0
=>  b₂ΣX_i² − ΣX_iY_i + b₁ΣX_i = 0

Substituting b₁ for its value, we get:

b₂ΣX_i² − ΣX_iY_i + (Ȳ − b₂X̄)ΣX_i = 0

So we get:

b₂ΣX_i² − ΣX_iY_i + (Ȳ − b₂X̄)nX̄ = 0
b₂(ΣX_i² − nX̄²) = ΣX_iY_i − nX̄Ȳ
b₂[(1/n)ΣX_i² − X̄²] = (1/n)ΣX_iY_i − X̄Ȳ
b₂ Var(X) = Cov(X, Y)

b₂ = Cov(X, Y) / Var(X)

where X̄ = ΣX_i / n.
Alternatively, b₂ can be written:

b₂ = [(1/n)Σ(X_i − X̄)(Y_i − Ȳ)] / [(1/n)Σ(X_i − X̄)²] = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)²

or

b₂ = [(1/n)ΣX_iY_i − X̄Ȳ] / [(1/n)ΣX_i² − X̄²] = [ΣX_iY_i − nX̄Ȳ] / [ΣX_i² − nX̄²]

The OLS estimates are given by:

b₁ = Ȳ − b₂X̄
b₂ = Cov(X, Y) / Var(X)      (2.6)
Which are all equivalent. In the next lecture, we will introduce matrix notations.
So back to our pupil teacher example. Recall that:

Corr(X, Y) = Cov(X, Y) / √(Var(X)·Var(Y))

So using the summary statistics and the correlation coefficient, we can calculate:

b₂ = −2.28

and

b₁ = 654.1565 − (−2.28)·19.64043 = 698.9

An increase in the number of students per class by 1 is associated on average with a reduction in test score of 2.28 points. Alternatively, we can predict that in a school where the pupil teacher ratio averages 20, the average test score will be 653.3.
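A short Python sketch reproducing these numbers from the summary statistics above (the correlation is taken as the rounded -0.23, so the results differ very slightly from the text):

sd_str, sd_test = 1.891812, 19.05335
mean_str, mean_test = 19.64043, 654.1565
corr = -0.23

cov_xy = corr * sd_str * sd_test        # Cov(X, Y)
b2 = cov_xy / sd_str**2                 # Cov(X, Y) / Var(X), about -2.3
b1 = mean_test - b2 * mean_str          # about 699

print(b1, b2)
print(b1 + b2 * 20)                     # predicted score for STR = 20, about 653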
[Figure: scatter of test score against student teacher ratio with the fitted OLS regression line.]
Econometrics is all about providing answers to practical problems, so what should our advice to the head teacher be? Let's assume that the school has the median characteristics of our sample: str = 19.7, test = 654.5.
Table 2: Distribution of student teacher ratio and test score

Percentile   STR        Test
1%           15.13898   612.65
5%           16.41658   623.15
10%          17.34573   630.375
25%          18.58179   640
50%          19.72321   654.45
75%          20.87183   666.675
90%          21.87561   679.1
95%          22.64514   685.5
99%          24.88889   698.45
Reducing the str by 2 pupils will move the school close to the best 10% on the STR, and will increase the average test score of the school to about 659 points (just short of the 60th percentile), so the school will almost become one of the top 40% performers in the country, with close to the best 10% student teacher ratio. Depending on the cost of extra teachers and how much parents value extra test points, the decision to hire new teachers may or may not be cost-beneficial.
What if the head teacher had more radical plans and wanted to cut the STR to 10? Here we cannot say anything, because we do not observe schools with such a small ratio. Our inference would be based solely on the linearity assumption of OLS (like our prediction of the test score in a class with no pupils). This is an identification problem (Manski, 1995) - draw an example. Identification problems cannot be solved by collecting more of the same data. Inference can only be safely made for values for which we have some data; out-of-sample predictions will be unreliable. The relationship between STR and Test may be really different for very small or very large values of the STR, but with the available data we have no way of knowing.
So we have solved our first econometric problem; let's now review what we have learnt and state clearly the assumptions that are necessary to make this inference.
2.3 Assumptions behind OLS

Here are the assumptions needed for OLS to provide an appropriate estimator of the unknown regression coefficients β₀ and β₁.

- Assumption 1: The conditional mean of u is 0

E(u_i | X_i) = E(u_i) = 0      (A2.1)
E(u) = 0: the unobservable characteristics affecting the outcome of interest have a mean of 0 on average. This is not really restrictive and can be obtained by normalisation: for example, teacher quality affects test scores, but within our sample the mean effect of teacher quality is normalised to 0.
E(u|X) = 0: this is the most important part of the assumption; it states that the average value of u does not depend on the value of X. The observed characteristics and the error terms are uncorrelated. If X_i and u_i are correlated then the conditional mean assumption is violated, and OLS estimates are biased.
Figure 1: Conditional probability distribution and population regression function
[Figure: densities f(u) of Y at X₁, X₂ and X₃, centred on the population regression line.]
The conditional mean assumption is also crucial to derive that:

E(Y | X) = β₀ + β₁X

The population regression function is equal to the conditional mean of Y: E(Y|X) is a linear function of X. Thus, we can make statements such as: an increase of X of 1 unit leads to a change of Y of β₁.
- Assumption 2: (X_i, Y_i) are independently and identically distributed.

Observations have been drawn at random from the population and are independent of each other. This assumption is usually broken in time series: interest rates today are not independent of their value yesterday.

- Assumption 3: The population variance of u is constant for all i.

Formally, this condition can be written as: Var(u_i) = σ² for all i.

Of course σ² is unknown. This property is known as homoskedasticity (constant variance). Draw heteroskedasticity on Figure 1.
These assumptions are sometimes referred to as the Gauss-Markov conditions.
- Assumption 4: Normality

One usually assumes that the error term is normally distributed. This will be especially useful for hypothesis testing.
2.4 Properties of OLS
- Why, among the numerous estimators created by econometricians, is OLS the most popular?
Let's define a few more concepts:
- Total Sum of Squares (SST): SST = Σᵢ₌₁ⁿ (y_i − ȳ)²

SST measures the dispersion of the outcome of interest around its mean.

- Explained Sum of Squares (SSE): SSE = Σᵢ₌₁ⁿ (ŷ_i − ȳ)²

SSE measures the sample variation in the fitted values ŷ_i (note that the mean of the fitted values equals ȳ).

- Residual Sum of Squares (SSR): SSR = Σᵢ₌₁ⁿ e_i²

SSR measures the sample variation in the residuals.

We have SST = SSE + SSR.

To measure how well OLS fits the data, we can look at the ratio of the explained and total variance; this ratio is called the R².

R² = SSE / SST = 1 − SSR / SST

If the fit is perfect, SSR = 0 and R² = 1.
If the OLS fit is bad, SSE = 0 and R² = 0.
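A minimal Python sketch of the variance decomposition and R² on simulated data (the data and coefficient values here are purely illustrative, not from the notes):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 0.5 * x + rng.normal(size=200)

b2 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # Cov(X,Y)/Var(X)
b1 = y.mean() - b2 * x.mean()
y_hat = b1 + b2 * x
e = y - y_hat

SST = np.sum((y - y.mean())**2)
SSE = np.sum((y_hat - y.mean())**2)
SSR = np.sum(e**2)

print(np.isclose(SST, SSE + SSR))     # True: SST = SSE + SSR
print(SSE / SST, 1 - SSR / SST)       # both equal R-squared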
OLS provides unbiased estimates of the regression coefficients.

Proof (using b₁ for the intercept and b₂ for the slope, as in the derivation above):

b₂ = Cov(X, Y) / Var(X) = Cov(X, β₁ + β₂X + u) / Var(X)
   = [Cov(X, β₁) + Cov(X, β₂X) + Cov(X, u)] / Var(X)
   = [0 + β₂Cov(X, X) + Cov(X, u)] / Var(X)
   = β₂ + Cov(X, u) / Var(X)

E(b₂) = E[β₂ + Cov(X, u)/Var(X)] = E(β₂) + E[Cov(X, u)]/Var(X) = β₂

b₁ = Ȳ − b₂X̄ = β₁ + β₂X̄ + ū − b₂X̄

E(b₁) = E(β₁) + E(β₂X̄) + E(ū) − E(b₂X̄)
      = β₁ + β₂X̄ + 0 − X̄E(b₂) = β₁ + β₂X̄ − β₂X̄ = β₁

So the OLS estimates are unbiased estimates of the intercept and the slope. In fact it can be proven that under the Gauss-Markov assumptions, OLS is the Best Linear Unbiased Estimator (BLUE).

Remember that the values of the estimates are specific to the sample used. If we have by chance used a non-representative sample, our point estimate will be far from the true value. If our sample is representative, then the larger the sample, the closer to the true value we are likely to be (Central Limit Theorem).
Estimated variance:

b₀ and b₁ are random variables (they depend on the sample), so they have a distribution. Here, we only state the expression of their variance:

Var(b₀) = (σ²/n)·[1 + X̄²/Var(X)]

Var(b₁) = σ² / (n·Var(X))

These formulae are only valid in the presence of homoskedasticity and in the hypothetical case where σ² is known. For all purposes, we are mostly interested in Var(b₁): i) the larger the error variance, the larger the variance of our estimate; ii) the more variability in the independent variable, the more precise our estimate.

[slide 10, in Dougherty, C3D3] (5)

σ̂² = (1/(n−2)) Σᵢ₌₁ⁿ e_i² = SSR / (n−2)

σ̂ is interchangeably called the standard error of the regression or the root mean squared error (in Stata). Standard errors of our estimates can now be produced.
Lecture 3: Multivariate model and hypothesis testing
Last week, we studied the relationship between class size and test score, but we put a cautionary note on our results: the hypothesis that class size (X_i) and the error term (u_i) were uncorrelated appeared dubious. If it is untrue, we know that the OLS estimates are biased; in the notation of the previous derivation, the slope estimate satisfies:

b₂ = β₂ + Cov(X, u) / Var(X)
This problem was due to omitted factors, which we think affect class size and student scores. For
example, richer parents may put their children in schools with smaller class size, and pay for extra
tuition. To limit omitted variable bias, we use multi-variate regression. By including more
regressors, we can estimate the effect of class size on score, holding constant these other
variables.
In the second part of this lecture, we build confidence intervals for our estimates and review
various tests.
3.1 A simple example
Say that we are interested in the effect of education on earnings but we are concerned that ability
also affects education and earnings, so in order to estimate the unbiased effect of education on
earnings we want to control for ability. If ability is not included in the model, then the error term
will be correlated with education, and lead to biased estimate.
Thus, we specify the following model:

ln Y = β₀ + β₁S + β₂AS + u      (3.1)

where S is years of schooling and AS is an ability score.
[Figure: effect of education (S) and ability (AS) on log earnings - the combined effect β₀ + β₁S + β₂AS + u decomposes into the pure effect of education and the pure effect of ability.]
To estimate the coefficients we, as before, minimise the residual sum of squares (RSS):

RSS = Σᵢ₌₁ⁿ e_i²

where

e_i = Y_i − Ŷ_i = Y_i − b₁ − b₂X_2i − b₃X_3i
The first order conditions for a minimum are:

∂RSS/∂b₁ = −2 Σᵢ₌₁ⁿ (Y_i − b₁ − b₂X_2i − b₃X_3i) = 0

∂RSS/∂b₂ = −2 Σᵢ₌₁ⁿ X_2i (Y_i − b₁ − b₂X_2i − b₃X_3i) = 0

∂RSS/∂b₃ = −2 Σᵢ₌₁ⁿ X_3i (Y_i − b₁ − b₂X_2i − b₃X_3i) = 0
hence:

b₁ = Ȳ − b₂X̄₂ − b₃X̄₃

b₂ = [Cov(X₂,Y)Var(X₃) − Cov(X₃,Y)Cov(X₂,X₃)] / [Var(X₂)Var(X₃) − Cov(X₂,X₃)²]

b₃ = [Cov(X₃,Y)Var(X₂) − Cov(X₂,Y)Cov(X₂,X₃)] / [Var(X₂)Var(X₃) − Cov(X₂,X₃)²]      (3.2)
Multiple regression analysis allows one to discriminate between the effects of the explanatory variables, making allowance for their possible correlation. The coefficient of each X variable provides an estimate of its influence on Y controlling for all other X variables. We can demonstrate this and see it in a simple example.

As in the simple model, b₁ is the intercept, while b₂ and b₃ are the slope coefficients of X₂ and X₃ respectively. The interpretation of b₂ is now the effect on Y of a unit change in X₂ holding X₃ constant. X₃ is often called a control variable.

We are interested in the change in Y for a change in X₂ and no change in X₃:

Y = β₁ + β₂X₂ + β₃X₃ + u
Y + ΔY = β₁ + β₂(X₂ + ΔX₂) + β₃X₃ + u

=>  β₂ = ΔY / ΔX₂
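A quick numerical check of the covariance formulas in (3.2) (a Python/numpy sketch on simulated, purely illustrative data; the coefficients match a direct least-squares fit):

import numpy as np

rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x3 = 0.6 * x2 + rng.normal(size=n)               # correlated regressors
y = 1.0 + 2.0 * x2 - 1.0 * x3 + rng.normal(size=n)

v2, v3 = np.var(x2), np.var(x3)
c2y = np.cov(x2, y, bias=True)[0, 1]
c3y = np.cov(x3, y, bias=True)[0, 1]
c23 = np.cov(x2, x3, bias=True)[0, 1]
den = v2 * v3 - c23**2

b2 = (c2y * v3 - c3y * c23) / den
b3 = (c3y * v2 - c2y * c23) / den
b1 = y.mean() - b2 * x2.mean() - b3 * x3.mean()

X = np.column_stack([np.ones(n), x2, x3])
print(np.linalg.lstsq(X, y, rcond=None)[0])      # matches [b1, b2, b3]
print(b1, b2, b3)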
. reg linc school

Table 3.1

      Source |       SS       df       MS           Number of obs =    4061
-------------+------------------------------        F(  1,  4059) =  697.20
       Model |  187.090253     1  187.090253        Prob > F      =  0.0000
    Residual |  1089.20575  4059  .268343373        R-squared     =  0.1466
-------------+------------------------------        Adj R-squared =  0.1464
       Total |    1276.296  4060  .314358622        Root MSE      =  .51802

------------------------------------------------------------------------------
        linc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      school |   .1713449   .0064892    26.40   0.000     .1586225    .1840672
       _cons |   1.760449    .015451   113.94   0.000     1.730156    1.790741
------------------------------------------------------------------------------
To get a true estimate of the returns to education, we want to hold ability constant. This is done in a multivariate regression.
. reg linc school ability

Table 3.2

      Source |       SS       df       MS           Number of obs =    3966
-------------+------------------------------        F(  2,  3963) =  382.03
       Model |  197.859286     2  98.9296432        Prob > F      =  0.0000
    Residual |  1026.25968  3963  .258960302        R-squared     =  0.1616
-------------+------------------------------        Adj R-squared =  0.1612
       Total |  1224.11896  3965  .308731138        Root MSE      =  .50888

------------------------------------------------------------------------------
        linc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      school |   .1525172   .0068646    22.22   0.000     .1390588    .1659757
     ability |   .0702532   .0089544     7.85   0.000     .0526975     .087809
       _cons |   1.793294   .0158411   113.21   0.000     1.762236    1.824351
------------------------------------------------------------------------------
Alternatively, we could purge income and schooling of the effect of ability. To do so, we estimate the auxiliary regressions

linc = c₁ + c₂·Abil  and  S = d₁ + d₂·Abil

and compute the fitted values. We then calculate the residuals Rlinc = linc − fitted(linc) and RS = S − Ŝ. If we now regress Rlinc on RS, we estimate the effect of education on income, accounting for ability.
. reg rlinc rs

Table 3.3

      Source |       SS       df       MS           Number of obs =    3966
-------------+------------------------------        F(  1,  3964) =  493.76
       Model |  127.832486     1  127.832486        Prob > F      =  0.0000
    Residual |  1026.25967  3964  .258894972        R-squared     =  0.1108
-------------+------------------------------        Adj R-squared =  0.1105
       Total |  1154.09216  3965  .291069901        Root MSE      =  .50882

------------------------------------------------------------------------------
       rlinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          rs |   .1525172   .0068637    22.22   0.000     .1390605     .165974
       _cons |   8.56e-09   .0080795     0.00   1.000    -.0158404    .0158404
------------------------------------------------------------------------------
Note that the two methods provide the same estimate of the returns to education. So a multivariate model allows us to estimate the effect of a variable holding other characteristics constant. This method of partialling out is due to Frisch and Waugh and is important when estimating fixed effect regressions (for example with multiple observations on the same individual/family/firm).
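A small Python sketch of this partialling-out (Frisch-Waugh) result on simulated data (the variable names linc, school, abil mimic the Stata example but the numbers are made up):

import numpy as np

rng = np.random.default_rng(2)
n = 4000
abil = rng.normal(size=n)
school = 12 + 2 * abil + rng.normal(size=n)          # schooling depends on ability
linc = 1.8 + 0.15 * school + 0.07 * abil + 0.5 * rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# (a) multivariate regression of linc on school and ability
b_multi = ols(np.column_stack([np.ones(n), school, abil]), linc)

# (b) purge linc and school of ability, then regress residual on residual
Za = np.column_stack([np.ones(n), abil])
r_linc = linc - Za @ ols(Za, linc)
r_school = school - Za @ ols(Za, school)
b_fwl = ols(np.column_stack([np.ones(n), r_school]), r_linc)

print(b_multi[1], b_fwl[1])     # the two schooling coefficients coincide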
A simple regression of earnings on schooling would lead to biased estimates of the returns to education, since schooling and ability are correlated and ability has a positive effect on earnings. This is the omitted variable bias. Omitted variable bias will appear if:
- the regressor is correlated with the omitted variable
- the omitted variable determines the dependent variable
If an omitted variable determines the dependent variable, then it is included in the error term. If this variable is also correlated with X, then X_i and u_i are correlated. Omitted variable bias means that the OLS assumption that E(u_i|X_i) = 0 is broken: the correlation of X_i and u_i means that the conditional expectation of u is not zero.
From last week, we know that:

b₂ = β₂ + Cov(X, u) / Var(X)

or, equivalently,

b₂ = β₂ + ρ_{X,u} · (σ_u / σ_X)      (3.2)

The second term on the RHS of (3.2) is the bias due to the omitted variable. This bias does not depend on the sample size (increasing the sample size does not reduce omitted variable bias), and in this case b₂ is not a consistent estimator of β₂. The larger the correlation between X_i and u_i, the larger the bias. The direction of the bias depends on the sign of the correlation between X_i and u_i.
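A short simulation sketch of this point (illustrative Python; the true coefficients and the correlation between the included and omitted regressor are invented for the example):

import numpy as np

rng = np.random.default_rng(3)
# True model: y = 1 + 1*x + 1*z + u, but z is omitted and corr(x, z) > 0
for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    z = 0.5 * x + rng.normal(size=n)        # omitted variable, correlated with x
    y = 1.0 + 1.0 * x + 1.0 * z + rng.normal(size=n)
    b2 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    print(n, b2)    # stays near 1 + Cov(x, z)/Var(x) = 1.5 however large n gets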
Now consider the more general k-variable model, where

Y_t = β₁ + β₂X_2t + ... + β_kX_kt + u_t      (3.3)

Then, upon finding the parameter values β₁, β₂, ..., β_k that minimise the RSS, we get k conditions like the equations above, which are awkward to solve for exact expressions for β₁, β₂, ..., β_k. Matrices do make our life easier, therefore…
Consider now the matrix form equivalent, in which the basic problem is written as:

Y = Xβ + ε

where

Y = (Y₁, Y₂, ..., Y_T)' ,   X = [1 X₁; 1 X₂; ... ; 1 X_T] ,   β = (β₁, β₂)' ,   ε = (ε₁, ε₂, ..., ε_T)'

So the first line has

Y₁ = β₁ + β₂X₁ + ε₁

and the t-th line has

Y_t = β₁ + β₂X_t + ε_t ,   t = 1, 2, ..., T
The estimates of β are obtained by minimising the RSS,

Σᵢ₌₁ⁿ ε_i² = Σᵢ₌₁ⁿ (y_i − x_i'β)²

So this is a basic problem of minimisation. Expanding the sum of squares:

S(β) = (y − Xβ)'(y − Xβ)
     = y'y − β'X'y − y'Xβ + β'X'Xβ
     = y'y − 2β'X'y + β'X'Xβ      (3.4)
Equation (3.4) is the statement of the RSS equivalent to the equation in scalar form. To obtain our OLS estimates we must find the β vector that minimises the RSS (in keeping with previous notation we will call the sample estimate b). Differentiating with respect to β and setting equal to zero we get

∂S(β)/∂β = −2X'y + 2X'Xβ = 0      (3.5)

Solving equation (3.5) for β we have:

(X'X)b = X'y

where b is the solution value of β. Solving explicitly for b yields

b = (X'X)⁻¹X'y      (3.6)

which is the matrix equivalent of (2.6).
Note that (X’X)-1 must exist (the full rank condition).
We know this solution is a minimum because if we take the second derivative of the RSS with respect to β (again referring to the solution estimates as b) we get:

∂²S(b)/∂b∂b' = 2X'X

Being a quadratic form, this matrix is at least positive semi-definite; since the inverse (X'X)⁻¹ exists, it is in fact positive definite, and hence the solution is a minimum.
How does this compare with the simple model coefficients? For the bivariate model,

X'X = [ T        ΣX_t  ]
      [ ΣX_t     ΣX_t² ]

so that

(X'X)⁻¹ = 1/(TΣX_t² − (ΣX_t)²) · [ ΣX_t²   −ΣX_t ]
                                  [ −ΣX_t     T   ]

        = 1/(TΣ(X_t − X̄)²) · [ ΣX_t²   −ΣX_t ]
                              [ −ΣX_t     T   ]

and

X'Y = [ ΣY_t    ]
      [ ΣX_tY_t ]

Therefore, combining the two expressions, we have:

b = (X'X)⁻¹X'Y = 1/(TΣ(X_t − X̄)²) · [ ΣX_t²ΣY_t − ΣX_tΣX_tY_t ]
                                     [ TΣX_tY_t − ΣX_tΣY_t     ]

Taking the two elements separately, the slope is

b₂ = [TΣX_tY_t − ΣX_tΣY_t] / [TΣ(X_t − X̄)²] = Σ(X_t − X̄)(Y_t − Ȳ) / Σ(X_t − X̄)²

and, rearranging the first element, it can be shown that

b₁ = [ΣX_t²ΣY_t − ΣX_tΣX_tY_t] / [TΣ(X_t − X̄)²] = Ȳ − b₂X̄

which are equivalent to (2.6).
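A compact Python sketch of the matrix formulae b = (X'X)⁻¹X'y and V(b) = s²(X'X)⁻¹ on simulated data (illustrative only; the slope is also checked against the Cov/Var formula):

import numpy as np

rng = np.random.default_rng(4)
T = 200
x = rng.normal(size=T)
y = 5.0 + 2.0 * x + rng.normal(size=T)

X = np.column_stack([np.ones(T), x])             # T x 2 design matrix
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                            # (X'X)^{-1} X'y

e = y - X @ b
s2 = e @ e / (T - 2)                             # s^2 = e'e/(n - K)
se = np.sqrt(np.diag(s2 * XtX_inv))              # standard errors of b1, b2

print(b)                                                    # close to [5, 2]
print(b[1], np.cov(x, y, bias=True)[0, 1] / np.var(x))      # same slope as Cov/Var
print(se)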
3.2 Least squares assumptions in multiple regression
The first 4 assumptions are identical to those imposed in the simple linear model.
- Assumption 1: The conditional mean of u is 0

E(u_i | X_1i, ..., X_ki) = E(u_i) = 0      (A3.1)

On average over the population, Y_i falls on the regression line (no systematic error). As we will see below, this is the key assumption that makes OLS unbiased.

- Assumption 2: (X_i, Y_i) are independently and identically distributed.      (A3.2)

- Assumption 3: The population variance of u is constant for all i.

Formally, this condition can be written as: Var(u_i) = σ² for all i.      (A3.3)
Of course σ² is unknown. This property is known as homoskedasticity (constant variance). This is used below to show that OLS is the Best Linear Unbiased Estimator. In the presence of heteroskedasticity, the standard errors of OLS will be biased, but not the estimates.
- Assumption 4: Normality      (A3.4)

- Assumption 5: No perfect multicollinearity

Regressors are multicollinear if one of the regressors is a linear function of the others. Algebraically, the no-multicollinearity condition can be stated as:

rank(X'X) = k      (A3.5)

i.e. the X'X matrix has full rank.
Say we have 3 regressors (plus the constant X0) and 4 observations such that

id   X0   X1   X2   X3
1    1    7    2    16
2    1    5    3    13
3    1    3    4    10
4    1    1    5    7

You can see that X3 = 2*X1 + X2.
Multicollinearity makes it impossible to calculate OLS since it leads to a division by 0.
Statistical packages will therefore not allow you to calculate OLS in these circumstances, and you
need to respecify your model.
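A minimal Python/numpy sketch of why: with the perfectly collinear column, X'X does not have full rank, so its inverse does not exist (the data are those of the small table above):

import numpy as np

X0 = np.ones(4)
X1 = np.array([7.0, 5.0, 3.0, 1.0])
X2 = np.array([2.0, 3.0, 4.0, 5.0])
X3 = 2 * X1 + X2                          # perfectly collinear regressor
X = np.column_stack([X0, X1, X2, X3])

print(np.linalg.matrix_rank(X.T @ X))     # 3, not 4: X'X is not of full rank
print(np.linalg.det(X.T @ X))             # (numerically) zero determinant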
3.3 Properties of the OLS Estimator

What are the virtues of OLS on a statistical basis? Here the discussion must address the issues of unbiasedness, efficiency, etc. Equation (3.6) is the formula for the OLS estimator. From it, it is possible to work out the expectation and variance of the estimator, b.
3.3.1 Unbiasedness

b = (X'X)⁻¹X'Y
  = (X'X)⁻¹X'(Xβ + ε)
  = (X'X)⁻¹X'Xβ + (X'X)⁻¹X'ε

As (X'X)⁻¹X'X = I, we can write

b = β + (X'X)⁻¹X'ε

Therefore, taking expectations, we have

E(b) = β + (X'X)⁻¹X'E(ε) = β      (3.7)

as E(ε) = 0 (A3.1). Equation (3.7) shows that the OLS estimator is unbiased.
3.3.2 Variance of the Estimator

V(b) = E[(b − β)(b − β)']

This is the k×k matrix whose diagonal elements are the variances V(b₁), ..., V(b_k) and whose off-diagonal elements are the covariances cov(b_i, b_j).

We know that

b − β = (X'X)⁻¹X'ε

Therefore,

(b − β)(b − β)' = (X'X)⁻¹X'εε'X(X'X)⁻¹

Taking expectations,

E[(b − β)(b − β)'] = (X'X)⁻¹X'E(εε')X(X'X)⁻¹

and as V(ε) = σ²I by assumption (A3.3), we can show

E[(b − β)(b − β)'] = (X'X)⁻¹X'(σ²I_T)X(X'X)⁻¹ = σ²(X'X)⁻¹X'X(X'X)⁻¹

or

V(b) = σ²(X'X)⁻¹      (3.8)

Hence lim V(b) = 0 as n → ∞.

This formula for the variance of the OLS estimates is valid only in the case of homoskedasticity.
Since σ² is unknown, we use an unbiased estimator s², where s² = e'e / (n − K).
* Adjusted R2

The coefficient of determination is the proportion of the variation in the dependent variable explained by the model, and is calculated as

R² = ESS/TSS = 1 − RSS/TSS ,   0 ≤ R² ≤ 1
This must increase as more variables are added to the equation, regardless of their relevance. An alternative, therefore, is the adjusted R², which is useful for comparing the fit of specifications that differ in the addition or deletion of explanatory variables. In contrast,

R̄² = 1 − [RSS/(n − k)] / [TSS/(n − 1)]      (3.9)

only increases if the t-ratio on the included variable is greater than unity. We can see the relationship between the two measures as

R̄² = 1 − (n − 1)/(n − k)·(1 − R²) = (1 − k)/(n − k) + (n − 1)/(n − k)·R²      (3.10)
i) An increase in R̄² does not necessarily mean that the added variable is significant.
ii) A high R̄² does not mean that the regressors are a true cause of the dependent variable.
iii) A high R̄² does not mean there is no omitted variable bias.
iv) A high R̄² does not mean your specification is the most appropriate.
The R̄² tells you only that the regressors are good at explaining the values of the dependent variable in your sample.
3.5 Hypothesis testing
3.5.1 Hypothesis testing for a single coefficient
After estimating a coefficient (slope of the regression), you want to assess whether this estimate is
statistically different from 0. More generally, you may want to test whether your coefficient b1 is
different from b1,0.
* Two sided tests
H0: b₁ = b₁,₀ vs H1: b₁ ≠ b₁,₀
This is known in statistics as a t-test.
t = (b₁ − b₁,₀) / SE(b₁)
Comparing t with critical values, you can assess whether H0 is rejected or not.
If |t| < tc, we cannot reject H0: b₁ is not statistically different from b₁,₀. Traditionally, econometricians rely on the 95% confidence level, so the critical value is tc = 1.96 (since we have assumed normality). Confidence levels of 90% and 99% are also used, for which the critical values are respectively 1.645 and 2.576.
Alternatively, rather than t-values, some econometricians prefer to present p-values. The p-value gives the probability of obtaining a t-statistic at least as large (in absolute value) as the one observed purely by chance, if the null hypothesis is true. So whilst when testing whether your coefficient is different from 0 you want a large t-value (greater than 1.96), you want a p-value lower than 0.05. The two methods provide the same information, but p-values are easier to interpret, as the p-value gives the probability of a Type I error (wrongly rejecting H0).
                 Decision
True             H0                 H1
H0               ok                 Type I error
H1               Type II error      ok
Type I error: Wrongly reject the true null hypothesis
Type II error: Not reject Null when false.
The higher your confidence level (99% vs 95%), the lower the risk of a Type I error (1% vs 5%), but the higher the risk of a Type II error. So at the 99% confidence level, when you accept H1 you are wrong in only 1% of cases; however, this means that you will conclude that your coefficient is not different from 0 in a larger number of cases (Type II error).
Going back to the output of Table 3.2, we want to know whether our estimates are significantly different from 0 at the 95% confidence level (probability of a Type I error of less than 5%). The school estimate has a standard error of 0.0068, so

t = (0.1525 − 0) / 0.0068 = 22.22

t > tc = 1.96, so we accept that the schooling coefficient is different from 0 at the 95% confidence level. Since t > 2.576 we also accept it at the 99% confidence level. In fact our estimate is so precise that it is significant at any conventional level.
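A tiny Python check of this calculation, using the Table 3.2 numbers and the normal critical values assumed in the notes:

b, se, b_null = 0.1525172, 0.0068646, 0.0
t = (b - b_null) / se
print(t)                          # about 22.2

for level, tc in [(0.90, 1.645), (0.95, 1.96), (0.99, 2.576)]:
    print(level, abs(t) > tc)     # rejected at every conventional level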
Table 3.2 (repeated)

      Source |       SS       df       MS           Number of obs =    3966
-------------+------------------------------        F(  2,  3963) =  382.03
       Model |  197.859286     2  98.9296432        Prob > F      =  0.0000
    Residual |  1026.25968  3963  .258960302        R-squared     =  0.1616
-------------+------------------------------        Adj R-squared =  0.1612
       Total |  1224.11896  3965  .308731138        Root MSE      =  .50888

------------------------------------------------------------------------------
        linc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      school |   .1525172   .0068646    22.22   0.000     .1390588    .1659757
     ability |   .0702532   .0089544     7.85   0.000     .0526975     .087809
       _cons |   1.793294   .0158411   113.21   0.000     1.762236    1.824351
------------------------------------------------------------------------------
* One sided t-test

Sometimes it makes more sense to test whether a coefficient is greater (smaller) than a given value rather than whether it is different from it (as in the two sided t-test).

H0: b₁ = b₁,₀ vs H1: b₁ < b₁,₀

The t-statistic is the same as in the two sided test but the critical values are different. Here is an example demonstrating the logic of a one-sided test. Suppose you received some product, which you know can be of normal or extra quality; however, it is not documented which quality you got. Each quality has a given distribution, so you decide to take a sample and compare your sample mean with the normal quality product mean (your prior here is that you were sent normal quality rather than extra).
[Figure: one-tailed tests. Probability density function of b₂ under the null hypothesis H0: μ₂ = μ₂⁰ against the alternative H1: μ₂ = μ₂¹, with 2.5% rejection regions in each tail.]

The sample mean falls in the critical region, so we reject μ₂⁰; however, it would make little sense to accept μ₂¹, since the sample mean contradicts μ₂¹ even more strongly. This is when a one-sided test is needed. So basically, we want to get rid of the left hand side critical region. By doing so, we need to increase the right hand side critical region for our confidence level to remain at 95%. So we stack all the Type I error in the right tail of the distribution. The critical values for a one-sided test at the 95% and 99% confidence levels are 1.645 and 2.326 respectively.
* Test of joint hypotheses

We want to test that all of our estimated coefficients are jointly different from 0.

H0: b₂ = 0 & b₃ = 0 & … & b_k = 0 vs H1: at least one coefficient is different from 0

The F-test is then defined as:

F(k − 1, n − k) = [SSE/(k − 1)] / [SSR/(n − k)]

where the Explained Sum of Squares is SSE = Σᵢ₌₁ⁿ (ŷ_i − ȳ)² and SSR = Σᵢ₌₁ⁿ e_i². Since

R² = SSE/SST = 1 − SSR/SST

the F-test can also be written as:

F(k − 1, n − k) = [R²/(k − 1)] / [(1 − R²)/(n − k)]

Using the Table 3.2 results, F(2, 3963) = (0.1616/2) / [(1 − 0.1616)/3963] ≈ 382 > Fc(2, 3963) ≈ 3.00 (the 5% critical value), so we reject the hypothesis that all of our estimates are null.
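A short Python sketch of this joint F-test computed from the R² (Table 3.2 values: R² = 0.1616, k = 3 estimated coefficients including the constant, n = 3966; scipy is used only for the critical value):

from scipy.stats import f

R2, k, n = 0.1616, 3, 3966
F = (R2 / (k - 1)) / ((1 - R2) / (n - k))
print(F)                               # about 382, as in the Stata output

print(f.ppf(0.95, k - 1, n - k))       # 5% critical value, about 3.00
print(f.sf(F, k - 1, n - k))           # p-value, effectively zero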
* F-test on a subset of coefficients

It is also possible to conduct hypothesis testing on a subset of coefficients.

- 2 restriction case:

H0: b₂ = 0 & b₃ = 0 vs H1: b₂ or b₃ ≠ 0

In this case

F = (1/2) · (t₂² + t₃² − 2ρ̂_{t₂,t₃}·t₂t₃) / (1 − ρ̂²_{t₂,t₃})

where t₂ and t₃ are the individual t-statistics and ρ̂_{t₂,t₃} is an estimate of their correlation. In the case of more than two restrictions the formula becomes more complicated, but this test can be done on any econometric package.
* Test of a single restriction involving multiple coefficients

H0: β₁ = β₂ vs H1: β₁ ≠ β₂

Suppose your regression has the following form:

Y_i = β₀ + β₁X₁ + β₂X₂ + u_i

To test β₁ = β₂, we add and subtract β₂X₁; the model then becomes:

Y_i = β₀ + (β₁ − β₂)X₁ + β₂(X₁ + X₂) + u_i

which can be rewritten as:

Y_i = β₀ + γ₁X₁ + β₂W + u_i ,   with γ₁ = β₁ − β₂ and W = X₁ + X₂.

Then conducting a t-test of γ₁ = 0 is equivalent to the desired test of β₁ = β₂.
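A small Python sketch of this reparameterisation on simulated data where β₁ = β₂ holds by construction (all numbers are illustrative):

import numpy as np

rng = np.random.default_rng(5)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)   # beta1 = beta2 holds

w = x1 + x2                                # regress y on x1 and w = x1 + x2
X = np.column_stack([np.ones(n), x1, w])
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - 3)
se = np.sqrt(np.diag(s2 * XtX_inv))

print(b[1] / se[1])   # t-statistic on gamma1 = beta1 - beta2; small when beta1 = beta2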
Going back to our pupil teacher ratio and test score example, we assess the effect of the specification on our conclusions; this highlights the problem of omitted variable bias. For omitted variable bias to occur, we must have:
- at least one regressor is correlated with the omitted variable
- the omitted variable is a determinant of the outcome
                         (1)          (2)          (3)          (4)          (5)
                         test score   test score   test score   test score   test score
Student teacher ratio    -2.280       -1.101       -0.998       -2.165       -1.014
                         (0.519)**    (0.433)*     (0.270)**    (0.384)**    (0.269)**
% english learner                     -0.650       -0.122                    -0.130
                                      (0.031)**    (0.033)**                 (0.036)**
% reduced meal                                     -0.547                    -0.529
                                                   (0.024)**                 (0.038)**
% income assistance                                             -1.036       -0.048
                                                                (0.076)**    (0.059)
Constant                 698.933      686.032      700.150      710.406      700.392
                         (10.364)**   (8.728)**    (5.568)**    (7.819)**    (5.537)**
Observations             420          420          420          420          420
R-squared                0.05         0.43         0.77         0.44         0.77
Robust standard errors in parentheses. * significant at 5%; ** significant at 1%.
Previously we estimated model (1). However, we think that the proportion of non-English-speaking pupils will affect both the PTR and the score. Model (2) confirms this hunch, since the coefficient on the PTR is halved; both covariates are statistically significant. Note also that the R² increases substantially, so this specification better fits the data. Adding the proportion of pupils on subsidised meals further reduces the PTR coefficient; all covariates are significant and the R² reaches 0.77. We then think that all these new covariates may be related to poverty. Adding income assistance to the base model confirms that income, PTR and test scores are correlated. However, in the complete model, % on income assistance has no significant effect; note that the R² does not improve when including this extra variable. So, due to omitted variable bias, last week's recommendation to the headteacher would have to be revised. Note also that it does not really matter which control we use for students' characteristics, since the coefficient on the PTR does not change much between models (2) and (5). This could indicate that we have efficiently dealt with the omitted variable bias problem.
Chapter 4: Non linear regression functions

So far we have assumed linearity of the regression function, so that the slope of the population regression function is constant: a unit change in X produces the same effect on Y at all points of the X distribution. This is a stringent hypothesis. What if the effect of X is not constant? We must then think of a non-linear relationship. In fact, there are two types of problems:
1) The effect of a change in X1 on Y depends on the value of X1.
2) The effect of a change in X1 on Y depends on the value of another covariate X2.
[Figure: three panels of Y against X1 - a constant slope model; a model where the slope depends on X1; a model where the slope depends on X2 (X2 = 0 vs X2 = 1).]
4.1 General strategy for modelling non-linear relationships

4.1.1 Polynomial functions

We saw last week that test scores were correlated with some measures of poverty (% of families qualifying for income support, % of non-English speakers). We now use the district median income as a measure of family background. The median income has a mean of $13,700 and ranges from $5,300 to $55,300. The correlation between the two variables is 0.71.
Regression with robust standard errors               Number of obs =     420
                                                      F(  1,   418) =  273.29
                                                      Prob > F      =  0.0000
                                                      R-squared     =  0.5076
                                                      Root MSE      =  13.387

------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      avginc |    1.87855   .1136349    16.53   0.000     1.655183    2.101917
       _cons |   625.3836   1.867872   334.81   0.000      621.712    629.0552
------------------------------------------------------------------------------
[Figure: scatter of test score against median district income with the fitted linear regression line.]
When income is very low, say under $10,000, most points lie below the OLS line; similarly, for high income (>$40,000) all points are below the OLS line. This is because, by imposing linearity, we are missing the curvature of the relationship. The relationship looks much more like a quadratic function.
Regression with robust standard errors               Number of obs =     420
                                                      F(  2,   417) =  428.52
                                                      Prob > F      =  0.0000
                                                      R-squared     =  0.5562
                                                      Root MSE      =  12.724

------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      avginc |   3.850995   .2680941    14.36   0.000      3.32401    4.377979
     avginc2 |  -.0423085   .0047803    -8.85   0.000     -.051705   -.0329119
       _cons |   607.3017   2.901754   209.29   0.000     601.5978    613.0056
------------------------------------------------------------------------------
[Figure: scatter of test score against median district income with the linear and quadratic fitted lines.]
While it looks like we have done a better job at fitting a regression line using a quadratic function, we can test this more formally. If we believe that the relationship is linear, then the coefficient on income squared (avginc2) should not be significantly different from 0.

H0: b₂ = 0 vs H1: b₂ ≠ 0

This is a two sided t-test: t = −0.0423085/0.0047803 ≈ −8.85. We reject H0; thus the quadratic is a better fit than the linear model.
What is the effect on test scores of a change in income of $1,000?

* Move from $10,000 to $11,000:

ΔŶ = Ŷ(11) − Ŷ(10) = (b₀ + b₁·11 + b₂·11²) − (b₀ + b₁·10 + b₂·10²) = 644.53 − 641.57 = 2.96

The predicted difference in the test score achieved by pupils in a district where the median income is $11,000 rather than $10,000 is 2.96 points.

* Move from $40,000 to $41,000:

Similarly, we can calculate that the difference in test score between a district where the median income is $41,000 rather than $40,000 is 0.42 points.

If we had believed in the linear model, we would have concluded that at all points of the income distribution, an increase of $1,000 leads to a point increase of 1.87.
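A quick Python check of these predicted effects, using the coefficients from the quadratic Stata output above (income measured in $1,000s, as in the regression):

b0, b1, b2 = 607.3017, 3.850995, -0.0423085

def yhat(inc):
    # predicted test score at median district income 'inc' (in $1,000s)
    return b0 + b1 * inc + b2 * inc**2

print(yhat(11) - yhat(10))     # about 2.96 points
print(yhat(41) - yhat(40))     # about 0.42 points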
- Standard error of the estimated effects

Say that we want to make a recommendation about the potential effect on test scores of a change of median income from $10,000 to $11,000; we need to compute the confidence interval around our estimated effect of 2.96 (the estimated effect ± 1.96 times its standard error). Since

ΔŶ = b₁·(11 − 10) + b₂·(11² − 10²) = b₁ + 21b₂

the required standard error is σ_ΔŶ = SE(b₁ + 21b₂).

Statistical digression: we should all know that, for a single restriction, the F statistic is the square of the t-statistic. Hence

F = t² = (b/σ_b)²   =>   σ_b = |b| / √F

so the standard error of the linear combination can be recovered from the F test of the corresponding restriction.
Quadratic functions can easily be extended to a more general polynomial function.
Say we have the following function: Y_i = b0 + b1 X_i + b2 X_i² + … + br X_i^r
First, we want to test whether a simple linear model would be appropriate. We do so by implementing an F-test:
H0: b2 = 0 & b3 = 0 & … & br = 0     vs     H1: at least one coefficient is different from 0
To do so, we first estimate the restricted model, in which we force the null hypothesis to be true. So we estimate Y = b0 + b1 X and calculate SSR_r (the residual sum of squares of the restricted model).
Then we estimate the unrestricted model, under which the alternative hypothesis is allowed: Y = b0 + b1 X + b2 X² + … + br X^r, and calculate SSR_u.
If the sum of squared residuals is sufficiently smaller in the unrestricted model than in the restricted model, the test rejects the null hypothesis.
F(q, n − k_u − 1) = [(SSR_r − SSR_u)/q] / [SSR_u/(n − k_u − 1)] = [(R²_u − R²_r)/q] / [(1 − R²_u)/(n − k_u − 1)]
where q is the number of restrictions tested (here r − 1) and k_u is the number of regressors in the unrestricted model.
Say that we think the relationship between median income and test score may be cubic. We first estimate the restricted model:
Restricted model

Regression with robust standard errors
Number of obs = 420;  F(1, 418) = 273.29;  Prob > F = 0.0000;  R-squared = 0.5076;  Root MSE = 13.387

     testscr |      Coef.   Robust Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+-----------------------------------------------------------------------
      avginc |    1.87855          .1136349   16.53    0.000     1.655183    2.101917
       _cons |   625.3836          1.867872  334.81    0.000      621.712    629.0552
Unrestricted model

Regression with robust standard errors
Number of obs = 420;  F(3, 416) = 270.18;  Prob > F = 0.0000;  R-squared = 0.5584;  Root MSE = 12.707

     testscr |      Coef.   Robust Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+-----------------------------------------------------------------------
      avginc |   5.018677          .7073504    7.10    0.000     3.628251    6.409104
     avginc2 |  -.0958052          .0289537   -3.31    0.001     -.152719   -.0388913
     avginc3 |   .0006855          .0003471    1.98    0.049     3.27e-06    .0013677
       _cons |    600.079          5.102062  117.61    0.000     590.0499     610.108
F(2,416) = [(.5584 − .5076)/2] / [(1 − .5584)/(420 − 4 − 1)] = 23.87 > Fc ≈ 3.0,
so we reject the null hypothesis that the model is linear, and carry on estimating the cubic model.
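The same joint test can be carried out directly in Stata after estimating the unrestricted (cubic) model; a minimal sketch, with avginc2 and avginc3 the squared and cubed income variables used above:
. gen avginc3 = avginc^3
. reg testscr avginc avginc2 avginc3, robust
. test avginc2 avginc3         // joint test that the quadratic and cubic terms are zero
Note that after a regression with the robust option, test reports a heteroskedasticity-robust F, so its value will differ slightly from the R²-based hand calculation above.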
What order should the polynomial have?
There is a trade-off between flexibility (higher order) and statistical precision. One solution is to start with a polynomial of order r and test whether the coefficient on the r-th power is significantly different from 0; if it is not, drop it and re-estimate. In practice, economists seldom use polynomials of order greater than 4.
4.1.2 Logarithmic function
For variables that are always positive, we can transform them into logarithms (ln x). This provides a non-linear relationship. The logarithm has some useful properties:
ln(1/x) = −ln(x)
ln(ax) = ln(a) + ln(x)
ln(x^a) = a ln(x)
Furthermore, for interpretation purposes, we often rely on the following approximation:
ln(x + Δx) ≈ ln(x) + Δx/x     when Δx/x is small
This leads to the following interpretations, depending on the model specification.
1) Y_i = b0 + b1 ln(X_i) + ε_i : a 1% change in X is associated with a change in Y of 0.01 b1.
2) ln(Y_i) = b0 + b1 X_i + ε_i : a change in X by one unit is associated with a 100 b1 % change in Y.
3) ln(Y_i) = b0 + b1 ln(X_i) + ε_i : a 1% change in X is associated with a b1 % change in Y. This is the elasticity of Y with respect to X.
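A minimal Stata sketch of the three specifications for the Californian data (lninc and lntest are the logs of median income and of the test score, the names used in the output below):
. gen lninc = ln(avginc)
. gen lntest = ln(testscr)
. reg testscr lninc, robust      // linear-log
. reg lntest avginc, robust      // log-linear
. reg lntest lninc, robust       // log-log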
Example 1: linear-log model
Y_i = β0 + β1 ln(X_i) + ε_i
[Figure: scatter of test score against median district income with the linear-log fitted values.]
Regression with robust standard errors
Number of obs = 420;  F(1, 418) = 679.70;  Prob > F = 0.0000;  R-squared = 0.5625;  Root MSE = 12.618

     testscr |      Coef.   Robust Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+-----------------------------------------------------------------------
       lninc |   36.41968          1.396943   26.07    0.000     33.67378    39.16559
       _cons |   557.8323           3.83994  145.27    0.000     550.2843    565.3803
ΔŶ = (b0 + b1 ln(X + ΔX)) − (b0 + b1 ln(X)) = b1 [ln(X + ΔX) − ln(X)] ≈ b1 (ΔX/X)
A 1% increase in income (ΔX/X = 0.01) is associated with a gain in test score of 0.36 points.
What is the predicted gain in scores from an increase in median income from $10,000 to $11,000?
ΔŶ = [557.8 + 36.42 ln(11)] − [557.8 + 36.42 ln(10)] = 3.47. Similarly, moving from $40,000 to $41,000 leads to an increase in test score of 0.90 points.
Example 2: log-linear model
ln(Y_i) = β0 + β1 X_i + ε_i
[Figure: scatter of lntest against median district income with the log-linear fitted values; lntest ranges from about 6.41 to 6.60.]
The log-linear model does not fit the data really well: at the two tails of the income distribution all observations lie below the fitted curve.
Regression with robust standard errors
Number of obs = 420;  F(1, 418) = 263.86;  Prob > F = 0.0000;  R-squared = 0.4982;  Root MSE = .02065

      lntest |      Coef.   Robust Std. Err.        t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------------
      avginc |   .0028441          .0001751    16.24    0.000     .0024999    .0031882
       _cons |   6.439362          .0028938  2225.21    0.000     6.433674    6.445051
ln(Ŷ + ΔŶ) − ln(Ŷ) = (b0 + b1 (X + ΔX)) − (b0 + b1 X) = b1 ΔX
⇒ ΔŶ/Ŷ ≈ b1 ΔX
A one-unit change in X (ΔX = 1) is associated with a 100 b1 % change in Y. Here, each additional $1,000 of median income increases the test score by about 0.28%. This model does not capture the curvature of the test score relationship: the predicted effect on test scores of moving from $10,000 to $11,000 is the same as that of moving from $40,000 to $41,000.
Example 3: log-log model
ln(Y_i) = β0 + β1 ln(X_i) + ε_i
[Figure: scatter of lntest against median district income with the log-log fitted values.]
The log-log model fits the data a bit better (higher R²) than the log-linear model, but there are still some problems at the tails of the distribution.
Regression with robust standard errors
Number of obs = 420;  F(1, 418) = 667.78;  Prob > F = 0.0000;  R-squared = 0.5578;  Root MSE = .01938

      lntest |      Coef.   Robust Std. Err.        t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------------
       lninc |    .055419          .0021446    25.84    0.000     .0512035    .0596345
       _cons |   6.336349          .0059246  1069.50    0.000     6.324704    6.347995
ln(Ŷ + ΔŶ) − ln(Ŷ) = (b0 + b1 ln(X + ΔX)) − (b0 + b1 ln(X))
⇒ ΔŶ/Ŷ ≈ b1 ΔX/X
so that
b1 = (ΔY/Y) / (ΔX/X) = (100 × ΔY/Y) / (100 × ΔX/X) = % change in Y / % change in X
b1 is the percentage change in Y associated with a 1% change in X; this is therefore the elasticity of Y with respect to X.
A 1% increase in income is associated with a 0.0554% increase in test score.
! R² can be used to compare the goodness of fit of the linear vs. linear-log models and of the log-linear vs. log-log models, but it cannot be used to compare models where the dependent variable is different.
When estimating a model where the dependent variable is in log format, it is not straightforward to predict values of Y:
Y_i = exp(β0 + β1 X_i + ε_i) = exp(β0 + β1 X_i) × exp(ε_i)
The problem is that even if E(ε_i) = 0, E(exp(ε_i)) ≠ 1. Thus, it is better to leave the predicted values in logarithmic format.
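If predictions in levels are nevertheless needed, one rough possibility (a sketch, not the only correction) is to rescale the naive prediction exp(ln Ŷ) by the sample mean of exp(residual), which corrects for E(exp(ε)) exceeding 1:
. reg lntest avginc
. predict lnyhat, xb
. predict ehat, resid
. gen expe = exp(ehat)
. su expe
. gen yhat = exp(lnyhat)*r(mean)    // "smearing"-type correction for E[exp(e)] > 1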
4.2 Interactions between variables
4.2.1 Dummy variable
It often happens that some of the factors in a regression are qualitative in nature and therefore not measurable in numerical terms:
- gender differences
- country differences
- cohort effects, …
Say we are interested in measuring the earnings differential (pay gap) between men and women. One solution would be to run regressions separately for men and women and then compare the coefficients (see section xx). Alternatively, it is possible to run a pooled regression with a gender indicator (dummy variable).
A dummy variable takes the value 1 for one category and 0 for all the others. In this simple case, we would create a dummy for men, which takes the value 1 for all men (alternatively, we could have decided: woman = 1, man = 0).
If the characteristic we are interested in has more than two possible categories (say region of a country, or quarter of the year), we can create a set of dummy variables.
! Creation of dummies can easily lead to problems of multicollinearity: in the table below, Q1 + Q2 + Q3 + Q4 equals the constant, so the full set of quarter dummies cannot be included together with the constant.
   Q1   Q2   Q3   Q4   cst
    1    0    0    0    1
    0    1    0    0    1
    0    0    1    0    1
    0    0    0    1    1
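A minimal Stata sketch (quarter, y and x are illustrative names): tabulate with the generate option creates the full set of dummies, and including all of them together with the constant reproduces the multicollinearity shown in the table, so one category must be dropped:
. tab quarter, gen(Q)        // creates Q1-Q4
. reg y x Q1 Q2 Q3 Q4        // Q1+Q2+Q3+Q4 equals the constant: Stata drops one dummy
. reg y x Q2 Q3 Q4           // equivalent specification, with quarter 1 as the reference category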
As determinants of lnwage, we rely on experience and gender.
* Pooled population
. reg lnpay emthemp

Number of obs = 299;  F(1, 297) = 53.20;  Prob > F = 0.0000;  R-squared = 0.1519;  Adj R-squared = 0.1491;  Root MSE = .50664
Model SS = 13.6560893 (df 1);  Residual SS = 76.2337868 (df 297);  Total SS = 89.8898761 (df 298)

       lnpay |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
     emthemp |   .0061503    .0008432    7.29    0.000     .0044909    .0078097
       _cons |   9.335957    .0796927  117.15    0.000     9.179123    9.492791
Each month of experience on the labour market adds 0.6% to the wage of graduates. This coefficient is significantly different from 0. Now we are interested in gender differences in wages. If we think that women are discriminated against at the point of entry to the labour market and that employers offer them lower starting wages, but that after this initial discrimination experience is rewarded similarly for men and women, then we want to fit a model where the relationship between experience and wages differs only by the intercept across genders.
The gender dummy estimates this shift in the intercept.
. reg lnpay emthemp p_gender

Number of obs = 299;  F(2, 296) = 36.17;  Prob > F = 0.0000;  R-squared = 0.1964;  Adj R-squared = 0.1910;  Root MSE = .494
Model SS = 17.6557409 (df 2);  Residual SS = 72.2341352 (df 296);  Total SS = 89.8898761 (df 298)

       lnpay |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
     emthemp |   .0055561    .0008352    6.65    0.000     .0039125    .0071997
         men |   .2352692    .0581138    4.05    0.000     .1209006    .3496378
       _cons |   9.264647    .0796763  116.28    0.000     9.107844    9.421451
Men are paid 23% more than women. This coefficient is significantly different from 0 (t = 4.05).
[Figure: scatter of lnpay against months employed, with the two parallel fitted lines (intercept shift by gender).]
Alternatively, we could have separated the sample by gender and run separate regressions for men and women.
What is the advantage of using a dummy variable?
- Pooling the sample reduces the variance of the estimated coefficients, hence smaller standard errors.
- The coefficient on experience is easily interpretable.
- The difference between the two populations is easily estimated.
What is the drawback of using a dummy variable?
- We assume that the returns to experience are the same for men and women, and that the two populations differ only by a shift in the intercept, i.e. we assume bm = bf.
This can easily be extended to models where the categorical variable takes on more than two values. One needs to define a reference category (to which the basic intercept applies) and a set of dummies for the other categories. It is often good practice to define the reference category as the dominant one (most observations). Whichever category is chosen as the reference, the R², the coefficients and standard errors on the other variables, and the F-statistic will be the same. The only changes are to the coefficients and standard errors of the dummies, and the final interpretation is not affected.
When estimating a model with a group of dummies for a categorical variable, the t-statistic for the significance of each individual dummy can be calculated, but more importantly it is useful to conduct a test of the joint explanatory power of the dummies.
As in the case of determining the order of a polynomial function, this is conducted with an F-test:
H0: b2 = 0 & b3 = 0 & … & bd = 0     vs     H1: at least one coefficient is different from 0
First, we estimate a model without the set of dummies for the categorical variable; this is the restricted model. Then we estimate the unrestricted model, with the full set of dummies.
We reject H0 if F(d, n − k_u − 1) > Fc, where
F(d, n − k_u − 1) = [(R²_u − R²_r)/d] / [(1 − R²_u)/(n − k_u − 1)]
and d is the number of dummies in the unrestricted model (number of categories − 1).
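In Stata the joint test can be performed with testparm after the unrestricted regression; a hedged sketch in which region is an illustrative categorical variable (the xi prefix expands it into _Iregion_* dummies):
. xi: reg lnpay emthemp i.region
. testparm _Iregion_*          // joint F-test that all region dummies are zero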
4.2.2 Interaction between dummy variables
Consider that we also want to control for the choice of subject at university (science vs humanities):
ln Y_i = β0 + β1 Exp_i + β2 Male_i + β3 Science_i + ε_i
So we can estimate the effect of having a scientific degree on wages, holding experience and gender constant. But we are concerned that the effect of the choice of subject on wages may differ by gender. So we estimate the following model, which includes the interaction term between sex and type of subject:
ln Y_i = β0 + β1 Exp_i + β2 Male_i + β3 Science_i + β4 (Male_i × Science_i) + ε_i
The coefficient on the interaction term estimates how the effect of a science degree on earnings differs for men relative to women. To demonstrate this mathematically, let's simplify the model to:
Y_i = β0 + β1 D1_i + β2 D2_i + β3 (D1_i × D2_i) + ε_i
E(Y_i | D1_i = d1, D2_i = 0) = b0 + b1 d1
E(Y_i | D1_i = d1, D2_i = 1) = b0 + b1 d1 + b2 + b3 d1
hence:
E(Y_i | D1_i = d1, D2_i = 1) − E(Y_i | D1_i = d1, D2_i = 0) = b2 + b3 d1
Thus the effect of D2 depends on the value of D1:
If for individual i, D1 = d1 = 0, then the effect of D2 on Y is b2.
If D1 = d1 = 1, then the effect of D2 on Y is b2 + b3.
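A hedged Stata sketch of this specification (male and science are illustrative 0/1 variables; lnpay and emthemp follow the notes):
. gen malesci = male*science              // interaction dummy
. reg lnpay emthemp male science malesci
. lincom science + malesci                // effect of a science degree for men (b3 + b4)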
4.2.3 Interactions between a continuous and a binary (dummy) variable
Say that we are concerned that gender discrimination does not stop at the recruitment process, but that the return to experience also varies by gender. By including the gender dummy G, we allowed the two regression lines to have different intercepts but forced them to remain parallel. By adding an interaction term, we allow the two slopes to differ. So we want to estimate:
ln Y_i = β0 + β1 X_i + β2 G_i + β3 (X_i × G_i) + ε_i
For women, G_i = 0, and the regression line is given by: ln Y_i = b0 + b1 X_i
For men, G_i = 1, and the regression line is: ln Y_i = (b0 + b2) + (b1 + b3) X_i
[Figure: scatter of lnpay against months employed, with gender-specific fitted lines (different intercepts and slopes).]
. reg lnpay emthemp p_gender empmal

Number of obs = 299;  F(3, 295) = 25.49;  Prob > F = 0.0000;  R-squared = 0.2058;  Adj R-squared = 0.1978;  Root MSE = .49192
Model SS = 18.5032689 (df 3);  Residual SS = 71.3866072 (df 295);  Total SS = 89.8898761 (df 298)

       lnpay |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
     emthemp |   .0038518    .0012333    3.12    0.002     .0014246     .006279
        male |  -.0367646    .1564554   -0.23    0.814    -.3446747    .2711454
      empmal |   .0031257    .0016702    1.87    0.062    -.0001613    .0064126
       _cons |   9.403501    .1086282   86.57    0.000     9.189717    9.617286
The model predicts that men in fact have lower starting wages than women: the coefficient on male is negative (though insignificant) and represents the shift in the intercept, which in this simple model is the log wage of an individual with 0 months of experience. However, for women each month of experience adds about 0.39% to the wage, whilst for men it adds about 0.70% (.00385 + .00313). The interaction term is significantly different from 0 at the 10% level, so we reject the assumption that the slope of the returns to experience is the same for men and women. Alternatively, we could estimate a model where we impose that the two regression lines have different slopes but the same intercept.
Note that we obtain the same coefficients when we estimate the model separately by gender:
. reg lnpay emthemp p_gender empmal
(output identical to the regression reported above)

. reg lnpay emthemp if p_gender==0

Number of obs = 142;  F(1, 140) = 8.94;  Prob > F = 0.0033;  R-squared = 0.0600;  Adj R-squared = 0.0533;  Root MSE = .51388
Model SS = 2.36041929 (df 1);  Residual SS = 36.9701933 (df 140);  Total SS = 39.3306126 (df 141)

       lnpay |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
     emthemp |   .0038518    .0012883    2.99    0.003     .0013047    .0063989
       _cons |   9.403501    .1134768   82.87    0.000     9.179152    9.627851

. reg lnpay emthemp if p_gender==1

Number of obs = 157;  F(1, 155) = 41.83;  Prob > F = 0.0000;  R-squared = 0.2125;  Adj R-squared = 0.2074;  Root MSE = .47121
Model SS = 9.28773401 (df 1);  Residual SS = 34.4164139 (df 155);  Total SS = 43.7041479 (df 156)

       lnpay |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
     emthemp |   .0069774    .0010788    6.47    0.000     .0048463    .0091086
       _cons |   9.366737     .107857   86.84    0.000     9.153677    9.579796
Model specification summary:
1) Different intercept, same slope:      Y_i = β0 + β1 X_i + β2 D_i + ε_i
2) Different intercept, different slope: Y_i = β0 + β1 X_i + β2 D_i + β3 (X_i × D_i) + ε_i
3) Same intercept, different slope:      Y_i = β0 + β1 X_i + β3 (X_i × D_i) + ε_i
Some econometricians do not recommend estimating model 3 and advocate that the primary elements of the interaction term should always be included in the model.
4.2.4 Interaction between 2 continuous variables
Nothing limits us to interactions with dummy variables only. For example, going back to our Californian schools example, we may believe that the effect of the pupil-teacher ratio (PTR) on test scores differs depending on the percentage of non-English speakers in the school. We previously estimated a model controlling for the percentage of English learners, but we still imposed that the effect of the PTR was independent of this proportion. To drop this assumption, we interact the PTR with the percentage of English learners.
Regression with robust standard errors
Number of obs = 420;  F(3, 416) = 155.05;  Prob > F = 0.0000;  R-squared = 0.4264;  Root MSE = 14.482

     testscr |      Coef.   Robust Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+-----------------------------------------------------------------------
         str |  -1.117018          .5875135   -1.90    0.058    -2.271884    .0378468
      el_pct |  -.6729116          .3741231   -1.80    0.073    -1.408319    .0624958
      streng |   .0011618          .0185357    0.06    0.950    -.0352736    .0375971
       _cons |   686.3385          11.75935   58.37    0.000     663.2234    709.4537
When the percentage of English learners is at the median (el_pct = 8.85), the slope of the line relating test scores to str is −1.11 (= −1.12 + .0012 × 8.85). When the percentage of English learners is at the 75th percentile (el_pct = 23), the slope is −1.09. For a district with 8.85% English learners, reducing the str by one unit improves test scores by 1.11 points, but the same change in str improves test scores by only 1.09 points in a district with 23% English learners.
4.3 Chow test
If your data consist of two or more subsamples (by gender, cohort, …), you may want to test whether it is appropriate to conduct the analysis on the pooled sample (P) or separately for the different subsamples (1 & 2). This is just a special case of an F-test. The unrestricted model estimates the coefficients separately for the two subsamples. Since each subsample regression minimises the RSS for its own observations, the separate regressions must fit the data at least as well as the pooled (restricted) regression. There is a price to pay for this improvement in fit: since twice as many coefficients need to be estimated, k extra degrees of freedom are used up in the separate estimation.
Chow test: H0: b1 = b2   vs   H1: b1 ≠ b2
F(k, n − 2k) = [(RSS_p − (RSS_1 + RSS_2))/k] / [(RSS_1 + RSS_2)/(n − 2k)]
If F(k, n − 2k) > Fc, we reject H0 and conclude that it is more appropriate to estimate the model for the two populations separately (different slope, different intercept).
Application to the gender wage gap example:
F(2, 299 − 4) = [(76.23 − (36.97 + 34.41))/2] / [(36.97 + 34.41)/(299 − 4)] = [(76.23 − 71.38)/2] / [71.38/295] = 2.425/0.242 ≈ 10.0 > Fc ≈ 3
So we reject H0 and estimate the model separately by gender.
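The three residual sums of squares are stored by Stata after each regression in e(rss), so the Chow statistic can be computed by hand; a minimal sketch using the regressions above:
. reg lnpay emthemp
. scalar rssp = e(rss)
. reg lnpay emthemp if p_gender==0
. scalar rss1 = e(rss)
. reg lnpay emthemp if p_gender==1
. scalar rss2 = e(rss)
. display "Chow F(2,295) = " ((rssp-(rss1+rss2))/2) / ((rss1+rss2)/295)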
/*
A Chow test is simply a test of whether the coefficients estimated over one group of the data are
equal to the coefficients estimated over another, and you would be better off to forget the word
Chow and remember that definition.
Let's pretend you have some model and two or more groups of data. Your model predicts
something about the behaviour within the group based on certain characteristics that vary within
the group. Under the assumption that each group's behavior is unique, you have
y1 = X1*b1 + u1      (equation for group 1)
y2 = X2*b2 + u2      (equation for group 2)
Now, you want to test whether the behavior for one group is the same as for another, which
means you want to test
H0: b1 = b2 = …   vs   H1: b1 ≠ b2
Testing coefficients across separately estimated models is difficult to impossible, depending on
things we need not go into right now. A trick is to "pool" the data to convert the multiple
equations into one giant equation, so that we can use the tools that we know about:
y = d1*(X1*b1 + u1) + d2*(X2*b2 + u2)
where y is the set of all outcomes (y_1, y_2), and d1 is a variable that is 1 when the data are for
group 1 and 0 otherwise, d2 is 1 when the data are for group 2 and 0 otherwise, ....
Rewriting the model a little bit:
y =d1*X1*b1 + d2*X2*b2 + d1*u1 + d2*u2
= (X1*d1)*b1 + (X2*d2)*b2 + d1*u1 + d2*u2
By stacking the data, I can get back estimates of b1, b2, ...
I include not X_1 in my model, but X_1*d1 (a set of variables equal to X_1 when
group is 1 and 0 otherwise); I include not X_2 in my model, but X_2*d2 (a set of
variables equal to X_2 when group is 2 and 0 otherwise); and so on.
. regress y group1 attitude1 price1 group2 attitude2 price2, nocons
What is this nocons option? We must remember that when we estimate the separate models, each
has its own intercept. There was an intercept in X_1, X_2, and so on. What I have done above is
literally translate
y = (X_1*d1)*b1 + (X_2*d2)*b2 + d1*u1 + d2*u2
and so included the variables group1 and group2 (variables equal to 1 for their respective groups)
and told Stata to omit the overall intercept.
I do not recommend you estimate the model the way I have just illustrated because
of numerical concerns -- we'll get to that -- but I do recommend you try it.
Estimate the models separately or jointly, and you will get the same estimates for
b_1 and b_2.
Now we can test whether the coefficients are the same for the two groups:
. test _b[attitude1]==_b[attitude2]
. test _b[price1]==_b[price2], accum
That is the Chow test. Notice that in the Chow test something was omitted: the intercept. If we
really wanted to test whether the two groups were exactly the same, we would type
. test _b[attitude1]==_b[attitude2]
. test _b[price1]==_b[price2], accum
. test _b[group1]==_b[group2], accum
Using this approach, however, we are not tied down by what the "Chow test" is able to test. We can
formulate any hypothesis we want. We might think that price works with same way in both groups,
but that attitude works differently, and each group has its own intercept. In that case, we could test
. test _b[attitude1]==_b[attitude2]
by itself. If we had more variables, we could test any subset of variables.
Is "pooling the data justified"? Of course it is: we just established that pooling the
data is just another way of estimating separate models and that estimating separate
models is certainly justified -- note that we got the same coefficients. That's why I
told you to forget the phrase about whether pooling the data is justified. People
who ask that don't really mean to ask what they are saying: they mean to ask
whether the coefficients are the same. In that case, they should say that. Pooling is
always justified, and it corresponds to nothing more than the mathematical trick of
writing separate equations,
y_1 = X_1*b_1 + u_1      (equation for group 1)
y_2 = X_2*b_2 + u_2      (equation for group 2)
as one equation
y = (X_1*d1)*b1 + (X_2*d2)*b2 + d1*u1 + d2*u2
There are a large number of ways I can write the above equation, and I want to write it a little
differently because of numerical concerns. Starting with
y = (X_1*d1)*b1 + (X_2*d2)*b2 + d1*u1 + d2*u2
let's do some algebra to obtain
y = X*b1 + (X_2*d2)*(b2-b1) + d1*u1 + d2*u2
where X = (X_1, X_2). In this formulation, I measure not b1 and b2, but b1 and (b2-b1). This is
numerically more stable, and I can still test that b2==b1 by testing whether (b2-b1)=0. To estimate
this model, I type
. regress y attitude price attitude2 price2 group2
and, if I want to test whether the coefficients are the same, I type
. test _b[attitude2]=0
. test _b[price2]=0, accum
and that will give the same answer yet again. Try it. If I want to test whether *ALL* the coefficients
are the same (including the intercept), I type
. test _b[attitude2]=0
. test _b[price2]=0, accum
. test _b[group2]=0, accum
Just as before, I can test any subset.
Using this difference formulation, if I had three groups, starting with
y = (X_1*d1)*b1 + (X_2*d2)*b2 + (X_3*d3)*b3 + d1*u1 + d2*u2 + d3*u3
I would write it as
y = X*b1 + (X_2*d2)*(b2-b1) + (X_3*d3)*(b3-b1) + d1*u1 + d2*u2 + d3*u3
and so estimate
. regress y attitude price attitude2 price2 group2 /*
*/ attitude3 price3 group3
and then if I wanted to test whether the three groups were the same in the Wald-test sense, I would
type
. test _b[attitude2]=0
. test _b[price2]=0, accum
. test _b[group2]=0, accum
. test _b[attitude3]=0, accum
. test _b[price3]=0, accum
. test _b[group3]=0, accum
which I could more easily type as
. testparm attitude2 price2 group2 attitude3 price3 group3
*/
4.4 Oaxaca decomposition
Estimating the model separately for the two populations, we now want to compare the results. Formally, a Mincer equation is estimated separately for each gender:
ln w_ig = X_ig β_g + ε_ig      (A1)
The left-hand side of (A1) is the log wage of individual i of gender g, whose determinants are included in a vector X_ig; β_g is the estimated vector of returns to the characteristics X_ig, and ε_ig is an individual error term. The average gender gap in earnings is decomposed into the mean difference in observed characteristics and the difference in the returns to these characteristics:
Δ = ln w̄_m − ln w̄_f = (X̄_m − X̄_f) β_f + (β_m − β_f) X̄_m      (A2)
The decomposition (A2) can be expressed at the mean characteristics of men (m) or of women (f).^a1 The first term of (A2) is the part of the gender pay gap that can be explained by the differences in the observed characteristics of the two groups. The second term, the unexplained component, is the portion of the gap that is due to differences in the returns to characteristics between the two groups. If all the determinants of earnings were observed in (A1), this would be equivalent to a discrimination effect^a2, i.e. gender differences in the returns to observed characteristics would be due to the discriminatory behaviour of employers. As typically not all the determinants in (A1) are observable, we refer to the second term of (A2) as the unexplained component. Introducing extra variables in the vector X reduces the unexplained part of the gender wage gap.

a1: The choice of gender with which to evaluate the decomposition (A2) is not without effect on the results, and alternative decompositions avoiding the bias of choosing one group rather than the other have been proposed (see Cotton, 1988). This debate is beyond the scope of this course, and we only report results evaluated at the mean characteristics of the female population.
In our simple example of the gender wage gap:
. su lnpay emthemp

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+---------------------------------------------------------
       lnpay |     299    9.876527    .5492212   7.600903    11.0021
     emthemp |     299    87.89298     34.8063          0        135

. su lnpay emthemp if p_gender==0

       lnpay |     142    9.717314    .5281482   7.600903    11.0021
     emthemp |     142    81.47183    33.59093         19        133

. su lnpay emthemp if p_gender==1

       lnpay |     157    10.02053    .5292965   7.600903    11.0021
     emthemp |     157    93.70064    34.97004          0        135

Δ = ln w̄_m − ln w̄_f = 10.02 − 9.72 = 0.30
Explained component:   (X̄_m − X̄_f) β_f = (93.70 − 81.47) × 0.00385 ≈ 0.047
Unexplained component: (β_m − β_f) X̄_m = (0.0070 − 0.0039) × 93.70 + (9.37 − 9.40) ≈ 0.26
so Δ ≈ 0.047 + 0.26 ≈ 0.30: most of the gender pay gap is not explained by the difference in labour market experience.
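A minimal Stata sketch of this decomposition, computed by hand from the regressions and means above (scalar arithmetic only; dedicated decomposition routines exist but are not needed here):
. reg lnpay emthemp if p_gender==0
. scalar bf = _b[emthemp]
. scalar af = _b[_cons]
. reg lnpay emthemp if p_gender==1
. scalar bm = _b[emthemp]
. scalar am = _b[_cons]
. su emthemp if p_gender==0
. scalar xf = r(mean)
. su emthemp if p_gender==1
. scalar xm = r(mean)
. display "explained   = " (xm - xf)*bf
. display "unexplained = " (bm - bf)*xm + (am - af)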
a2: Omitted variables in (A1) bias the estimated β_g differently for the two genders, leading to a perceived differential between genders in the returns to the observed characteristics.
Chapter 5: Heteroskedasticity and data problems
We have highlighted so far that the properties of the OLS estimates depend on the properties of the disturbance term in the regression analysis. So what happens to our estimates when the conditions imposed on the disturbance term are not satisfied?
5.1 Heteroskedasticity
So far we have assumed that:
∀i,  σ²_ui = σ²_u
This is known as homoskedasticity.
How can an observation have a variance? Since the observation only takes one value, we mean its potential behaviour before the sample is generated.
When we write the model
Y = β0 + β1 X + u
we state that the disturbance terms u1, …, un are drawn from probability distributions that have 0 mean and the same variance, although their actual values are random. Homoskedasticity states that the probability of the disturbance reaching any given positive (or negative) value is the same for all observations.
[Figures: homoskedasticity and heteroskedasticity. In each diagram the regression line Y = β1 + β2 X is drawn, with the distribution of the disturbance term shown at X1, …, X5; under homoskedasticity the distributions have equal variance, under heteroskedasticity the spread grows with X.]
If there were no disturbance term in the model, all observations would lie on the regression line. The effect of the disturbance is to shift each observation upwards or downwards. The potential distribution of the disturbance term is shown by the normal distributions drawn at each point. In the first case we have homoskedasticity: each observation has a disturbance term drawn from a distribution with the same variance.
However, in some cases we may think that the variance of the disturbance term is a function of a covariate X, so that the observations tend to lie close to the regression line for small values of X and far away from it for large values of X.
Example: the relationship between manufacturing value added and GDP (1994). When GDP is large, a 1% variation in manufacturing value added makes a great deal more difference in dollar terms than when GDP is small; hence variations in manufacturing output (expressed in $) are larger for large values of GDP. (We should really have fitted a log-log model.)
[Figure: manufacturing value added against GDP, 1994; the dispersion of the observations around the fitted line increases with GDP.]
Heteroskedasticity does:
- not bias the OLS estimates (the proof of unbiasedness does not rely on the homoskedasticity assumption);
- not bias R² (R² = 1 − σ̂²_u/σ̂²_y): since both variances are estimated unconditionally, they are not affected by bias in the conditional variance;
- bias the standard errors of the OLS estimates (so there is no valid testing).
Reminder:
b = (X'X)⁻¹ X'Y
b − β = (X'X)⁻¹ X'ε
Therefore,
(b − β)(b − β)' = (X'X)⁻¹ X' εε' X (X'X)⁻¹
and taking expectations,
E[(b − β)(b − β)'] = (X'X)⁻¹ X' E(εε') X (X'X)⁻¹
Assuming homoskedasticity we had V(ε) = σ²I, so that
E[(b − β)(b − β)'] = (X'X)⁻¹ X' σ²I X (X'X)⁻¹ = σ²(X'X)⁻¹
With heteroskedasticity, E(εε') = Σ ≠ σ²I, and
E[(b − β)(b − β)'] = (X'X)⁻¹ X' Σ X (X'X)⁻¹
which is no longer equal to σ²(X'X)⁻¹: the usual standard error formula is wrong.
5.2 Detection of heteroskedasticity
Numerous tests have been devised; we concentrate here on four that hypothesise a relationship between the variance of the disturbance term and the size of an explanatory variable.
5.2.1 Spearman rank correlation test
This test looks for a correlation between the absolute size of the residuals and the size of X in an OLS regression. Both X and e are ranked and we define D_i as the difference between the rank of X and the rank of e for observation i. The rank correlation coefficient is
r_Xe = 1 − 6 Σ_{i=1..n} D_i² / [n(n² − 1)]
Under H0 (homoskedasticity) the population correlation coefficient is 0, and r_Xe is distributed N(0, 1/(n−1)). The null hypothesis of homoskedasticity is rejected at the 5% level if |r_Xe √(n−1)| > 1.96.
From the above example, we get Σ D² = 1608, so r_Xe = 1 − (6 × 1608)/(28 × 783) = .56.
The test statistic is then t = .56 × √27 = 2.91.
t > tc so we reject H0 (homoskedasticity); we would also reject at the 1% level (tc = 2.58).
If there is more than one explanatory variable in the model, the test may be performed with any one of them.
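In Stata the test amounts to computing the rank correlation between the absolute residuals and the regressor; a hedged sketch with illustrative names y and x:
. reg y x
. predict e, resid
. gen abse = abs(e)
. spearman abse x          // rank correlation between |residual| and x, with its p-value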
5.2.2 Goldfeld-Quandt test
The idea is to split the sample and run regressions using only the tails of the distribution of X. If there is heteroskedasticity (with the variance increasing in X), the RSS in the second subsample will be larger than in the first one.
Formally, the observations are ranked by X and two subsamples (the first n' and the last n' observations) are created. The ratio RSS2/RSS1 is distributed as an F(n'−k, n'−k). Goldfeld and Quandt recommend using n' = 3n/8.
H0: RSS2 not greater than RSS1     vs.     H1: RSS2 > RSS1
If F = RSS2/RSS1 > Fc, reject H0 and accept heteroskedasticity.
Remark: if the standard deviation of the disturbance term is inversely related to X, the test statistic becomes RSS1/RSS2.
In the previous example, RSS1 = 157 and RSS2 = 13,518, so F = 86.1 > F95(9,9) = 3.18. Reject H0.
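A hedged sketch of the mechanics in Stata (y and x are illustrative names; n' is taken to be 3n/8 as recommended, and e(rss) stores the residual sum of squares after each regression):
. sort x
. count
. scalar nprime = floor(3*r(N)/8)
. reg y x if _n <= nprime
. scalar rss1 = e(rss)
. reg y x if _n > _N - nprime
. scalar rss2 = e(rss)
. display "Goldfeld-Quandt F = " rss2/rss1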
5.2.3 Glejser test
Here we do not assume that σ_ui is simply proportional to X_i; instead we investigate various alternative functional forms:
|ê_i| = γ0 + γ1 X_i^h
H0: γ1 = 0 (homoskedasticity)     vs.     H1: γ1 ≠ 0
Say you try values of the power h between −2 and 2. You then pick the specification with the best fit (highest R²) and test H0.
5.2.4 Breusch-Pagan test
Regress the squared residuals on the explanatory variables: û² = δ0 + δ1 X1 + … + δk Xk + ν.
The F statistic of this regression has an F(k, n−k−1) distribution under the null hypothesis of homoskedasticity.
H0: homoskedasticity     vs.     H1: heteroskedasticity
5.3 Robust standard errors
It is possible to correct the standard errors so that they are robust to heteroskedasticity of unknown form (White, 1980).
As seen in Lecture 2,
σ²_b1 = Σ_{i=1..n} (x_i − x̄)² σ_i² / (n σ̂²_x)²     (5.1)
White (1980) shows that the squared residual of observation i can be used as an estimator of σ_i². Thus the White (also called Huber-White, or simply robust) variance is given by
σ̂²_b1 = Σ_{i=1..n} (x_i − x̄)² û_i² / (n σ̂²_x)²     (5.2)
The argument hinges on the fact that, asymptotically, (5.2) converges towards (5.1) (by the central limit theorem).
One may wonder whether to use heteroskedasticity-consistent (robust) standard errors all the time. In fact, readers of Stock and Watson are advised to do so. One reason you may want to be cautious about using robust standard errors in all cases is that the robust standard error formula converges towards the true standard error only in large samples; in small samples, the bias induced by robust standard errors may be quite large.
Alternatively, you may decide to report both standard errors, and let the reader decide whether your conclusions are sensitive to the standard error used (Wooldridge).
The best strategy is to test for heteroskedasticity and, if it is present, use robust standard errors.
Californian school case:
* Goldfeld-Quandt test:
. sort str
. reg testscr str avginc avginc2 if _n<=157

Number of obs = 157;  F(3, 153) = 71.59;  Prob > F = 0.0000;  R-squared = 0.5840;  Adj R-squared = 0.5758;  Root MSE = 13.177
Model SS = 37291.1283 (df 3);  Residual SS = 26564.4464 (df 153);  Total SS = 63855.5747 (df 156)

     testscr |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
         str |  -.3953029    .9133449   -0.43    0.666    -2.199698    1.409092
      avginc |   3.227624    .4699859    6.87    0.000     2.299125    4.156124
     avginc2 |  -.0327779    .0088873   -3.69    0.000    -.0503356   -.0152203
       _cons |    623.442    16.63393   37.48    0.000     590.5802    656.3039

. reg testscr str avginc avginc2 if _n>=263

Number of obs = 158;  F(3, 154) = 47.02;  Prob > F = 0.0000;  R-squared = 0.4781;  Adj R-squared = 0.4679;  Root MSE = 13.004
Model SS = 23852.7352 (df 3);  Residual SS = 26041.0929 (df 154);  Total SS = 49893.828 (df 157)

     testscr |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
         str |   .6105634    .9286054    0.66    0.512    -1.223885    2.445012
      avginc |   4.252809    1.309941    3.25    0.001     1.665037    6.840581
     avginc2 |  -.0459702    .0420159   -1.09    0.276    -.1289721    .0370317
       _cons |   587.3276     23.1539   25.37    0.000     541.5874    633.0679
F(157-4,157-4)=RSS1/RSS2=1.02
Fc(120,120)=1.35, Fc(inf,inf)=1
No strong evidence of heteroskedasticity.
* Breusch-Pagan test
. reg testscr str avginc avginc2

Number of obs = 420;  F(3, 416) = 179.23;  Prob > F = 0.0000;  R-squared = 0.5638;  Adj R-squared = 0.5607;  Root MSE = 12.629
Model SS = 85759.8879 (df 3);  Residual SS = 66349.7057 (df 416);  Total SS = 152109.594 (df 419)

     testscr |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
         str |  -.9099512    .3373245   -2.70    0.007    -1.573024   -.2468782
      avginc |   3.881859     .302214   12.84    0.000     3.287802    4.475916
     avginc2 |   -.044157    .0062511   -7.06    0.000    -.0564448   -.0318692
       _cons |   625.2308    7.301822   85.63    0.000     610.8777    639.5839
. predict temp, resid
. gen e2 = temp^2
. reg e2 str avginc avginc2

Number of obs = 420;  F(3, 416) = 2.12;  Prob > F = 0.0974;  R-squared = 0.0150;  Adj R-squared = 0.0079;  Root MSE = 233.34
Model SS = 345803.306 (df 3);  Residual SS = 22650470.8 (df 416);  Total SS = 22996274.1 (df 419)

          e2 |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
         str |   3.403189    6.232568    0.55    0.585    -8.848063    15.65444
      avginc |  -4.656416    5.583849   -0.83    0.405    -15.63249     6.31966
     avginc2 |   .0211701    .1154992    0.18    0.855    -.2058646    .2482049
       _cons |   156.3866    134.9119    1.16    0.247    -108.8074    421.5807
F(3,416) = 2.12 with a p-value of 0.097, so F < Fc at the 5% level and we cannot reject the null hypothesis of homoskedasticity.
Stata can directly implement the chi-squared version of this test: just type hettest after the initial regression (need to check H0 and the p-value there…)
. reg testscr str avginc avginc2

Number of obs = 420;  F(3, 416) = 179.23;  Prob > F = 0.0000;  R-squared = 0.5638;  Adj R-squared = 0.5607;  Root MSE = 12.629
Model SS = 85759.8879 (df 3);  Residual SS = 66349.7057 (df 416);  Total SS = 152109.594 (df 419)

     testscr |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
         str |  -.9099512    .3373245   -2.70    0.007    -1.573024   -.2468782
      avginc |   3.881859     .302214   12.84    0.000     3.287802    4.475916
     avginc2 |   -.044157    .0062511   -7.06    0.000    -.0564448   -.0318692
       _cons |   625.2308    7.301822   85.63    0.000     610.8777    639.5839
. reg testscr str avginc avginc2, robust

Regression with robust standard errors
Number of obs = 420;  F(3, 416) = 286.55;  Prob > F = 0.0000;  R-squared = 0.5638;  Root MSE = 12.629

     testscr |      Coef.   Robust Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+-----------------------------------------------------------------------
         str |  -.9099512          .3545374   -2.57    0.011    -1.606859   -.2130432
      avginc |   3.881859          .2709564   14.33    0.000     3.349245    4.414474
     avginc2 |   -.044157          .0049606   -8.90    0.000     -.053908     -.034406
       _cons |   625.2308          7.087793   88.21    0.000     611.2984    639.1631
Not surprisingly, in this case the difference between the robust and the non-robust standard errors is not very large, since there is not much heteroskedasticity (check the data as well).
[Figure: test score against student-teacher ratio.]
5.4 Stochastic regressors
So far we have assumed that the regressors are non-stochastic, i.e. they do not have a random component: their values are fixed and are not affected by the way the sample is selected.
Remember that we know:
b2 = β2 + cov(X, u)/var(X)
E(b2) = β2 + E[cov(X, u)/var(X)]
If X is non-stochastic, this equals β2 + E[cov(X, u)]/var(X) = β2.
However, if X is stochastic, we no longer have E(var(X)) = var(X). Can we still prove that OLS is an unbiased estimator?
cov(X, u)/var(X) = [(1/n) Σ_{i=1..n} (X_i − X̄)(u_i − ū)] / var(X) = (1/n) Σ_{i=1..n} [(X_i − X̄)/var(X)] (u_i − ū)
E[cov(X, u)/var(X)] = (1/n) Σ_{i=1..n} E[(X_i − X̄)/var(X)] E(u_i − ū) = 0
since E(u_i − ū) = 0 and, provided X is distributed independently of u, the expectation of the product factorises.
So b2 is still an unbiased estimator.
5.5 Omitted variables and proxy variables
5.5.1 Omitted variable bias
Say that the true model is:
Y = β0 + β1 X1 + β2 X2 + u
However, you estimate:
Ŷ = b0 + b1 X1      (5.3)
where b1 = cov(X1, Y)/var(X1), rather than the correct expression
b1 = [cov(X1, Y) var(X2) − cov(X2, Y) cov(X1, X2)] / [var(X1) var(X2) − cov(X1, X2)²]
The OLS estimate b1 is then biased:
b1 = cov(X1, β0 + β1 X1 + β2 X2 + u)/var(X1)
   = [cov(X1, β0) + β1 var(X1) + β2 cov(X1, X2) + cov(X1, u)] / var(X1)
   = β1 + β2 cov(X1, X2)/var(X1) + cov(X1, u)/var(X1)
so if X1 and X2 are non-stochastic: E(b1) = β1 + β2 cov(X1, X2)/var(X1)
In case of an omitted variable, OLS will be biased unless cov(X1, X2) = 0 (or β2 = 0).
In a multivariate case, it is more difficult to predict the impact of omitted variable bias mathematically.
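A small simulation sketch illustrating the formula (all numbers are illustrative): with cov(X1,X2) = 0.5·var(X1) and β2 = 1, the slope of the short regression should converge to β1 + 0.5:
. clear
. set seed 12345
. drawnorm x1 v e, n(5000)          // independent standard normals
. gen x2 = 0.5*x1 + v               // so cov(x1,x2) = 0.5*var(x1)
. gen y = 1 + x1 + x2 + e           // true model: beta1 = beta2 = 1
. reg y x1 x2                       // both slopes close to 1
. reg y x1                          // omitted variable bias: slope close to 1.5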
5.5.2 Adding non-necessary variables
Suppose the true model is Y = β0 + β1 X1 + u but you estimate Y = β0 + β1 X1 + β2 X2 + u.
The OLS estimate will still be unbiased, but it is no longer efficient since
σ²_b1 = [σ²_u / (n var(X1))] × 1/(1 − r²_{x1,x2})     rather than     σ²_b1 = σ²_u / (n var(X1))
Furthermore, adding more covariates may lead to problems of multicollinearity: if the covariates are highly correlated, the variance of the estimates will be large.
5.5.3 Proxy variables
If a variable that you would like to introduce in your analysis is missing, rather than ignoring it altogether (omitted variable bias), it may be possible to use a proxy variable. A proxy variable is correlated with the unobserved variable; say the relationship is of the following form:
X2 = λ0 + λ1 Z
So the estimated model becomes:
Y = β0 + β1 X1 + β2 (λ0 + λ1 Z)
- b1 will be the same as if X2 had been used in the regression.
- R² will be the same as if X2 had been used in the regression.
- bZ is an estimate of β2 λ1, not of β2.
- The t-statistic for Z is the same as the one that would have been obtained for X2.
- The intercept is an estimate of β0 + β2 λ0, not of β0.
For example, socioeconomic status and income are proxy variables for each other.
5.6 Measurement error
5.6.1 Measurement error in the explanatory variables
Suppose the true relationship is
Y_i = β0 + β1 Z_i + ε_i
however, Z cannot be measured accurately; all that is observable is X:
X_i = Z_i + w_i
We suppose that w ~ N(0, σ²_w) and that Z has variance σ²_Z.
Then we have Y_i = β0 + β1 X_i + ε_i − β1 w_i; let's define u_i = ε_i − β1 w_i.
So the OLS estimate is:
b1 = cov(X, Y)/var(X) = β1 + cov(X, u)/var(X)
But we know that cov(X, u) ≠ 0, since X_i and u_i are correlated (both contain w_i). Even in the case of an infinite sample, this estimate would still be inconsistent. The bias can be calculated:
cov(X, u) = cov(Z + w, ε − β1 w)
          = cov(Z, ε) − β1 cov(Z, w) + cov(w, ε) − β1 cov(w, w)
          = 0 − β1·0 + 0 − β1 σ²_w = −β1 σ²_w
thus, in large samples, the bias of the OLS estimate is
b1 − β1 → − [σ²_w / (σ²_Z + σ²_w)] β1
i.e. the estimate is biased towards zero (attenuation).
A solution to deal with measurement error is to use an alternative estimator (instrumental variables).
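A small simulation sketch of the attenuation bias (illustrative numbers): with σ²_Z = σ²_w = 1 and β1 = 2, the slope on the mismeasured regressor should converge to 2 × 1/(1+1) = 1:
. clear
. set seed 12345
. drawnorm z w e, n(5000)        // independent standard normals
. gen y = 1 + 2*z + e            // true model in terms of z
. gen x = z + w                  // observed regressor, measured with error
. reg y z                        // consistent: slope near 2
. reg y x                        // attenuated: slope near 1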
5.6.2 Measurement error in the dependent variable
Measurement error in the dependent variable does not matter as much, as it can be thought of as contributing to the error term. However, by increasing the amount of noise in the model, it will tend to increase the standard errors.
Say that the true relationship is: Q_i = β0 + β1 X_1i + ε_i
However, we only observe: Y_i = Q_i + r_i
Hence, we model: Y_i = β0 + β1 X_1i + ε_i + r_i
The population variance of the slope estimator becomes
σ²_b1 = (σ²_ε + σ²_r) / (n σ²_x1)
instead of σ²_ε / (n σ²_x1).
5.7 Sample selectivity
Sample selection occurs when the availability of the data is influenced by a selection process that is related to the value of the dependent variable. This selection introduces correlation between the error term and the regressors and therefore leads to biased OLS estimates.
Example: you are interested in the returns to education and want to regress log wage on schooling. However, only working individuals have a wage, and the probability of working is not independent of schooling, nor of the error term. The simple fact that someone has a job, and thus appears in the dataset, provides information that the error term in the regression is positive (on average) and could be correlated with the regressors; this leads to a biased OLS estimator.
Solutions to sample selectivity will be seen at the end of this course (or in Econometrics II).
5.8 Reversed causality
So far we have assumed that causality runs from the regressors to the dependent variable (X causes Y). But what if causality also runs the other way (Y causes X)? This is called reversed causality; in such cases the OLS estimator is biased and inconsistent.
Suppose we have the simultaneous system:
Y_i = β0 + β1 X_i + u_i
X_i = γ0 + γ1 Y_i + ε_i
The OLS estimate is: b1 = β1 + cov(X_i, u_i)/var(X_i), and
cov(X_i, u_i) = cov(γ0 + γ1 Y_i + ε_i, u_i)
              = γ1 cov(Y_i, u_i) + cov(ε_i, u_i)
              = γ1 cov(β0 + β1 X_i + u_i, u_i) + 0
              = γ1 β1 cov(X_i, u_i) + γ1 σ²_u
so that
cov(X_i, u_i) = γ1 σ²_u / (1 − γ1 β1)
Thus, in case of reverse causality, OLS will be an inconsistent estimator.
To deal with simultaneity bias, we can rely either on IV or on experimental data, in order to shut down one of the directions of causality.
Chapter 6: Instrumental variables regression
As seen previously, in the case of omitted variables, reversed causality or measurement error, OLS estimates can be biased. IV is a general way to obtain consistent estimators of the unknown population coefficients when the regressor X is correlated with the error term u.
6.1 Model and assumptions
To understand IV, assume that X is composed of two parts: one that is correlated with u, and one that is orthogonal to the error. IV allows you to isolate the part of X that is uncorrelated with u. The information about the exogenous component of X is gathered thanks to a (group of) variable(s) called the instrument(s).
Y_i = β0 + β1 X_i + u_i     where corr(X, u) ≠ 0
Due to omitted variables, X and u are correlated, so OLS is inconsistent. Instrumental variables estimation uses an additional variable Z (the instrumental variable) to isolate the part of X that is uncorrelated with u_i.
Variables correlated with the error term are called endogenous, while those uncorrelated with it are called exogenous.
Conditions for a valid instrument:
- corr(Z_i, X_i) ≠ 0     (the instrument is relevant)
- corr(Z_i, u_i) = 0     (the instrument is exogenous)
If an instrument is relevant, then variation in the instrument is related to variation in X_i. If in addition the instrument is exogenous, then the part of the variation of X_i captured by the instrumental variable is exogenous. Thus the instrument captures movements in X_i that are exogenous, and can therefore be used to estimate β1.
6.2 Two-stage least squares (2SLS) estimator
The first stage decomposes X into two components:
X_i = π0 + π1 Z_i + ν_i
ν_i is the component of X correlated with u; π0 + π1 Z_i is the component of X independent of u.
In the second stage, the component of X independent of u is used to estimate β1, so we regress:
Y_i = β0 + β1 X̂_i + u_i
The second stage is complicated by the fact that π̂0 and π̂1 are estimates, and thus the standard errors need to be corrected.
In large samples, the 2SLS estimator is consistent and normally distributed.
Example 1: despite all of our care, the estimate of the effect of the pupil-teacher ratio on test scores may still be biased if it suffers from omitted variable bias. As an instrument, we would like a variable that is correlated with the PTR but not with any other variables affecting student performance; it would then be exogenous, since it is uncorrelated with the error term.
California is affected by earthquakes, which may damage some schools but not others. An affected school will close down, and other schools in the area will have to double up to accommodate the extra students. Hence, distance to the earthquake epicentre may be considered a valid instrument. (More examples to follow.)
In the simple case of a one-variable model, the 2SLS estimator is given by a simple formula.
The second stage is estimated by OLS, so we know that:
β̂1 = cov(X̂, Y)/var(X̂)     where X̂ = π̂0 + π̂1 Z
Hence,
cov(X̂, Y) = π̂1 cov(Z, Y)
var(X̂) = π̂1² var(Z)
Since π̂1 is estimated by OLS in the first stage of the 2SLS, we know that π̂1 = cov(Z, X)/var(Z).
Hence: β̂1^2SLS = cov(Z, Y)/cov(Z, X)
Is the 2SLS estimator consistent?
cov(Z, Y) = 1/(n−1) Σ_{i=1..n} (Z_i − Z̄)(Y_i − Ȳ)
          = 1/(n−1) Σ_{i=1..n} (Z_i − Z̄)[β1 (X_i − X̄) + (u_i − ū)]
          = β1 cov(Z, X) + 1/(n−1) Σ_{i=1..n} (Z_i − Z̄)(u_i − ū)
          = β1 cov(Z, X) + 1/(n−1) Σ_{i=1..n} (Z_i − Z̄) u_i
hence we have:
β̂1^2SLS = β1 + [(1/n) Σ_{i=1..n} (Z_i − Z̄) u_i] / [(1/n) Σ_{i=1..n} (Z_i − Z̄)(X_i − X̄)]
(multiplying numerator and denominator by (n−1)/n).
When the sample is large, Z̄ ≈ μ_Z, i.e. the sample mean is equivalent to the population mean.
Let's define q_i = (Z_i − μ_Z) u_i; since Z is exogenous, E(q_i) = 0.
The variance of q_i is σ²_q = var[(Z_i − μ_Z) u_i], and var(q̄) = σ²_q̄ = σ²_q / n.
The CLT implies that q̄ / σ_q̄ ~ N(0, 1).
Hence β̂1^2SLS ≈ β1 + q̄ / cov(Z_i, X_i), and in large samples it is distributed approximately like a normal, N(β1, σ²_2SLS).
The CLT therefore implies that the 2SLS estimator is normally distributed, so t-tests can be used to test the significance of the estimate. The variance of the estimator is given by:
σ²_2SLS = (1/n) var[(Z_i − μ_Z) u_i] / [cov(Z_i, X_i)]²
Application: the demand for cigarettes.
Taxing cigarettes should reduce consumption and discourage smokers; how much a tax affects consumption depends on the price elasticity of the demand for cigarettes. If the elasticity is −0.5, price must rise by 40% to reduce consumption by 20%. Using data by state for 1985-95, we try to estimate this elasticity. We believe that price is endogenous (reverse causality between P and Q), hence we need to find an instrument. We rely on the sales tax. The tax is correlated with the total price, hence it satisfies the first condition for an instrument, but is it independent of the error term? The argument here centres on the fact that sales taxes vary across states for policy reasons (differences in the mix between general taxation, differences in social choices, …), so we think that the sales tax is uncorrelated with the error term.
1st stage:
. reg lnprice saletax if year==1995

Number of obs = 48;  F(1, 46) = 40.96;  Prob > F = 0.0000;  R-squared = 0.4710;  Adj R-squared = 0.4595;  Root MSE = .09394
Model SS = .361461579 (df 1);  Residual SS = .40597912 (df 46);  Total SS = .7674407 (df 47)

     lnprice |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
     saletax |   .0201633    .0031507    6.40    0.000     .0138213    .0265053
       _cons |   5.037885    .0291078  173.08    0.000     4.979294    5.096476

 ( 1)  saletax = 0
       F(  1,    46) =   40.96
            Prob > F =  0.0000
Variation in the sales tax explains 47% of the variance of cigarette prices across states.
2nd stage:

Number of obs = 48;  F(1, 46) = 8.28;  Prob > F = 0.0061;  R-squared = 0.1525;  Adj R-squared = 0.1341;  Root MSE = .22645
Model SS = .424413989 (df 1);  Residual SS = 2.35880879 (df 46);  Total SS = 2.78322278 (df 47)

         lnq |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
    lnpriceh |  -1.083587    .3766486   -2.88    0.006    -1.841741   -.3254326
       _cons |   10.17643    1.959869    5.19    0.000     6.231423    14.12145
If we believe our instrument, the elasticity of demand for cigarettes is surprisingly high. Our estimate may also be affected by omitted variable bias if, for example, the sales tax depends on income (states with higher average income may rely less on sales taxes and more on income taxes), and we know that consumption of cigarettes is correlated with income. We need to move to a more general IV model.
How does this compare with the OLS estimate:
. reg lnq lnprice if year==1995

      Source |       SS       df       MS          Number of obs =      48
-------------+----------------------------         F(  1,    46) =   31.41
       Model |  1.12929422     1  1.12929422       Prob > F      =  0.0000
    Residual |  1.65392856    46  .035954969       R-squared     =  0.4058
-------------+----------------------------         Adj R-squared =  0.3928
       Total |  2.78322278    47  .059217506       Root MSE      =  .18962

         lnq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
     lnprice |  -1.213057   .2164497    -5.60   0.000    -1.648748   -.7773661
       _cons |   10.85003   1.126459     9.63   0.000     8.582585    13.11748
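For reference, a minimal sketch of how the two stages above can be reproduced in Stata, together with the equivalent one-line 2SLS command (variable names as in the output above; the underlying dataset itself is not reproduced in these notes):

* first stage: regress the endogenous price on the instrument
reg lnprice saletax if year==1995
predict lnpriceh if e(sample)                 // fitted values of log price
* second stage: regress quantity on the fitted price (standard errors not corrected)
reg lnq lnpriceh if year==1995
* 2SLS in one step, with correctly computed standard errors
ivreg lnq (lnprice = saletax) if year==1995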
5.3 General IV model
In general, a model will include a set of exogenous variables (W) that are not correlated with u and an endogenous variable (X) that needs to be instrumented by (Z). If there is more than one endogenous variable, we need at least as many instruments as there are endogenous variables. The regression is exactly identified if there are as many instruments as endogenous variables, and over-identified if more instruments are available. An under-identified model cannot be estimated.
The model of interest is therefore:
Y_i = \beta_0 + \beta_1 X_1 + \beta_2 W_2 + \ldots + \beta_r W_r + u
The first stage is modelled as:
X_1 = \pi_0 + \pi_1 Z_1 + \ldots + \pi_m Z_m + \pi_{m+1} W_2 + \ldots + \pi_{m+r} W_r + \nu
Note that all exogenous variables (W) are included in the first-stage regression. If there is more than one endogenous variable, each endogenous variable is estimated in a similar first-stage regression, and all predicted values are included in the second stage.
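As an illustration, a minimal sketch of this general model in Stata; all variable names here (y, x1, z1, z2, w2, w3) are hypothetical:

* 2SLS with one endogenous regressor, two instruments and two exogenous controls
ivreg y w2 w3 (x1 = z1 z2)
* equivalently, by hand: the first stage must include ALL the exogenous variables
reg x1 z1 z2 w2 w3
predict x1hat
reg y x1hat w2 w3        // second stage; these standard errors are not the correct 2SLS ones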
5.4 Instruments' validity
Invalid instruments produce meaningless results; it is therefore crucial to test the instruments' validity.
5.4.1 Relevance
The more relevant the instrument, the more exogenous variation in X is captured. A more relevant instrument produces a more accurate estimator.
Remember:
\hat{\beta}_1^{2SLS} = \beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})u_i}{\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar{Z})(X_i - \bar{X})}
Inference with IV hinges on the 2SLS estimator having an approximately normal distribution. As with a larger sample size, the higher the correlation between Z and X, the better the normal approximation. Instruments that explain little of the variation in X are called weak instruments. If the instrument is weak, the normal approximation is poor even if the sample is large, and the 2SLS estimator is biased.
In large samples, \bar{Z} and \bar{X} are close to the population means \mu_Z and \mu_X. As previously, let us define:
\bar{q} = \frac{1}{n}\sum_{i=1}^{n} q_i, \quad q_i = (Z_i - \mu_Z)u_i, \qquad \bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i, \quad r_i = (Z_i - \mu_Z)(X_i - \mu_X)
Hence:
\hat{\beta}_1^{2SLS} - \beta_1 \cong \frac{\bar{q}}{\bar{r}} = \frac{\bar{q}/\sigma_{\bar{r}}}{\bar{r}/\sigma_{\bar{r}}}
since \sigma^2_{\bar{r}} = \frac{1}{n}\sigma^2_r.
If the instrument is irrelevant, E(r_i) = cov(Z_i, X_i) = 0, and the central limit theorem applies to the denominator, so that \bar{r}/\sigma_{\bar{r}} \sim N(0,1).
When the instrument is irrelevant, the bias term is therefore a function of the ratio of two normally distributed random variables, which are correlated (since X and u are correlated).
When the instrument is weak, the distribution of the 2SLS estimator is still non-normal.
It can be shown that:
\hat{\beta}_1^{2SLS} - \beta_1 \cong \frac{\hat{\beta}_1^{OLS} - \beta_1}{E(F) - 1}
where E(F) is the expectation of the first-stage F statistic.
If E(F) = 10, then the large-sample bias of 2SLS relative to the large-sample bias of OLS is 1/9, or just over 10%.
As a rule of thumb, instruments are called weak when the F-statistic on the coefficients of the instruments in the first-stage regression is less than 10.
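A minimal sketch of this check in Stata, using the same hypothetical variable names as above:

reg x1 z1 z2 w2 w3          // first-stage regression
test z1 z2                  // joint F-test on the instruments only
* worry about weak instruments if the resulting F-statistic is below 10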
Problems with weak instruments: an example.
Angrist and Krueger (1991) rely on quarter of birth as an instrument for education (via the school leaving age, SLA). However, Bound, Jaeger and Baker (1995) noticed that in some regressions the instruments were weak (F-test = 2). They then ran the same regression using a randomly generated quarter of birth and found results similar to those of Angrist and Krueger. Since the sample size used was extremely large, the estimates were statistically significant...
If you find yourself with weak instruments, it is possible to use the Limited Information Maximum Likelihood estimator, which under these circumstances performs better than 2SLS (see Hayashi, 2000, for example).
5.4.2 Instrument exogeneity
If the instrument is not exogenous, it is related to the error term and therefore cannot capture the exogenous component of X.
Unfortunately, there is no statistical test of the exogeneity of the instrument, so you have to rely on judgement and argument.
If you have more instruments than endogenous variables, you can run an overidentifying restrictions test.
Imagine that you have 2 instruments and 1 endogenous variable. You could run 2SLS using instrument 1 or instrument 2; if the instruments are both valid (exogenous), the estimates should not be too different (since they both capture the exogenous component of X). If they are very different, then at least one of the instruments is not exogenous.
In fact, rather than estimating the various combinations of the instruments, the overidentification test simply consists of estimating the following regression and testing that the coefficients on the instruments are jointly insignificant:
\hat{u}_i^{2SLS} = \delta_0 + \delta_1 Z_{1i} + \ldots + \delta_m Z_{mi} + \delta_{m+1} W_{1i} + \ldots + \delta_{m+r} W_{ri} + e_i
H_0: \delta_1 = \ldots = \delta_m = 0
The statistic J = mF is, in large samples, distributed as a \chi^2_{m-k}, where m - k is the degree of overidentification (m instruments, k endogenous regressors).
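A minimal sketch of this test in Stata, continuing with the hypothetical names used above (two instruments, one endogenous regressor, so m - k = 1):

ivreg y w2 w3 (x1 = z1 z2)
predict u2sls, resid
* regress the 2SLS residuals on all instruments and exogenous variables
reg u2sls z1 z2 w2 w3
test z1 z2                         // joint F-test on the m = 2 instruments
display "J = " 2*r(F)              // compare with the critical value of a chi2(1)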
5.5 Examples of IV
5.5.1 Example from the returns to education literature
Measurement error: b is biased downward (measurement error is around 10%).
Omitted variable bias: b is biased upward.
There are two avenues to find a good instrument:
- Economic theory (for example, differences in cost that are independent of ability or earnings)
  - college proximity (Kane and Rouse, 1993)
- Institutional constraints (natural experiments)
  - quarter of birth (Angrist and Krueger, 1991)
  - change in compulsory schooling law (Harmon and Walker, 1995)
  - war draft (Angrist and Krueger, 1992)
[table 4, Hdbook, p1835-36]
Econometric theory had predicted that, due to omitted ability, OLS estimates were biased upwards; however, all IV estimates are at least 30% higher than OLS. How can we explain this puzzle?
- The omitted variable bias is small compared to the measurement error bias, so OLS is biased downwards. However, since measurement error is around 10%, how can the IV estimates be 30% higher than OLS?
- Publication bias: Ashenfelter and Harmon (1998). Researchers and publishers prefer high point estimates (there is evidence of a positive relationship between (b_IV - b_OLS) and \sigma^2_{IV}).
- Heterogeneity in the returns to education: if the individuals affected by the instruments have higher than average returns to education, then IV returns will be higher than OLS returns, which are based on the mean. A condition for this to happen is that the marginal rate of return to education is negatively correlated with educational attainment.
5.5.2 A more detailed example
[Table not reproduced in these notes: 2SLS estimates of the demand for cigarettes, columns (1)-(3), using the general sales tax and the cigarette-specific tax as instruments.]
To avoid state-specific effects, we use differences over a 10-year period rather than annual variables. This strategy allows us to eliminate any state fixed effects. We therefore measure the long-term elasticity.
The first stage reveals that the instruments are significant.
Are the instruments exogenous? In column (3), the J-stat = 4.93 > critical chi2 value = 3.84, so we reject H0 and conclude that at least one instrument is endogenous. The estimates from columns (1) and (2) are too dissimilar to accept that both instruments have captured the exogenous component of X. One may argue that the general sales tax is more likely to be exogenous than the cigarette-specific tax (for example, if consumption is reduced, the pro-smoking lobby will lose power relative to the anti-smoking lobby, and politicians may give in and increase the cigarette tax).
5.6 LATE
So far we have assumed that causal effects are the same for all individuals. Imbens and Angrist (1994) and Angrist, Imbens and Rubin (1996) have suggested that the IV estimate can be interpreted as the effect of the treatment on the subpopulation of treated individuals who changed behaviour because of the instrument, that is:
\beta = E(Y_{i1} \mid D_i = 1, Z_i = 1) - E(Y_{i0} \mid D_i = 0, Z_i = 0) \qquad (2.6)
They call this interpretation the Local Average Treatment Effect (LATE). LATE requires some assumptions.
Assumption 1: Stable Unit Treatment Value
If Zi=Zi’ then Di(Z)=Di(Z’)
If Zi=Zi’ and Di=Di’ then Yi (Z,D)=Yi (Z’,D’)
For individual i, the decision to obtain treatment and the outcome are independent of the
instrument, treatment and outcome of other individuals.
Assumption 2: Random Assignment
Pr(Z_i = z) = Pr(Z_i = z')
Assumption 3: Exclusion Restriction
\forall Z, Z', D: \; Y(Z, D) = Y(Z', D)
The instrument affects the outcome only through the treatment; Z has no direct effect on Y.
Assumption 4: Nonzero average causal effect of Z on D
E[D_i(Z = 1) - D_i(Z = 0)] \neq 0
Imbens and Angrist's (1994) contribution is the final assumption:
Assumption 5: Monotonicity
\forall i, \; D_i(1) \geq D_i(0)
The population can be divided into four groups:
D(0)=0, D(1)=0  =>  Y(0,0)=Y(1,0)   Never taker
D(0)=0, D(1)=1  =>  Y(0,0)<Y(1,1)   Complier
D(0)=1, D(1)=0  =>  Y(0,1)>Y(1,0)   Defier
D(0)=1, D(1)=1  =>  Y(0,1)=Y(1,1)   Always taker
The monotonicity assumption rules out the defier category. LATE is therefore identified by compliers only (similar to a fixed effects model).
Proof of LATE as in Angrist and Krueger (1999):
E[Y_i \mid Z_i = 1] = E[Y_{i0} + (Y_{i1} - Y_{i0})D_i \mid Z_i = 1] = E[Y_{i0} + (Y_{i1} - Y_{i0})D_{i1}] \quad \text{by independence}
Similarly,
E[Y_i \mid Z_i = 0] = E[Y_{i0} + (Y_{i1} - Y_{i0})D_{i0}]
Hence,
E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0] = E[(Y_{i1} - Y_{i0})(D_{i1} - D_{i0})] = E[(Y_{i1} - Y_{i0}) \mid D_{i1} > D_{i0}] \cdot P(D_{i1} > D_{i0}) \quad \text{by monotonicity}
Similarly,
E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0] = E[D_{i1} - D_{i0}] = P(D_{i1} > D_{i0})
Thus,
\frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]} = E[Y_{i1} - Y_{i0} \mid D_{i1} > D_{i0}]
The treatment effect identified is an average for those who can be induced to change participation status by a change in the instrument.
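With a single binary instrument and no covariates, this ratio is just the Wald estimator, which 2SLS reproduces. A minimal sketch in Stata with hypothetical variable names (y outcome, d binary treatment, z binary instrument):

reg y z              // numerator: E[Y|Z=1] - E[Y|Z=0] is the slope on z
reg d z              // denominator: E[D|Z=1] - E[D|Z=0] is the slope on z
ivreg y (d = z)      // the 2SLS coefficient on d equals the ratio of the two slopes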
Dilemma:
If the instrument is specific, then there is little scope for extrapolating the returns to the treatment to the general population (Heckman's argument).
If the instrument is generic, the monotonicity assumption is likely to be rejected.
Chapter 6
Simultaneous equations estimation
Most of the material for this chapter has been covered in the last two weeks, so this is only a reminder of why OLS estimates are biased in a system of equations and how to use IV to get unbiased estimates.
6.1 Simultaneous equations
So far we have assumed that causality runs from the regressors to the dependent variable (X causes Y), but what if causality also runs the other way (Y causes X)? This is called reverse causality; in such cases the OLS estimator is biased and inconsistent.
Suppose we have the simultaneous system:
Y_i = \alpha_0 + \alpha_1 X_i + \alpha_2 W_i + u_i \qquad (6.1)
X_i = \beta_0 + \beta_1 Y_i + \beta_2 K_i + \varepsilon_i
Endogenous variables are determined within the system of equations whilst exogenous variables are determined outside it; here X and Y are endogenous variables, and W and K are exogenous variables. The exogenous variables and the disturbance terms eventually determine the values of the endogenous variables when the system is solved; this is usually referred to as the reduced form of the system.
The reduced forms of (6.1) are obtained by substituting X by its value in the first equation and similarly Y by its value in the second equation:
Y_i = \alpha_0 + \alpha_1(\beta_0 + \beta_1 Y_i + \beta_2 K_i + \varepsilon_i) + \alpha_2 W_i + u_i
Y_i = \frac{1}{1 - \alpha_1\beta_1}\left(\alpha_0 + \alpha_1\beta_0 + \alpha_1\beta_2 K_i + \alpha_2 W_i + \alpha_1\varepsilon_i + u_i\right) \qquad (6.2)
X_i = \beta_0 + \beta_1(\alpha_0 + \alpha_1 X_i + \alpha_2 W_i + u_i) + \beta_2 K_i + \varepsilon_i
X_i = \frac{1}{1 - \alpha_1\beta_1}\left(\beta_0 + \beta_1\alpha_0 + \beta_1\alpha_2 W_i + \beta_2 K_i + \beta_1 u_i + \varepsilon_i\right) \qquad (6.3)
In the reduced form equations, Y and X are expressed as functions of exogenous variables and error terms only. Since X is a function of u, the first equation of (6.1) clearly breaks the independence assumption (cov(X, u) = 0). Thus, if we were to estimate this equation by OLS, b_1 would be a biased estimate of \alpha_1. Similarly, Y and \varepsilon are correlated, and the OLS estimate of \beta_1 would also be biased.
The OLS estimate is:
b_1 = \alpha_1 + \frac{cov(X_i, u_i)}{var(X_i)}
cov(X_i, u_i) = cov(\beta_0 + \beta_1 Y_i + \beta_2 K_i + \varepsilon_i,\; u_i) = \beta_1 cov(Y_i, u_i) + cov(\varepsilon_i, u_i)
 = \beta_1 cov(\alpha_0 + \alpha_1 X_i + \alpha_2 W_i + u_i,\; u_i) + 0 = \alpha_1\beta_1 cov(X_i, u_i) + \beta_1\sigma^2_u
\Rightarrow cov(X_i, u_i) = \frac{\beta_1 \sigma^2_u}{1 - \alpha_1\beta_1}
var(X_i) = var\left[\frac{1}{1 - \alpha_1\beta_1}\left(\beta_0 + \beta_1\alpha_0 + \beta_1\alpha_2 W_i + \beta_2 K_i + \beta_1 u_i + \varepsilon_i\right)\right]
 = \frac{1}{(1 - \alpha_1\beta_1)^2}\, var\left(\beta_1\alpha_2 W_i + \beta_2 K_i + \beta_1 u_i + \varepsilon_i\right)
Since W, K, u and \varepsilon are exogenous and mutually uncorrelated, all their covariances are 0, hence:
var(X_i) = \frac{1}{(1 - \alpha_1\beta_1)^2}\left(\beta_1^2\alpha_2^2\sigma^2_W + \beta_2^2\sigma^2_K + \beta_1^2\sigma^2_u + \sigma^2_\varepsilon\right)
Thus,
b_1 = \alpha_1 + \frac{(1 - \alpha_1\beta_1)\,\beta_1\sigma^2_u}{\beta_1^2\alpha_2^2\sigma^2_W + \beta_2^2\sigma^2_K + \beta_1^2\sigma^2_u + \sigma^2_\varepsilon}
b_1 is an inconsistent estimator of \alpha_1.
A Monte Carlo Experiment
Say that we have the following true model:
p = 1.5 + 0.5 w + u_p
w = 2.5 + 0.5 p + 0.4 U + u_w
(with \alpha_0 = 1.5, \alpha_1 = 0.5, \beta_0 = 2.5, \beta_1 = 0.5 and \beta_2 = 0.4 in the notation of the previous section). Remember, we know that the reduced forms are:
p = \frac{1}{1 - \alpha_1\beta_1}\left(\alpha_0 + \alpha_1\beta_0 + \alpha_1\beta_2 U + \alpha_1 u_w + u_p\right)
w = \frac{1}{1 - \alpha_1\beta_1}\left(\beta_0 + \beta_1\alpha_0 + \beta_2 U + \beta_1 u_p + u_w\right)
p and w are thus both affected by variation in u_p; the induced change in p is 1/\beta_1 times the change in w, so the affected observations are shifted along a line with slope 1/\beta_1. OLS is a compromise between the slope of the true relationship (\alpha_1) and the slope of the shift line (1/\beta_1).
A sample of 20 observations is generated, 10 times:
      Source |       SS       df       MS          Number of obs =      20
-------------+----------------------------         F(  1,    18) =   23.70
       Model |  16.1992808     1  16.1992808       Prob > F      =  0.0001
    Residual |  12.3040439    18  .683557993       R-squared     =  0.5683
-------------+----------------------------         Adj R-squared =  0.5443
       Total |  28.5033247    19  1.50017498       Root MSE      =  .82678

           p |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
           w |   1.031237   .2118354     4.87   0.000     .5861878    1.476287
       _cons |   .4081077   .4563612     0.89   0.383    -.5506717    1.366887
when we know that the true slope and intercept are 0.5 and 1.5. Replicating this 10 times, we get:

  sample      b0    se(b0)      b1    se(b1)
       1    0.41      0.46    1.03      0.21
       2    0.45      0.38    1.06      0.17
       3    0.65      0.27    0.94      0.12
       4    0.41      0.39    0.98      0.19
       5    0.92      0.46    0.77      0.22
       6    0.26      0.35    1.09      0.16
       7    0.32      0.39    1.00      0.19
       8    1.06      0.38    0.82      0.16
       9   -0.08      0.36    1.16      0.18
      10    1.12      0.43    0.69      0.20

The estimates are heavily biased: every estimate of the slope is above 0.5 and every estimate of the intercept is below 1.5.
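A minimal sketch of how one such replication can be generated in Stata. The notes do not specify the distributions used for U and the disturbances, so standard normal draws are assumed here; the data are built directly from the reduced forms derived above:

clear
set obs 20
set seed 12345
gen U  = invnorm(uniform())       // assumed standard normal; the notes do not specify this
gen up = invnorm(uniform())
gen uw = invnorm(uniform())
* reduced form for w, then the structural equation for p
gen w = (2.5 + 0.5*1.5 + 0.4*U + 0.5*up + uw)/(1 - 0.5*0.5)
gen p = 1.5 + 0.5*w + up
reg p w                           // the OLS slope is biased away from the true value 0.5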
6.2 Instrumental variables estimation
As we saw last week,
b_1^{2SLS} = \frac{cov(Z, Y)}{cov(Z, X)} = \frac{cov(Z, p)}{cov(Z, w)}
The reduced forms show that w is correlated with U, but since U is exogenous, we know that U is not correlated with u_p. Thus, we can use U as an instrument for w.
b_1^{IV} = \frac{cov(U, \alpha_0 + \alpha_1 w + u_p)}{cov(U, w)} = \frac{cov(U, \alpha_0) + \alpha_1 cov(U, w) + cov(U, u_p)}{cov(U, w)} = \alpha_1 + \frac{cov(U, u_p)}{cov(U, w)}
. reg w U

      Source |       SS       df       MS          Number of obs =      20
-------------+----------------------------         F(  1,    18) =   20.91
       Model |  8.18671935     1  8.18671935       Prob > F      =  0.0002
    Residual |  7.04603377    18   .39144632       R-squared     =  0.5374
-------------+----------------------------         Adj R-squared =  0.5117
       Total |  15.2327531    19  .801723848       Root MSE      =  .62566

           w |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
           U |  -.4438172   .0970477    -4.57   0.000    -.6477069   -.2399275
       _cons |   3.911334   .4470388     8.75   0.000     2.972141    4.850528

. predict what if e(sample)
(option xb assumed; fitted values)
(80 missing values generated)
. reg p what

      Source |       SS       df       MS          Number of obs =      20
-------------+----------------------------         F(  1,    18) =    0.20
       Model |   .31919131     1   .31919131       Prob > F      =  0.6570
    Residual |  28.1841334    18  1.56578519       R-squared     =  0.0112
-------------+----------------------------         Adj R-squared = -0.0437
       Total |  28.5033247    19  1.50017498       Root MSE      =  1.2513

           p |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
        what |   .1974561   .4373319     0.45   0.657    -.7213441    1.116256
       _cons |   2.050352   .9056883     2.26   0.036     .1475713    3.953132
In most cases, it can be argued that some variables are endogenous (or measured with error), but is this problem sufficient to warrant the use of IV? If we suspect simultaneity bias, OLS will be inconsistent and IV is to be preferred. However, if there is no endogeneity, both OLS and IV will be consistent, but OLS will be more efficient. We can test whether IV is needed by using a Durbin-Wu-Hausman test. The Hausman test asks whether the estimates obtained with IV are statistically different from those obtained with OLS. The difference in the estimates is distributed as a chi-squared with k degrees of freedom, where k is the number of instrumented variables.
H_0: \beta_{IV} = \beta_{OLS} \quad vs. \quad \beta_{IV} \neq \beta_{OLS}
If \chi^2(k) < \chi^2_c(k), accept the null: the coefficients are not statistically different, so use OLS, which is more efficient. Otherwise, reject the null and use IV.
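An equivalent, regression-based version of this test (a control-function or augmented-regression form, sketched here for the simulated example above) adds the first-stage residual to the structural equation and tests its significance:

reg w U                  // first stage
predict vhat, resid
reg p w vhat             // structural equation augmented with the first-stage residual
* a significant t-statistic on vhat is evidence that w is endogenous and that IV is needed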
. ivreg p (w=U)

Instrumental variables (2SLS) regression

      Source |       SS       df       MS          Number of obs =      20
-------------+----------------------------         F(  1,    18) =    0.25
       Model |  5.60960442     1  5.60960442       Prob > F      =  0.6225
    Residual |  22.8937202    18  1.27187335       R-squared     =  0.1968
-------------+----------------------------         Adj R-squared =  0.1522
       Total |  28.5033247    19  1.50017498       Root MSE      =  1.1278

           p |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
           w |   .1974561   .3941549     0.50   0.622    -.6306327    1.025545
       _cons |   2.050352   .8162714     2.51   0.022     .3354291    3.765274
Instrumented:  w
Instruments:   U

. hausman, save
Note that the ivreg command in Stata computes the two-stage least squares estimates and corrects the standard errors.
. reg p w

      Source |       SS       df       MS          Number of obs =      20
-------------+----------------------------         F(  1,    18) =   23.70
       Model |  16.1992808     1  16.1992808       Prob > F      =  0.0001
    Residual |  12.3040439    18  .683557993       R-squared     =  0.5683
-------------+----------------------------         Adj R-squared =  0.5443
       Total |  28.5033247    19  1.50017498       Root MSE      =  .82678

           p |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
           w |   1.031237   .2118354     4.87   0.000     .5861878    1.476287
       _cons |   .4081077   .4563612     0.89   0.383    -.5506717    1.366887

. hausman, constant sigmamore
                  ---- Coefficients ----
             |      (b)          (B)          (b-B)     sqrt(diag(V_b-V_B))
             |     Prior       Current      Difference         S.E.
-------------+--------------------------------------------------------------
           w |   .1974561     1.031237      -.8337813       .1965241
       _cons |   2.050352     .4081077       1.642244       .3870806
-------------+--------------------------------------------------------------
           b = less efficient estimates obtained previously from ivreg
           B = fully efficient estimates obtained from regress

Test:  Ho:  difference in coefficients not systematic

        chi2(  1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                  =    18.00
        Prob>chi2 =    0.0000
The chi-squared critical value with 1 degree of freedom is 10.8 at the 0.1% level, so at all conventional levels of statistical significance we can reject the null hypothesis that the OLS and IV estimates are the same; therefore the model should be estimated by IV.
Chapter 7 Experiments and quasi experiments
Experiments are fairly rare in social sciences; however, there are three reasons to study them: 1) they provide a benchmark against which to judge estimates of causal effects in practice, 2) they help us understand the limits of, and threats to, the validity of experiments, and 3) they help us understand quasi-experiments (natural experiments). This is the part of econometrics called programme evaluation, which is concerned with estimating the effect of a programme (policy intervention) on the treated population.
7.1 Idealised experiments and causal effects
A randomised controlled experiment randomly selects subjects from a population of interest then
randomly assigns them either to the treatment group or control group – the treatment is assigned
independently of any of the determinants of the outcome (thus no omitted variable bias). The
causal effect of the treatment is the difference in the mean outcome of the two groups.
Say X_i indicates the level of treatment received by individual i; this could be a binary variable (treated X=1, control X=0) or continuous (indicating the intensity of the treatment). If X_i is randomly assigned, then X_i is distributed independently of the omitted factors u_i. Random assignment of the treatment means that E(u_i | X_i) = 0 holds automatically.
Hence,
E(Y_i \mid X_i) = \beta_0 + \beta_1 X_i \qquad (7.1)
The causal effect of the treatment on Y is then simply the difference in the conditional expectations:
ATE = E(Y \mid X = x) - E(Y \mid X = 0)
Because of random assignment, \beta_1 measures the causal effect of a unit change in X, and is the treatment effect on Y.
The causal effect can be estimated by the difference in the mean outcomes of the two groups or, equivalently, by \hat{\beta}_1. This is known as the difference estimator. By randomly assigning treatment, an ideal randomised controlled experiment eliminates correlation between the treatment X_i and the error u_i, so that the difference estimator is an unbiased and consistent estimate of the causal effect of the treatment on Y.
. gen y=1.5+2*treat+u       /* true model */
Compmean y, group(treat)

Two-sample t test with equal variances
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      54    1.550462    .1313592     .965289    1.286988    1.813935
       y |      46    3.423776    .1564858    1.061338    3.108598    3.738955
---------+--------------------------------------------------------------------
combined |     100    2.412186     .137527     1.37527    2.139303     2.68507
---------+--------------------------------------------------------------------
    diff |           -1.873315    .2027553               -2.275676   -1.470953
------------------------------------------------------------------------------
Degrees of freedom: 98

   Ho: mean(x) - mean(y) = diff = 0

     Ha: diff < 0               Ha: diff ~= 0              Ha: diff > 0
       t =  -9.2393               t =  -9.2393               t =  -9.2393
   P < t =   0.0000          P > |t| =   0.0000           P > t =  1.0000
. reg y treat

      Source |       SS       df       MS          Number of obs =     100
-------------+----------------------------         F(  1,    98) =   85.36
       Model |  87.1712191     1  87.1712191       Prob > F      =  0.0000
    Residual |  100.074215    98  1.02116546       R-squared     =  0.4655
-------------+----------------------------         Adj R-squared =  0.4601
       Total |  187.245434    99  1.89136802       Root MSE      =  1.0105

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
       treat |   1.873315   .2027553     9.24   0.000     1.470953    2.275676
       _cons |   1.550462   .1375154    11.27   0.000     1.277567    1.823356
7.2 Potential problems with experiments in practice
*Threats to internal validity
- Failure to randomise
Failure to randomise the assignment to the control and treated groups means that assignment is correlated with some of the individuals' characteristics and therefore with the error term. This leads to biased estimates of the causal effect, since they capture both the treatment and selection effects.
- Failure to follow treatment protocol
People do not always follow what they are told. For example, some of the individuals assigned to the treatment may refuse to take it; similarly, some of the control individuals may manage to get some treatment. Because there is an element of choice in whether the subject receives the treatment, X_i is going to be correlated with u_i. Alternatively, some individuals may forget to take the treatment, which is equivalent to a measurement error bias. In all cases, partial compliance will lead to biased estimates of the causal effect of the treatment.
- Attrition
This refers to subjects dropping out of the study after being allocated to a group. If attrition occurs for reasons correlated with the assignment to the programme, the treatment is correlated with u_i.
- Experimental effects
The mere fact that subjects are in an experiment can change their behaviour (the Hawthorne effect). For example, individuals in the experiment may increase their effort. To alleviate this problem it is sometimes possible to use a double-blind protocol where the subjects do not know which group they have been allocated to (usually not feasible in social science). Example: teachers and small classes.
- Small samples
Because experiments are costly, sample sizes can be quite small. The estimates may then be imprecise.
*Threats to external validity
These compromise the ability to extrapolate the results of the study to other populations.
- Non-representative sample
The population studied and the population of interest must be sufficiently similar.
- Non-representative programme
The programme studied must be sufficiently similar to the policy implemented; differences in funding and duration can also be important.
- General equilibrium effects
Turning a small-scale, short-duration programme into a widespread permanent programme might change the economic environment sufficiently that the results of the experiment cannot be generalised. Reducing class size will affect the demand for teachers and potentially attract individuals of lower quality into teaching, thus reducing the effect of the programme.
- Treatment vs. eligibility effects
When the programme becomes a policy, participation is no longer random but voluntary. Thus the experiment will not provide an unbiased estimator of the effect of the programme as implemented.
7.3 Regression estimators of causal effects
7.3.1 Difference estimator with additional regressors
Characteristics that are relevant to determining the experimental outcome can be added. For example, in a drug trial, age, gender and pre-existing medical conditions are important determinants of the outcome of the experiment. The model estimated becomes:
Y_i = \beta_0 + \beta_1 X_i + \beta_2 W_i + u_i \qquad (7.2)
It is important that none of the components of W_i are outcomes of the experiment, otherwise W_i is endogenous. So W_i has to consist of pre-treatment characteristics.
Including additional regressors means that:
- b_1 is more efficient (smaller variance), since the variance of the error term is reduced by the inclusion of the additional variables.
- If the treatment is assigned in a way related to W, then the difference estimator (7.1) is inconsistent and differs from (7.2). Thus a large discrepancy between the two OLS estimates suggests that X_i was not randomly allocated.
- The assignment may depend on pre-treatment characteristics; including these characteristics controls for the probability that the participant is assigned to the treatment.
7.3.2 Differences-in-differences estimator
If data are available pre- and post-treatment, then a differences-in-differences estimator can be computed. The estimator of the treatment effect is then the difference in the change in Y between the treated and control groups:
b_1^{dd} = \left[E(Y^1 \mid X = 1) - E(Y^0 \mid X = 1)\right] - \left[E(Y^1 \mid X = 0) - E(Y^0 \mid X = 0)\right]
If the treatment is randomly assigned, then b^{dd} is an unbiased and consistent estimator of the causal effect.
- Diff-in-diff is more efficient than the difference estimator if some unobserved determinants of Y_i are persistent over time for a given individual.
- The diff-in-diff estimator eliminates pre-treatment differences in Y. If the treatment is correlated with the initial level of Y_i, then the difference estimator is biased.
[Figure: outcome Y for the treated (Y0/x=1) and control (Y0/x=0) groups at times 0 and 1, illustrating the difference and difference-in-differences estimators]
\Delta Y = E(Y^1 \mid x = 1) - E(Y^1 \mid x = 0) = 80 - 30 = 50
\Delta\Delta Y = \left[E(Y^1 \mid x = 1) - E(Y^0 \mid x = 1)\right] - \left[E(Y^1 \mid x = 0) - E(Y^0 \mid x = 0)\right] = (80 - 40) - (30 - 20) = 30
The diff-in-diff estimate removes the influence of initial values of Y that vary systematically between the treatment and control groups.
Depending on the way your data are organised, the differences-in-differences estimate can be obtained as follows.
If each line contains the outcome both before and after, then the model takes the following form for the change in outcome:
\Delta y_i = \beta_0 + \beta_1 treat_i + u_i
. reg dy treat

      Source |       SS       df       MS          Number of obs =     100
-------------+----------------------------         F(  1,    98) =   34.84
       Model |   81.078286     1   81.078286       Prob > F      =  0.0000
    Residual |  228.058225    98  2.32712474       R-squared     =  0.2623
-------------+----------------------------         Adj R-squared =  0.2547
       Total |   309.13651    99  3.12259101       Root MSE      =  1.5255

          dy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
       treat |    1.80666   .3060794     5.90   0.000     1.199256    2.414065
       _cons |  -.4559498   .2075931    -2.20   0.030    -.8679116    -.043988
If each observation is instead reported on two different lines (one per period), then you need to create an interaction term between the treatment group and the time period. The coefficient on this interaction is your differences-in-differences estimate:
y = \beta_0 + \beta_1 dt + \beta_2 treat + \beta_3 year + u
where treat = 1 if the individual belongs to the treated group (in both periods), year = 1 in the post-treatment period, and dt = treat*year is therefore 1 only for treated individuals in the post-treatment period.
. reg y dt treat year

      Source |       SS       df       MS          Number of obs =     200
-------------+----------------------------         F(  3,   196) =   31.84
       Model |  94.3171008     3  31.4390336       Prob > F      =  0.0000
    Residual |  193.507926   196  .987285338       R-squared     =  0.3277
-------------+----------------------------         Adj R-squared =  0.3174
       Total |  287.825027   199  1.44635692       Root MSE      =  .99362

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
          dt |    1.80666   .2819425     6.41   0.000      1.25063    2.362691
       treat |   .0666546   .1993635     0.33   0.738    -.3265183    .4598275
        year |  -.4559498   .1912227    -2.38   0.018     -.833068   -.0788316
       _cons |   2.006411   .1352149    14.84   0.000     1.739749    2.273074
The coefficient on treat measures the effect of the fixed unobserved characteristics of the treated group on the outcome (because of random assignment, this is not significantly different from 0). The coefficient on year measures the effect of time, which is identical for the two groups.
[Figure: mean outcome Y for the treated and control groups in periods 0 and 1, decomposing the change over time into the common time effect and the treatment effect]
If you suspect that the causal effect may vary by subgroups (Z), then you can interact X and Z to
estimate the effect of the treatment for the group for which Z=0, and for which Z=1 (assuming Z
is binary).
. reg dy treat w0 w0t

      Source |       SS       df       MS          Number of obs =     100
-------------+----------------------------         F(  3,    96) =   39.37
       Model |  170.535057     3  56.8450192       Prob > F      =  0.0000
    Residual |  138.601453    96  1.44376514       R-squared     =  0.5516
-------------+----------------------------         Adj R-squared =  0.5376
       Total |   309.13651    99  3.12259101       Root MSE      =  1.2016

          dy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
       treat |   1.961061   .2661259     7.37   0.000     1.432805    2.489316
          w0 |  -2.752882   .4867962    -5.66   0.000    -3.719165   -1.786599
         w0t |   .4011066   .6491931     0.62   0.538    -.8875315    1.689745
       _cons |  -.0990947   .1752667    -0.57   0.573    -.4469963    .2488069
Individuals for whom w0=1 have a significantly lower change in outcome, but treated individuals from this subgroup benefit somewhat more from the treatment (the interaction coefficient of .40 is insignificant).
Testing for randomisation
If the treatment is randomly assigned, then X_i will not be correlated with the observable characteristics of the individuals. This can be tested by regressing X_i (the treatment) on the observables and conducting an F-test.
H_0: \beta = 0 \quad vs. \quad \beta \neq 0
. reg treat w0

      Source |       SS       df       MS          Number of obs =     100
-------------+----------------------------         F(  1,    98) =    1.35
       Model |  .336810773     1  .336810773       Prob > F      =  0.2486
    Residual |  24.5031892    98  .250032543       R-squared     =  0.0136
-------------+----------------------------         Adj R-squared =  0.0035
       Total |       24.84    99  .250909091       Root MSE      =  .50003

       treat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+--------------------------------------------------------------
          w0 |   .1545004   .1331174     1.16   0.249    -.1096667    .4186675
       _cons |   .4337349   .0548857     7.90   0.000     .3248161    .5426538

( 1)  w0 = 0.0
      F(  1,    98) =    1.35
           Prob > F =  0.2486
7.4 Quasi-experiments
In a quasi- (or natural) experiment, randomness is introduced by variations in individual circumstances that make it appear as if the treatment were randomly assigned. These variations in individual circumstances might arise because of legal institutions, location, timing of policies, natural events or other factors that are unrelated to the causal effect under study.
If the quasi-experiment determines treatment, then the analysis can be conducted using the previous econometric methods as if it were a controlled experiment (example of the Mariel boatlift).
If the quasi-experiment influences but does not completely determine the treatment, then the natural experiment can be used as an instrument for X (example of a change in the school leaving age).
Since a quasi-experiment only approximates random assignment of the treatment, it is important to control for pre-treatment characteristics.
Natural experiments are often estimated using pooled cross sections. Say a policy was introduced in region A at period 1, but not in region B. We have data from two cross sections, one before the policy was introduced, the other after the policy was introduced in region A. We define the following variables: survey = 1 if the observation is taken from the second survey and 0 otherwise; region = 1 if the observation is from region A (whatever survey it comes from); and treat = region*survey, which is therefore 1 for individuals in region A after the policy was introduced. b_1 measures the causal effect of the policy on Y:
y = \beta_0 + \beta_1 treat + \beta_2 survey + \beta_3 region + u
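A minimal sketch of this estimator in Stata (region and survey are the dummies defined above; the dataset itself is hypothetical):

gen treat = region*survey
reg y treat survey region
* the coefficient on treat is the differences-in-differences estimate of the policy effect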
* Threats to internal validity
Quasi-experiments rely on the treatment being as good as randomly assigned; this can be tested by regressing Z on W and testing that all the coefficients are insignificant.
When the population is heterogeneous, the estimated effect is the average causal effect. When the experiment is used as an instrument, the estimate is a weighted average of the causal effects, where those most affected by the instrument receive higher weights.
Suppose now that Z has no influence on the treatment decision for a fraction p of the population and a nonzero influence on the treatment decision for the remaining fraction 1-p. The 2SLS estimator is a consistent estimator of the average treatment effect for the part of the population for which the instrument influences the treatment decision (see the discussion of LATE).
Topic IX:
Limited Dependent Variable Models
9.1
Overview
9.1.1 Motivation
Familiarity with microeconometrics techniques has been driven by a large number of factors and
influences. In no particular order…
Microdata availability – Household/individual/ firm level data increasingly available.
Increased computer power.
Avoiding aggregation bias – analysis based on individual data is persuasive.
Explaining behavioural differences across agents.
Distributional issues.
No-one has developed a 'law' in economics that holds with the fit and stability of the laws of, say, physics. Nor is there any agreement on the 'best' way of doing things in economics. For much of the development of econometrics as a discipline since the 1920s, the analysis concentrated on macro analysis using time series data. Analysis of individual agent data in pure micro/tax/poverty/labour studies etc. was far less common, and its growth has led to huge developments in the theory relating to these areas.
The economics journals are full now of applied studies in these fields and the econometric theory
has expanded to deal with the developments and needs of the research community.
It can be
readily seen why – the approach fits more closely with the accepted norms of the theory
underlying the empirical research. Think of aggregate statistics on poverty – per capita income
for example. The first thing we teach people taking courses in elementary statistics is the
unreliable nature of using data on per-capita income to infer anything about the distributional
nature of income in an economy.
There is no way of exploring for example the effects of a
policy change if all we examine is the aggregate poverty measure. In microeconometrics the unit
of analysis is the individual economic agent.
Instead of dealing with the notion of a
‘representative individual’ who conforms to some economic theory in their behaviour we can
look at the actual behaviour of the individual, avoiding the aggregation bias that has befallen
macro econometric analysis for some time.
Hopefully the microeconometrician may have some insight into how the data used was collected.
They may be able to understand and deal with the effects of measurement errors.
Macroeconomic data usually are constructed by government bodies or statistics agencies – they
may not make public the methods used to generate the data.
9.1.2 Problems
Endogenous selection, censoring and individual heterogeneity. Individual behaviour may be very random, and even the most successful method of estimation rarely 'fits' more than 30% of the variability in the sample. This may offset the beneficial effects of the very large samples and the disaggregated approach. The way we incorporate the random disturbances into the model can therefore be crucial, and this topic has formed a large part of the theoretical advances in the literature. We also make much of the claim to be 'explaining individual behaviour', but we cannot deal with altruism, imitation etc. outside of the crude use of dummy variables or some other device.
Robustness to considerations other than the economic ones. As mentioned in the last point, this relates to 'incidentals': for example, the normality assumed for the error terms in a regression is not central to the economic hypothesis being tested, but may be critical to the estimation of the parameters of the model. Assumptions about the stochastic nature of the model are hugely important in this literature because of the large number of non-linearities forced upon the structure by problems such as endogenous selection.
9.1.3 New Developments
Semi- and non-parametric methods.
Difference-in-differences.
Quasi/natural experiments.
Instrumental variables.
Developments in the use of panel data.
9.2 Binary Choice Models
9.2.1 Economic representations under discrete choice
Indivisible goods
Consider where one or more goods in the budget is indivisible and only available in multiples of
some basic unit. This has a big effect where the units are large and costly, in the case of durable
goods.
Recently in the literature on the economics of fertility a similar form of discreteness is
developed. Although children are not a tradable good there is an implicit price in terms of effort
and time – the demand for children is the result of a conscious choice which has an economic
dimension to it.
Discrete qualities of goods
Goods can be heterogeneous – think of these products as comprising bundles of characteristics.
These characteristics enter the utility function. The field that has received the most attention is
mode of transport choice where the competing products are alternative modes of transport for a
particular journey. The characteristics are usually in-vehicle and out-of-vehicle travel time, with
these differing between individuals. By having considered the effect of certain characteristics on
mode choice you can predict the demand of alternatives.
Other models have looked at the
demand for certain electrical products where operating cost is a major factor, choice of housing
location and other models of spatial choice.
Discrete choice as an approximation of continuous choice
The literature does show examples of where some continuous choice issue (like labour supply)
has become over-complicated by the presence of ‘kinks’ in the schedule due to tax, welfare and
legal constraints.
Choice without markets
Applications in the field of public policy formation have looked at voting behaviour in the US Congress, choice of motorway routes and choice of income policy in a macro context.
9.2.2 Discrete choices and econometrics
Many of the decisions we make are discrete, involving a well defined number of choices. Discrete choice models have a long history in applied economics and other social sciences. They have become part of the mainstream toolkit since the 70's and the work of McFadden, following initial work by Quandt and Tobin. Mainstream consumer theory needed to consider issues like the ownership of durables and consumer choices under restrictions in the budget caused by tax and welfare system non-linearities. Firm-level interest was spurred by hiring and firing decisions, plant location and product choice. The textbook presentation of consumer choice shows neat smooth choice paths with substitution and a continuum of choices; the real world presents few choices, with limited substitution.
In traditional regression analysis the mean of the data was the focus of the technique – here this is
of little interest (the mean of a dummy variable is simply the frequency of occurrence). Instead
we are concerned with the probabilities of particular outcomes – as such we need to make some
assumption on the probability distribution of the model to associate a unique probability with any
value taken in the function.
As we know distribution functions are non-linear – the normal
distribution underlies the Probit model and the logistic distribution underlies the Logit model.
A more general class exists whereby the outcome can be multiple – occupational choice or the
choice of transport method would be one example where some multinomial variant is needed,
usually of the logit model. This variant makes strong assumptions on the independence across
possible outcomes, the assumption of independence of irrelevant alternatives (IIA).
When the number of choices extends to more than three or four outcomes, the econometrics becomes intractable. When the choices represent some implicit ordering (Leaving Cert./PLC/Diploma/Degree etc.), an ordered variant is needed. This often reflects the 'grouped' way in which the data are collected.
9.3 Econometric Modelling Framework
9.3.1 Set-up
In these models the dependent variable takes the value of one if a certain option is chosen by individual i or a certain event occurs, and zero otherwise. The probability of the event is usually assumed to be a function of explanatory variables, such as
P_i = \Pr[y_i = 1] = F(x_i'\beta), \qquad (i = 1, \ldots, N) \qquad (1)
where x_i is a k-element vector of exogenous explanatory variables. The conditional expectation of y given x is equal to F(x_i'\beta), which is the probability-weighted sum of the outcomes for y. We will ignore for the moment the possibility of x being determined endogenously. We require F to follow some rules: it must lie in the [0,1] interval and be increasing in x_i'\beta for the connection to a probability statement to make sense. As such, we can specify F as a cumulative distribution function.
Recall for a moment what this means. In the top diagram we observe the
frequency distribution – so for any value of X we observe the frequency with which that value
occurs. The total area under the PDF equals 1. In the bottom diagram we observe the cumulative
frequency – for the value X what we now observe is the frequency that X occurs as well as any
value less than the specific X we choose. Thus the CDF tells us the probability of a value less
than or equal to X occurring.
In the probit model we use the cumulative normal whereas the
logit model uses the logistic distribution. The CDF’s are illustrated below.
[Figure: top panel, the probability density function f(X); bottom panel, the corresponding cumulative distribution function F(X)]
[Figure: Logit and Probit cumulative distributions – the cumulative normal and cumulative logistic plotted against the value of z]

       z     Cumulative Normal     Cumulative Logistic
    -3.0          0.00135                0.0474
    -2.0          0.02275                0.1192
    -1.5          0.066807               0.1824
    -1.0          0.158655               0.2689
    -0.5          0.308538               0.3775
     0.0          0.5                    0.5
     0.5          0.691462               0.6225
     1.0          0.841345               0.7311
     1.5          0.933193               0.8176
     2.0          0.97725                0.8808
     3.0          0.99865                0.9526
9.3.2 Linear Probability Model
We will for the moment examine a model that does not make these distributional assumptions: the linear probability model. We have already used dummy variables as explanatory variables; in this section we apply regression techniques to models where the dependent variable is binary. Suppose, for example, we wish to make predictions about how individuals vote in a referendum; income might influence the vote, so assume that higher-income individuals vote 'Yes'. These models effectively model the likelihood that an individual with a given income will vote 'Yes'. The simplest way to do this is to apply OLS techniques directly and interpret things in terms of probability. This is known as a linear probability model. Consider a simple model where we plan to model the probability of a person voting 'Yes' (the LHS variable Y) in a referendum on divorce as determined by income (X):
Y_i = \alpha + \beta X_i + \varepsilon_i \qquad (2)
In this framework we define Y = 1 if the person votes yes, and zero otherwise; X is a measure of income and \varepsilon is a random variable with mean zero. If we take the expected value of the function (akin to a statistical regression) we obtain
E(Y_i) = \alpha + \beta X_i \qquad (3)
We can model the outcomes in terms of a probability. This is easy, as Y can only take two values (0 or 1), so define the following:
P_i = \Pr(Y_i = 1), \qquad 1 - P_i = \Pr(Y_i = 0)
E(Y_i) = \sum Y_i \Pr(Y_i) = 1(P_i) + 0(1 - P_i) = P_i \qquad (4)
Thus E(Y_i) describes the probability that Y_i = 1, i.e. that the individual votes 'Yes' given information on their income. The slope of the regression line gives the effect on the probability of voting 'Yes' of a unit change in income. We can now say some more about the error terms in the model:
E(\varepsilon_i) = (1 - \alpha - \beta X_i)P_i + (-\alpha - \beta X_i)(1 - P_i) = 0 \qquad (5)
Substituting P_i = \alpha + \beta X_i, the variance of the error term can be expressed as
E(\varepsilon_i^2) = \sigma_i^2 = (1 - \alpha - \beta X_i)^2 P_i + (\alpha + \beta X_i)^2 (1 - P_i) = P_i(1 - P_i) = E(Y_i)\left[1 - E(Y_i)\right] \qquad (6)
This clearly shows that the error term is heteroscedastic. If P_i is close to 0 or close to 1 the variance will be relatively low, but anywhere in between it will be high. The solutions are as in any model with heteroscedasticity:
- Weighted Least Squares, but
- the predicted value of Y is not bounded in the [0,1] interval, so these procedures mean predictions greater than 1 or less than 0 are set to near 1 or near 0, and
- WLS is not efficient in finite samples.
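In practice a common alternative is simply to estimate the linear probability model by OLS with heteroskedasticity-robust standard errors. A minimal sketch in Stata with hypothetical variable names (vote equal to 1 for a 'Yes' vote, income the regressor):

reg vote income, robust      // OLS with heteroskedasticity-robust standard errors
predict phat                 // fitted 'probabilities'
summarize phat               // some fitted values may fall outside the [0,1] interval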
The problem of the predictions not being bounded in the [0,1] interval can be illustrated in the following diagram.
[Figure: fitted linear probability line crossing Y = 0 and Y = 1 as X varies]
One interesting example comes from examining defaults on bond payments. Default can be a problem, and it would be useful to consider a model that 'explains' some of the probability of these defaults occurring. In the US, local government authorities can issue bonds. A cross section of 35 Massachusetts communities was used in a linear probability model. The results were
Y = 1.96 - 0.029 Tax - 4.86 Int + 0.063 AV + 0.007 Dav - 0.48 Welf,
where Y = 0 if the municipality defaulted and 1 otherwise (so the probability of default decreases as we move from 0 to 1). Tax is the average tax rate, Int is the percentage of the budget allocated to interest payments, AV is the % growth in property values, Dav the ratio of debt to assessed valuation, and Welf the % of the budget allocated to charity, pensions etc. The coefficient on the tax rate is negative, so an increase in the tax rate of $1 per $1000 will raise the probability of default by 0.029. Higher budget shares for interest payments also appear to be associated with a higher default probability. Conversely, growth in the assessed valuation of property lowers the probability of default, as the tax base is growing. Finally, a higher ratio of debt to assessed valuation is also associated with less default, which seems counter-intuitive.
9.3.3 The Logit and Probit Models
Using a variable y* as an underlying response variable which is unobservable, we may write the relationship of interest as
y^* = x'\beta + u \qquad (7)
What we actually observe is the following:
y = 1 \text{ if } y^* > 0, \qquad y = 0 \text{ otherwise} \qquad (8)
In this formulation we no longer think of x'\beta as being the conditional expectation of y given the values of x; rather, it is the conditional expectation of y* given the values of x. Thus
\Pr(y = 1) = \Pr(u > -x'\beta) = 1 - F(-x'\beta) \qquad (9)
where F is the CDF of u. The observed values of y are realisations of a binomial process with the probabilities given in (9), varying from trial to trial. It is natural to consider Maximum Likelihood as an estimation method, since this maximises the joint distribution of the outcomes (the y's) conditional on x as a function of the parameters. The likelihood function can be written as
L = \prod_{y=0} F(-x'\beta)\; \prod_{y=1} \left[1 - F(-x'\beta)\right] \qquad (10)
We specify the functional form of F in (10) on the assumption that u in (7) is distributed logistically or normally: taking F to be the standard normal distribution function results in the Probit model, while taking F to be the logistic distribution gives the Logit model.
9.4 Binary Response Models
We believe that the binary decision is influenced by a set of factors gathered in a vector x, so that
\Pr(Y = 1) = F(\beta'x), \qquad \Pr(Y = 0) = 1 - F(\beta'x)
The belief is that the parameters in \beta reflect the impact of changes in x on the probability.
F(x, \beta) = \beta'x \quad \text{-- Linear Probability}
F(x, \beta) = \Phi(\beta'x) = \int_{-\infty}^{\beta'x} \phi(t)\,dt \quad \text{-- Probit}
F(x, \beta) = \Lambda(\beta'x) = \frac{e^{\beta'x}}{1 + e^{\beta'x}} \quad \text{-- Logit}
In principle any continuous distribution over the real line will suffice, as long as the probability tends to one (zero) in the limit as the index tends to infinity (minus infinity). In the probit model we use the normal distribution (whose CDF is denoted by \Phi), while in the logit model we use the logistic (with CDF denoted by \Lambda). Which do we use? In the graph below we observe how the logistic has heavier tails than the normal distribution over the same range. For intermediate values of z (= \beta'x), such as between -1 and +1, the probabilities are similar. The probability of y = 0 is higher in the logistic when \beta'x is small, and lower when \beta'x is very large. In principle we expect few differences in the predictions of the models; if our sample has very few ones (or zeros), or has one variable exerting a very large influence, we might expect differences.
[Figure: Logit and Probit cumulative distributions plotted against the value of z, as above]
The probability model is a regression, i.e. E(Y) = 0[1 - F(\beta'x)] + 1[F(\beta'x)] = F(\beta'x). Whatever distribution is used, the parameters of the model are not the 'slopes' or marginal effects that we are used to considering in regression models. More generally, we see that
\frac{\partial E(Y)}{\partial x} = \left[\frac{dF(\beta'x)}{d(\beta'x)}\right]\beta = f(\beta'x)\,\beta
where f(.) is the PDF that corresponds to the cumulative distribution F(.). In the case of the normal distribution we therefore observe that
\frac{\partial E[y]}{\partial x} = \phi(\beta'x)\,\beta.
In the case of the logit model we obtain
\frac{d\Lambda(\beta'x)}{d(\beta'x)} = \frac{e^{\beta'x}}{(1 + e^{\beta'x})^2} = \Lambda(\beta'x)\left(1 - \Lambda(\beta'x)\right)
so
\frac{\partial E[y]}{\partial x} = \Lambda(\beta'x)\left(1 - \Lambda(\beta'x)\right)\beta
Thus we see that the marginal effects vary with the values of x; standard practice is to evaluate them at the mean values of the regressors. Consider the following regression: the dependent variable is an indicator of whether grades improve after exposure to PSI, a new teaching practice in economics. GPA is the grade point average, TUCE the score from a pre-test of knowledge, and PSI a dummy variable equal to one if the student was exposed to PSI. Estimation takes place by maximum likelihood using the logistic and normal distributions. Note that comparing the coefficients gives very different results – clearly, therefore, the procedures are different. However, look at the computed slopes – they are practically identical.
Variable        Logit coef.   Logit slope   Probit coef.   Probit slope
Constant          -13.021          --           -7.452          --
GPA                 2.826        0.534           1.626         0.533
TUCE                0.095        0.018           0.052         0.017
PSI                 2.379        0.449           1.426         0.468
f(beta'x)                        0.189                         0.328
Using the coefficients, consider the following. We are interested in the impact of PSI on grades. We can compute the following at the sample means of the x variables for the probit specification:
PSI = 0: \Pr(GRADE = 1) = \Phi(-7.45 + 1.62\,GPA + 0.052(21.938))
PSI = 1: \Pr(GRADE = 1) = \Phi(-7.45 + 1.62\,GPA + 0.052(21.938) + 1.426)
[Figure: Pr(Grade = 1) plotted against GPA, with and without PSI]
The probability that a student’s grade increases after exposure to PSI is far greater for students
with high GPA’s than low GPA’s. At the sample mean GPA of 3.117 the effect of PSI is 0.465,
very close to the marginal effect reported in the table.
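A minimal sketch of how such estimates and marginal effects can be obtained in Stata, assuming a dataset with variables named grade, gpa, tuce and psi as in the table above (mfx evaluates the marginal effects at the means of the regressors; newer versions of Stata use the margins command instead):

probit grade gpa tuce psi
mfx compute               // marginal effects at the sample means
logit grade gpa tuce psi
mfx compute               // compare: the slopes should be very similar to the probit ones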
9.5 Multinomial Models
9.5.1 Unordered Choices
The i-th consumer is faced with J choices, with the utility of choice j represented by
U_{ij} = \beta'z_{ij} + \varepsilon_{ij}
Choice j is made if \Pr(U_{ij} > U_{ik}) \; \forall k \neq j.
As we would need to evaluate multiple integrals, the probit is very complicated to use in this context; however, the logit model can be extended. If Y represents the choice made, and if the J disturbances are independent and identically distributed, then
\Pr(Y_i = j) = \frac{\exp(\beta_j'x_i)}{1 + \exp(\beta_1'x_i) + \exp(\beta_2'x_i)}, \quad j = 1, 2
\Pr(Y_i = j) = \frac{\exp(\beta_j'x_i)}{1 + \sum_{k=1}^{J}\exp(\beta_k'x_i)}, \quad \text{generally}
The binomial logit is the special case where J = 1. The model implies that we can compute the log-odds ratio for the J choices as
\log\left[\frac{\Pr(Y_i = j)}{\Pr(Y_i = 0)}\right] = \beta_j'x_i.
By assumption, the odds ratios for each choice are independent of the other alternatives: the independence of irrelevant alternatives (IIA) assumption.
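A minimal sketch in Stata with hypothetical variable names (mode an unordered choice coded 0, 1, ..., J; x1 and x2 characteristics of the individual):

mlogit mode x1 x2
* coefficients are reported relative to a base outcome; each equation gives the
* log-odds of that alternative against the base category (the IIA assumption is maintained)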
9.5.2 Ordered Choices
Consider bond ratings, taste tests, opinion surveys, education choices etc. All of these are inherently ordered. A multinomial logit would not take the implicit ordering into account in any direct manner, while treating the ordering as a linear regression assumes that the move from one position to another is identical in importance (overstating the ordinal property). Ordered probit/logit models are useful in this regard and are built around the latent regression format of the binomial model:
y^* = \beta'x + \varepsilon
As y* is unobserved, we actually observe
y = 0 \text{ if } y^* \leq 0
y = 1 \text{ if } 0 < y^* \leq \mu_1
y = 2 \text{ if } \mu_1 < y^* \leq \mu_2
\ldots
y = J \text{ if } \mu_{J-1} \leq y^*
Thus
\Pr(y = 0) = \Phi(-\beta'x)
\Pr(y = 1) = \Phi(\mu_1 - \beta'x) - \Phi(-\beta'x)
\Pr(y = 2) = \Phi(\mu_2 - \beta'x) - \Phi(\mu_1 - \beta'x)
\ldots
\Pr(y = J) = 1 - \Phi(\mu_{J-1} - \beta'x)
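A minimal sketch in Stata with hypothetical variable names (rating an ordered outcome coded 0, 1, ..., J):

oprobit rating x1 x2     // ordered probit
ologit  rating x1 x2     // ordered logit
* the reported cut points correspond to the thresholds mu_1, ..., mu_(J-1) above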
9.6 Censored Data and the Tobit Model
Censored data occur with great regularity in applied economics. If we have a sample that includes non-workers, then hours of work will be censored: values in a certain range are all transformed to (or reported as) a single value. This is very common – household purchases of durables, extramarital affairs, female hours worked, arrests after prison release, expenditure on certain commodity groups – all have been studied in the context of variants of this literature, where the dependent variable is typically zero for a fraction of the observations, corresponding to non-workers, non-purchasers or some other group. Although censoring is common in each of these examples, the way in which we deal with it is not always the same.
Typically we begin by specifying a latent regression model which describes the behaviour in the absence of censoring. So we may describe the latent dependent variable y* according to
y_i^* = x_i'\beta + u_i, \qquad u_i \sim N(0, \sigma^2)
where x is a k-vector of exogenous explanatory variables and \beta is the parameter vector.
The different types of censoring models can be distinguished by the observability rule on y*. The simplest is the Tobit rule:
y_i = y_i^* \text{ if } y_i^* > 0, \qquad y_i = 0 \text{ if } y_i^* \leq 0
In this case simply y* is observed when x’ + u exceeds zero.
If this was a model of desired
consumption then individual i is observed to buy when desired consumption of the good is
positive.
In micro terms this is a corner solution (although this is not necessarily the case).
Estimation of β must account for the censored nature of the data. OLS regression would be biased towards zero for the slope parameters, and would understate the impact of elements of the x vector: in essence the regression line is being pulled towards the horizontal axis. Eliminating the zero values and estimating on the remaining truncated sample will not get rid of this bias. If we order the data so that the first N1 observations refer to individuals for whom y is positive, the OLS estimator gives

β̂ = β + (Σ_{i=1}^{N1} x_i x_i')^(−1) Σ_{i=1}^{N1} x_i u_i

Thus the bias in OLS (assuming x is exogenous) is given by

E(β̂) = β + (Σ_{i=1}^{N1} x_i x_i')^(−1) Σ_{i=1}^{N1} x_i E(u_i | y_i > 0)

If we evaluate this further we observe that

E(u_i | y_i > 0) = E(u_i | u_i > −x_i'β)
                 = σ E(ν_i | ν_i > −x_i'β/σ)   where ν_i = u_i/σ
                 = σ [ φ(z_i) / (1 − Φ(z_i)) ]

where z_i = −x_i'β/σ and φ and Φ are the standard normal density and distribution functions respectively. This ratio is the mean of the standard normal truncated from below at z. It is often termed the hazard or inverse Mill's ratio, since it measures the conditional expectation for an individual i with characteristics x who remains in the sample after truncation. Typically the IMR is referred to as λ, so E(u_i | y_i > 0) = σλ_i.
The OLS bias therefore depends on the extent to which λ differs from zero. In order to avoid these biases, direct use of OLS has been dropped in favour of a complicated but tractable maximum likelihood procedure.
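In Stata the Tobit MLE is obtained with the tobit command; a minimal sketch with hypothetical variable names, assuming censoring from below at zero:

* hours is 0 for non-workers and positive otherwise; ll(0) declares left-censoring at zero
tobit hours educ age kids, ll(0)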
Chapter 7: Introduction to time series

So far we have been modelling cross sections, where individuals were observed only once. Time series analysis is the analysis of an economic series over time. It allows us to answer questions such as: what is the causal effect on a variable Y of a change in variable X over time, i.e. what is the dynamic causal effect on Y of a change in X? It also allows you to forecast the value of some variable at a future date. Forecasting models do not need to be causal: seeing individuals carrying umbrellas makes you forecast rain, but umbrellas do not cause rain. Regression models can produce reliable forecasts even if their coefficients have no causal interpretation.
An important difference between time series and cross sections is that in a time series the ordering of the observations matters.
Definition: A sequence of random variables indexed by time is called a stochastic process (stochastic means random), or a time series for mere mortals. A data set is one possible outcome (realisation) of the stochastic process; if history had been different, we would observe a different outcome, so we can think of a time series as the outcome of a random variable.
7.1 Introduction to time series and serial correlation

[Figure: quarterly U.S. inflation rate and unemployment rate (LHUR), 1959:1 to 1999. Consumer Price Index and unemployment from 1960-1999.]
Rather than dealing with individual observations, the unit of interest is time: the value of Y at date t is Yt. The unit of time can be anything from 1 second to 1 year.
The value of Y in the previous period is called the first lag: Yt−1. The jth lag is denoted Yt−j. Similarly, Yt+1 is the value of Y in the next period.
The change in the value of Y between period t−1 and t is called the first difference:

ΔYt = Yt − Yt−1

Changes in the value of Y rather than Y itself are often used because economic series tend to have a trend, say an increase over time. If 2 series are trending, we could falsely conclude that one is causing the other (spurious regression).
A series has a linear trend if:

Yt = β0 + β1 t + et

Trends can also be exponential; this is modelled as a linear trend of the log of the series:

ln(Yt) = β0 + β1 t + et

Numerous time series are analysed after computing their logarithms or the change in logarithms, Δln(Yt) = ln(Yt) − ln(Yt−1), since they tend to exhibit growth that is approximately exponential: over the long run the series grow by a certain percentage every year, so the log of the series grows in a linear way. Also, and more technically, the standard errors of such series are proportional to their level, so the standard error of a log series is approximately constant.
To deal with trended series, it is possible to difference the series or to include a trend in the regression. We will have to be careful not to get carried away when including trends in the regression, as a polynomial trend of high order will track any series pretty well, but offer little help in finding explanatory variables affecting Yt.
Seasonality introduces patterns in the data: for example, new car models are introduced in the same month every year, so for that month we observe higher sales every year. Another example is the retail sector, where sales are expected to be higher in the run-up to Christmas. Including a set of dummies for quarters (or months) will account for the seasonality of the dependent or independent variables (remember to leave out one quarter/month, to avoid perfect multicollinearity). A minimal Stata sketch of this setup follows.
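A sketch, assuming a Stata quarterly date variable tq and a series y (both hypothetical):

tsset tq                         // tq assumed to be a quarterly (%tq) date variable
gen dy  = D.y                    // first difference
gen y_1 = L.y                    // first lag
gen lny = ln(y)                  // logs, so D.lny is the (approximate) growth rate
gen q = quarter(dofq(tq))        // quarter of the year, 1 to 4
tab q, gen(qd)                   // quarterly dummies qd1-qd4
reg y qd2 qd3 qd4                // leave one quarter out to avoid perfect multicollinearity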
If we want to understand the relationship between 2 or more variables over time, we need to assume some sort of stability over time. It is best to deal with processes that are stationary.
A stationary stochastic process has the same joint distribution for (Xt, ..., Xt+k) and (Xt+j, ..., Xt+k+j).
A trending series is then obviously nonstationary, since its mean varies over time.
Quarter    CPI        Annual rate of inflation   First lag   Change in inflation
1999:01    164.8667   1.6
1999:02    166.0333   2.8                        1.6         1.2
1999:03    167.2      2.8                        2.8         0
1999:04    168.5333   3.2                        2.8         0.4
2000:01    170.2667   4.1                        3.2         0.9

From the first to the second quarter of 1999, the CPI increased from 164.87 to 166.03, a percentage increase of 0.704%, or an annual rate of 0.704*4 = 2.8%.
This percentage change can also be computed using the difference of the logs: ln(166.03) − ln(164.87) = 0.007.
In time series, the value of Y in one period is typically correlated with its value in the next period; this is called serial correlation or autocorrelation.
The 1st autocorrelation is the correlation between Yt and Yt−1.
The 2nd autocorrelation is the correlation between Yt and Yt−2.
The jth autocorrelation is the correlation between Yt and Yt−j.
Similarly, the jth autocovariance is the covariance between Yt and Yt−j:

cov(Yt, Yt−j) = [1/(T − j − 1)] Σ_{t=j+1}^{T} (Yt − Ȳ_{j+1,T})(Yt−j − Ȳ_{1,T−j})      (7.1)

and the jth autocorrelation is:

ρ̂_j = cov(Yt, Yt−j) / var(Yt)      (7.2)

where Ȳ_{j+1,T} denotes the sample average of Yt computed over the observations t = j+1, ..., T.
Assuming that the series is stationary, the variance of Yt is equal to the variance of Yt−1.
Say we are interested in the 1st order autocorrelation: j = 1.
Worked example for the annualised inflation rate over 1999:1 to 2000:1, with Ȳ(2,5) = 3.236242, Ȳ(1,4) = 2.61377 and full-sample mean Ȳ(1,5) = 2.913801:

Quarter   Yt          Yt − Ȳ(2,5)   Yt − Ȳ(1,4)   (Yt − Ȳ(2,5))(Yt−1 − Ȳ(1,4))   Yt − Ȳ(1,5)   (Yt − Ȳ(1,5))²
1999:1    1.624036    −1.61221      −0.98973                                     −1.28976      1.663494
1999:2    2.83057     −0.40567       0.2168        0.401507                      −0.08323      0.006927
1999:3    2.810681    −0.42556       0.196911     −0.09226                       −0.10312      0.010634
1999:4    3.189793    −0.04645       0.576023     −0.00915                        0.275992     0.076172
2000:1    4.113924     0.877682      1.500154      0.505565                       1.200123     1.440296

cov = 0.268555 and var y = 0.639504, so the estimated 1st autocorrelation is corr = 0.268555/0.639504 = 0.419942.
Over the full period, the autocorrelations are:
Lag   Inflation rate   Change of inflation rate
1     0.85             −0.24
2     0.77              0.27
3     0.77              0.32
4     0.68             −0.06

Inflation is strongly positively autocorrelated, and the autocorrelation decreases as the lag increases; this reflects the long-term trend of inflation. Changes in inflation are negatively autocorrelated.
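In Stata, once the data are tsset, sample autocorrelations like those in the table can be obtained with corrgram (a sketch, assuming the inflation series is called inf):

corrgram inf, lags(4)            // autocorrelations of the inflation rate
corrgram D.inf, lags(4)          // autocorrelations of the change in inflation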
7.3 Autoregression
This is a model that relates a variable to its past values.
7.3.1 1st order autoregressive model, AR(1)
If you want to predict the future of a time series, a good place to start is the immediate past. An AR(1) process takes the following form:

Yt = β0 + β1 Yt−1 + ut

For an AR(1) to be stationary, |β1| < 1.
The covariance between yt and yt+h can easily be calculated:

y_{t+h} = β1 y_{t+h−1} + e_{t+h}
        = β1 (β1 y_{t+h−2} + e_{t+h−1}) + e_{t+h}
        = β1² y_{t+h−2} + β1 e_{t+h−1} + e_{t+h}
        = ...
        = β1^h y_t + β1^(h−1) e_{t+1} + ... + β1 e_{t+h−1} + e_{t+h}

Premultiplying by yt and taking expectations, we have:

cov(y_t, y_{t+h}) = E(y_t y_{t+h}) = β1^h E(y_t²) + β1^(h−1) E(y_t e_{t+1}) + ... + E(y_t e_{t+h}) = β1^h E(y_t²) = β1^h σ_y²

since e_{t+j} is uncorrelated with y_t. The correlation between yt and yt+h is then corr(y_t, y_{t+h}) = β1^h.
While all observations of yt are correlated, this correlation gets really small as h increases.
A systematic way to forecast the change in inflation, ΔInf_t, is to use the previous quarter's change, ΔInf_{t−1}. Using OLS and data from 1962-99, we get:
. reg dinf dinf_1 if daten>=19621 & daten<=19994, robust
Regression with robust standard errors      Number of obs = 152, F(1, 150) = 3.89, Prob > F = 0.0504, R-squared = 0.0442, Root MSE = 1.6866

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |  -.2100019          .1064831    -1.97     0.050     -.4204024    .0003986
       _cons |   .0188995          .1370652     0.14     0.891     -.2519284    .2897274
Changes in inflation are negatively related to the change in the previous quarter.
More generally, an AR(1) model has the following form:

Yt = β0 + β1 Yt−1 + ut

Ŷ_{t|t−1} = β̂0 + β̂1 Yt−1 is the forecast for period t made at period t−1. The forecast error is then simply:

fe = Yt − Ŷ_{t|t−1}
 Forecast and predicted value
- The forecast is not an OLS predicted value: OLS predictions are calculated for the observations in the sample. In contrast, a forecast is made for dates beyond the data set used.
- The OLS residual is the difference between the actual and predicted value, whilst the forecast error is the difference between the future value of Y and its forecast.
Root mean squared forecast error
The root mean squared forecast error (RMSFE) is a measure of the size of the forecast error:

RMSFE = √E[(Yt − Ŷ_{t|t−1})²]

The RMSFE has 2 sources of error:
- error arising because the future value of the error term (ut) is unknown;
- error in estimating β0 and β1.
If the first source is larger than the second, then RMSFE ≈ √var(ut), which can be estimated by the standard error of the regression (SER):

SER = s_û,   where   s_û² = SSR/(n − k − 1) = [1/(n − k − 1)] Σ_{t=1}^{T} û_t²   and   SSR = Σ_{t=1}^{T} û_t²
Using our data on inflation for the period 1962-99, what would be our forecast of inflation in 2000:I? The inflation change between 1999:III and 1999:IV was 3.2 − 2.8 = 0.4. Using our previously estimated values:

Δinf_{2000:I} = 0.02 − 0.211 × Δinf_{1999:IV} = 0.02 − 0.211 × 0.4 ≈ −0.06

Our prediction of inflation for 2000:I is thus 3.2 − 0.06 ≈ 3.1.
In fact inflation for that quarter was 4.1%, so we made a large error. This is not surprising since the R² of this model is 0.04, so the lagged change in inflation explains very little of the current change in inflation. The standard error of the regression is 1.67, so ignoring the uncertainty arising from the estimation of the coefficients, RMSFE = 1.67 percentage points.
* The concept of stationarity
Time series use values of the past to predict the future. If the future is like the past, then these historical relationships can be used to forecast the future. But if the future differs fundamentally from the past, those historical relationships might not be reliable guides to the future. To make reliable forecasts, time series need to be stationary.
A time series Yt is stationary if its probability distribution does not change over time: the joint distribution of (Y_{s+1}, ..., Y_{s+T}) does not depend on s.
Two time series are said to be jointly stationary if the joint distribution of (Y_{s+1}, ..., Y_{s+T}, X_{s+1}, ..., X_{s+T}) does not depend on s.
7.3.2 pth order autoregressive model, AR(p)
The more distant past may independently affect the current value of a variable. One way to incorporate this information is to include additional lags in the AR(1) model.
An AR(p) model represents Yt as a linear function of its p lagged values:

Yt = β0 + β1 Yt−1 + β2 Yt−2 + ... + βp Yt−p + ut      (7.3)

Under the assumption that E(ut | Yt−1, ..., Yt−p) = 0, the best forecast of Yt is obtained using the p lagged values of Yt: any additional lag contains no additional information.
Regression with robust standard errors      Number of obs = 152, F(4, 147) = 6.93, Prob > F = 0.0000, R-squared = 0.2086, Root MSE = 1.5502

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |   -.205024          .0992921    -2.06     0.041     -.4012484   -.0087997
      dinf_2 |  -.3159607          .0870167    -3.63     0.000     -.4879261   -.1439954
      dinf_3 |   .1977129          .0844289     2.34     0.021      .0308616    .3645642
      dinf_4 |  -.0358524          .0999216    -0.36     0.720     -.2333207     .161616
       _cons |   .0238132          .1256191     0.19     0.850     -.2244394    .2720657

. testparm dinf_2-dinf_4
 ( 1) dinf_2 = 0.0   ( 2) dinf_3 = 0.0   ( 3) dinf_4 = 0.0
       F(3, 147) = 6.54,  Prob > F = 0.0004
The F-test of the joint significance of the last 3 lags rejects the null that they are all zero. Note the improvement in the R² and SER. The forecast for inflation in 2000:I is now 3.4%.
7.4 Time series with additional predictors
Other variables and their lags can be added to improve the prediction.
A high value of unemployment tends to be associated with a future decline in the rate of inflation; this is the short-run Phillips curve (corr = −.40).

[Figure: scatter of dinf against lagged unemployment (unemp_1), with fitted values.]
Including the first lag of unemployment in our previous model of inflation, we get:
Regression with robust standard errors      Number of obs = 152, F(5, 146) = 6.26, Prob > F = 0.0000, R-squared = 0.2421, Root MSE = 1.5223

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |  -.2614728          .0933617    -2.80     0.006     -.4459877   -.0769578
      dinf_2 |  -.3950265          .0976681    -4.04     0.000     -.5880523   -.2020006
      dinf_3 |   .1174997           .084659     1.39     0.167     -.0498158    .2848152
      dinf_4 |  -.0922981          .0965601    -0.96     0.341     -.2831343    .0985382
     unemp_1 |   -.233865          .1004808    -2.33     0.021     -.4324498   -.0352803
       _cons |   1.433162          .5549381     2.58     0.011      .3364128    2.529912
Despite the significance of unemployment, the R² only marginally improves. Our forecast of inflation for 2000:I is now 3.7.
We can add more lags of unemployment.
The autoregressive distributed lag model, ADL(p,q), has the following form:

Yt = β0 + β1 Yt−1 + ... + βp Yt−p + δ1 Xt−1 + δ2 Xt−2 + ... + δq Xt−q + ut      (7.4)

The assumption that E(ut | Yt−1, Yt−2, ..., Xt−1, Xt−2, ...) = 0 implies that no additional lags of either Y or X belong in the ADL model (they would not be significant); p and q are the true lag orders of the model. Additional variables and their lags can also be added.
The F-statistic on the 4 lagged values of unemployment rejects that all the coefficients are nil, so they are jointly significant, as is the F-test on the significance of the 2nd to 4th lags. The R² and SER improve. Our forecast of inflation is now 3.7%.
Regression with robust standard errors      Number of obs = 152, F(8, 143) = 8.03, Prob > F = 0.0000, R-squared = 0.3811, Root MSE = 1.3899

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |  -.3607111          .0921791    -3.91     0.000     -.5429208   -.1785014
      dinf_2 |  -.3412813          .1007598    -3.39     0.001     -.5404524   -.1421102
      dinf_3 |   .0771398           .085122     0.91     0.366     -.0911203    .2453998
      dinf_4 |  -.0334765          .0873061    -0.38     0.702     -.2060537    .1391007
     unemp_1 |   -2.72218          .4812139    -5.66     0.000     -3.673391   -1.770968
     unemp_2 |   3.478651          .9014268     3.86     0.000      1.696808    5.260495
     unemp_3 |  -1.044318           .902434    -1.16     0.249     -2.828153    .7395158
     unemp_4 |   .0667446           .448793     0.15     0.882      -.820381    .9538702
       _cons |   1.331172          .4768418     2.79     0.006      .3886021    2.273741

. testparm unemp_1-unemp_4
 ( 1) unemp_1 = 0.0   ( 2) unemp_2 = 0.0   ( 3) unemp_3 = 0.0   ( 4) unemp_4 = 0.0
       F(4, 143) = 8.41,  Prob > F = 0.0000

. testparm unemp_2-unemp_4
 ( 1) unemp_2 = 0.0   ( 2) unemp_3 = 0.0   ( 3) unemp_4 = 0.0
       F(3, 143) = 10.12,  Prob > F = 0.0000
It is also possible to add more covariates, but the notation gets a bit complicated. To simplify it, let's introduce the lag notation.
The lag operator transforms a variable into its lag:

L Yt = Yt−1

By applying the lag operator twice, we obtain the second lag: L(L Yt) = L(Yt−1) = Yt−2 = L² Yt. This can be generalised to the jth lag, L^j Yt = Yt−j.
The lag polynomial is:

a(L) = a0 + a1 L + a2 L² + ... + ap L^p = Σ_{j=0}^{p} aj L^j

Multiplying Yt by a(L) yields:

a(L) Yt = (Σ_{j=0}^{p} aj L^j) Yt = Σ_{j=0}^{p} aj L^j Yt = Σ_{j=0}^{p} aj Yt−j = a0 Yt + a1 Yt−1 + ... + ap Yt−p

Thus, an AR(p) model can be written as a(L) Yt = β0 + ut, and similarly an ADL(p,q) model is a(L) Yt = β0 + c(L) Xt−1 + ut.
If there is more than one additional covariate:

a(L) Yt = β0 + c1(L) X1,t−1 + c2(L) X2,t−1 + ut      (7.5)

The number of lags can be different for each regressor, say q1 for X1 and q2 for X2.
To estimate (7.5) we need the following assumptions:
- E(ut | Yt−1, ..., X1,t−1, ..., Xk,t−1, ...) = 0
- (Yt, X1t, ..., Xkt) have a stationary distribution
- (Yt, X1t, ..., Xkt) and (Yt−j, X1,t−j, ..., Xk,t−j) become independent as j gets large. This is also referred to as weak dependence; it ensures that in large samples there is sufficient randomness in the data for the law of large numbers and the central limit theorem to hold
- (Yt, X1t, ..., Xkt) have nonzero, finite 4th moments
- There is no perfect multicollinearity
7.5 Statistical inference, Granger causality test
* Granger causality
The Granger causality test is the F-test of the hypothesis that the coefficients on all the lagged values of one variable in (7.5) are zero. This null implies that these regressors have no predictive power for Yt (this variable does not "cause" Yt); Granger (1969). Causality here means only that this variable allows us to make a better forecast of Yt; it carries no connotation of true causation. Thus, it is usually referred to as Granger causality.
Looking back at our last regression, the F-test that all the coefficients on unemployment are null yields F = 8.41, so at the 1% level we reject that null: unemployment Granger-causes the change in inflation. Past values of unemployment contain information that is useful for forecasting changes in the inflation rate, beyond that contained in past values of the inflation rate.
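The Granger causality test above is simply an F-test on the unemployment lags; a sketch of the commands, assuming the lag variables have been generated as in the output shown earlier:

reg dinf dinf_1-dinf_4 unemp_1-unemp_4, robust
testparm unemp_1-unemp_4          // H0: unemployment does not Granger-cause dinf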
* Forecast uncertainty
The forecast error consists of 2 components: uncertainty about the regression coefficients and uncertainty due to ut. In the case of an ADL(1,1) process, the forecast error is:

Y_{T+1} − Ŷ_{T+1} = u_{T+1} − [ (β̂0 − β0) + (β̂1 − β1) Y_T + (δ̂1 − δ1) X_T ]      (7.6)

Because u_{T+1} has conditional mean zero and is homoskedastic, var(u_{T+1}) = σ_u², and u_{T+1} is uncorrelated with the expression in brackets. The mean squared forecast error is thus:

MSFE = E[(Y_{T+1} − Ŷ_{T+1})²] = σ_u² + var[ (β̂0 − β0) + (β̂1 − β1) Y_T + (δ̂1 − δ1) X_T ]

To compute a forecast interval, it is convenient to assume that u_{T+1} is normally distributed. Then (7.6) and the CLT imply that the forecast error is the sum of 2 independent normally distributed terms, so the forecast error is itself normally distributed with variance equal to the MSFE.
A 95% forecast interval is then given by:

Ŷ_{T+1|T} ± 1.96 SE(Y_{T+1} − Ŷ_{T+1|T})
7.6 Determining the order of an autoregression
More lags mean more information is used, but at the cost of additional estimation uncertainty (estimating too many coefficients).
- F-statistics
Start with a model with a large number of lags and test whether the coefficient on the last lag is significant; if not, reduce the number of lags and start the process again. When the true order of the model is p, this test will still estimate the order to be greater than p, 5% of the time.
- Information criteria
Information criteria trade off the improvement in the fit of the model against the number of estimated coefficients. The most popular are the Bayes Information Criterion (BIC), also called the Schwarz information criterion, and the Akaike Information Criterion (AIC):

BIC(p) = ln[SSR(p)/T] + (p + 1) ln(T)/T ,      AIC(p) = ln[SSR(p)/T] + (p + 1) 2/T

You choose the model minimising the information criterion. The difference between the AIC and the BIC is that the ln T term in the BIC is replaced by 2 in the AIC, so the second term (the penalty for the number of lags) is not as large: a smaller decrease in the SSR is needed with the AIC to justify including an additional regressor. Even in large samples, the AIC will overestimate p with non-zero probability.
Similarly, the optimal number of lags of the additional regressors needs to be estimated. The same methods can be used; if the regression has K coefficients including the intercept, then the BIC is:

BIC(K) = ln[SSR(K)/T] + K ln(T)/T

Important: all models should be estimated using the same sample, so make sure to start with the model with the most lags, and keep this as your working sample for the comparison. In practice a convenient shortcut is to impose that all the regressors have the same number of lags, to reduce the number of models that need comparing. A minimal sketch of this comparison in Stata is given below.
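A sketch of the BIC comparison in Stata, assuming the lag variables dinf_1-dinf_4 already exist; the if condition fixes the estimation sample at the observations where all four lags are available:

forvalues p = 1/4 {
    local rhs
    forvalues j = 1/`p' {
        local rhs `rhs' dinf_`j'
    }
    quietly reg dinf `rhs' if dinf_4 < .
    display "p = `p'  BIC = " ln(e(rss)/e(N)) + (`p' + 1)*ln(e(N))/e(N)
}

The model with the smallest BIC is retained; replacing ln(e(N)) by 2 in the penalty gives the AIC.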
7.7 Nonstationarity I: trends
If the dependent variable and/or regressors are nonstationary, then hypothesis testing and forecasts will be unreliable. There are two common types of nonstationarity, each with its own solutions.
7.7.1 What is a trend?
A trend is a persistent long-term movement of a variable over time.
- A deterministic trend is a non-random function of time (linear in time, for example).
- A stochastic trend is random and varies over time. For example, a stochastic trend in inflation may exhibit a prolonged period of increase followed by a period of decrease. Since economic series are the consequences of complex economic forces, trends are usefully thought of as having a large unpredictable, random component.
The random walk model of a trend
The simplest model of a variable with a stochastic trend is the random walk. A time series Yt is said to follow a random walk if the change in Yt is iid:

Yt = Yt−1 + ut

where ut has conditional mean zero: E(ut | Yt−1, Yt−2, ...) = 0.
The basic idea of a random walk is that the value of the series tomorrow is its value today plus an unpredictable component.
When a series has a tendency to increase or decrease, the random walk can include a drift component:

Yt = β0 + Yt−1 + ut = tβ0 + Σ_{j=0}^{t} u_j      (taking Y0 = 0)

If Yt follows a random walk then it is not stationary: the variance of the random walk increases over time, so the distribution of Yt changes over time.

var(Yt) = var(Yt−1) + var(ut)

For Yt to be stationary, we must have var(Yt) = var(Yt−1), which imposes var(ut) = 0.
Alternatively, say Y0 = 0; then Y1 = u1, Y2 = u1 + u2 and, more generally, Yt = u1 + u2 + ... + ut. Because the ut are uncorrelated, var(Yt) = tσu².
The variance of Yt depends on t and increases as t increases. Because the variance of a random walk increases without bound, its population autocorrelations are not defined.
The random walk can be seen as the special case of the AR(1) model in which β1 = 1; Yt then contains a stochastic trend (and, with drift, a deterministic trend as well). If |β1| < 1, then Yt is stationary as long as ut is stationary.
For an AR(p) to be stationary, the roots of the following polynomial must all be greater than 1 in absolute value. The roots are found by solving:

1 − β1 z − β2 z² − ... − βp z^p = 0

In the special case of an AR(1), the polynomial is simply 1 − β1 z = 0, so z = 1/β1, and the condition that the root is greater than unity is equivalent to |β1| < 1.
If an AR(p) has a root equal to one, the series is said to have a unit (autoregressive) root. If Yt has a unit root, it contains a stochastic trend and is not stationary (the two terms can be used interchangeably).
If a series has a unit root, the estimator of the autoregressive coefficient in an AR(p) is biased towards 0, t-stats have a non-normal distribution, and two independent series may appear related.
1) Bias towards 0
Suppose that the true model is a random walk (Yt = Yt−1 + ut) but the econometrician estimates an AR(1) (Yt = β1 Yt−1 + ut). Since the series is nonstationary, the OLS assumptions are not satisfied and it can be shown that E(β̂1) ≈ 1 − 5.3/T. So with 20 years of quarterly data, you would expect β̂1 ≈ 0.934.
A Monte Carlo with 100 replications gives:

    Variable |   Obs       Mean    Std. Dev.        Min        Max
        RES1 |   100   .9270481    .0570009   .7792342   1.010915
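A single replication of that Monte Carlo can be sketched in Stata as follows (illustrative only; in recent versions rnormal()/runiform() replace the older random-number functions):

clear
set obs 80                        // 20 years of quarterly data
gen t = _n
tsset t
gen u = invnorm(uniform())        // standard normal shocks
gen y = sum(u)                    // random walk: y_t = u_1 + ... + u_t
reg y L.y                         // the estimated lag coefficient is biased below 1

Repeating this many times and averaging the estimated coefficients reproduces a mean close to 1 − 5.3/T ≈ 0.93.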
2) non normal distribution
If a regressor has a stochastic trend, then OLS t-statistics have a nonnormal distribution under the
null hypothesis. One important case in which it is possible to tabulate this distribution is in the
context of an AR with unit root; we will go back to this.
3) spurious regression
US inflation was rising from the mid-60s through the early 80’s, so was the Japanese GDP over
the same period.
. reg inflation gdp_jp if daten>=19651 & daten<=19814, robust
Regression with robust standard errors      Number of obs = 68, F(1, 66) = 113.38, Prob > F = 0.0000, R-squared = 0.5605, Root MSE = 2.2989

   inflation |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      gdp_jp |   .1871328          .0175741    10.65     0.000       .152045    .2222207
       _cons |  -2.938637          .7660354    -3.84     0.000     -4.468076   -1.409198

. reg inflation gdp_jp if daten>=19821 & daten<=19994, robust
Regression with robust standard errors      Number of obs = 70, F(1, 68) = 5.49, Prob > F = 0.0221, R-squared = 0.0797, Root MSE = 1.5262

   inflation |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      gdp_jp |  -.0304821          .0130145    -2.34     0.022     -.0564521   -.0045121
       _cons |    6.30274          1.378873     4.57     0.000      3.551242    9.054239
These conflicting results arise because both series have stochastic trends.

[Figure: U.S. inflation and Japanese GDP, quarterly, 1960 to 2000; both series trend upwards over the first subperiod.]

There is no reason for the 2 series to be related: the strong relationship found over the first period is spurious and only due to the stochastic trends. One special case where estimates are reliable despite the presence of trends is when the trend component is the same for the two series; the series are then said to be cointegrated (see section 7.9).
7.7.2 Testing for unit root
The most commonly used test in practice is the Dickey-Fuller test.
* Dickey-Fuller test in the AR(1) model
In the AR(1) case, we want to test whether β1 = 1; if we cannot reject this null hypothesis then Yt contains a unit root and is not stationary (contains a stochastic trend). However, the test is best implemented by subtracting Yt−1 from both sides, so that it becomes a test of H0: δ = 0 vs H1: δ < 0 in

ΔYt = β0 + δYt−1 + ut      where δ = β1 − 1

The OLS t-stat testing δ = 0 is called the Dickey-Fuller statistic.
Note: the test is one-sided because the relevant alternative is that the series is stationary.
Regression with robust standard errors      Number of obs = 68, F(1, 66) = 4.47, Prob > F = 0.0383, R-squared = 0.0852, Root MSE = 1.7954

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
       inf_1 |  -.1559304          .0737577    -2.11     0.038     -.3031924   -.0086683
       _cons |    1.07776          .4075892     2.64     0.010      .2639821    1.891538
The DF statistic does not have a normal distribution, so the critical values are specific to the test.

Table 7.1 Critical values for the (Augmented) Dickey-Fuller test
                              10%      5%      1%
Intercept only               -2.57    -2.86   -3.43
Intercept and time trend     -3.12    -3.41   -3.96

So in the previous regression we cannot reject, at any conventional significance level, that δ = 0; the series has a unit root and is not stationary.
* Dickey-Fuller test in the AR(p) model
For an AR(p), the Dickey-Fuller test is based on the following regression:

ΔYt = β0 + δYt−1 + γ1 ΔYt−1 + γ2 ΔYt−2 + ... + γp ΔYt−p + ut      (7.7)

H0: δ = 0 vs H1: δ < 0
The ADF statistic is the OLS t-statistic testing δ = 0. If H0 is rejected, Yt is stationary.
The number of lags p needed is unknown. Studies suggest that for the ADF it is better to have too many lags rather than too few, so it is recommended to use the AIC to determine the number of lags for the ADF.
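In Stata the ADF test is available through dfuller, or can be run as the regression in (7.7) directly; a sketch for the inflation series, assuming it has been tsset:

dfuller inf, lags(4)              // intercept only, 4 lagged differences
dfuller inf, lags(4) trend        // adds a deterministic time trend
reg D.inf L.inf L(1/4)D.inf       // regression form: the ADF statistic is the t-stat on L.inf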
* Dickey-Fuller test allowing for a linear trend
Some series have an obvious linear trend (Japanese GDP), so it would be uninformative to test their stationarity without accounting for the trend. If Yt is stationary around a deterministic linear trend, the trend must be added to (7.7), which becomes:

ΔYt = β0 + αt + δYt−1 + γ1 ΔYt−1 + γ2 ΔYt−2 + ... + γp ΔYt−p + ut

If H0 is rejected, Yt is stationary around a deterministic time trend.
If the series is found to have a unit root, then the first difference of the series does not have a trend: for example, if Yt = β0 + Yt−1 + ut then ΔYt = β0 + ut is stationary.
Remark: the power of a test is the probability of rejecting a false null hypothesis (1 minus the probability of a Type II error). Monte Carlo studies have shown that unit root tests have low power: they cannot distinguish between a unit root and a stationary near-unit-root process. Thus the tests will often indicate that a series contains a unit root.
yt = 1.1 yt−1 − 0.1 yt−2 + εt
zt = 1.1 zt−1 − 0.15 zt−2 + εt

Checking for a unit root in the first process, we solve 1 − 1.1y + 0.1y² = 0, i.e. (y − 1)(0.1y − 1) = 0, giving roots y = 1 and y = 10.
For the second process the characteristic roots are 0.9405 and 0.1595; equivalently, the roots of 1 − 1.1z + 0.15z² = 0 are both greater than one in absolute value.
So the first process has a unit root and the second one is stationary.

[Figure: simulated paths of y and z over 400 periods.]
Similarly, it can be difficult to distinguish between a trend-stationary process and a unit root process with drift:

wt = 1 + 0.02t + εt
xt = 0.02 + xt−1 + εt/3

[Figure: simulated paths of w (trend stationary) and x (random walk with drift) over 400 periods.]
In the short run, the forecasts from stationary and non-stationary models will be close; however, the long-term forecasts will be quite different.
Also, the power of the unit root test is drastically affected by the data generating process. If we inappropriately omit the intercept or the time trend, the power of the UR test can go to 0. For example, omitting the trend leads to an upward bias in the estimated value of δ in:

ΔYt = β0 + αt + δYt−1 + γ1 ΔYt−1 + γ2 ΔYt−2 + ... + γp ΔYt−p + ut      (7.8)

Thus a procedure for UR testing can take the following form:
1- Use the least restrictive model (7.8) to test for a UR. UR tests have low power to reject H0, so if H0 is rejected there is no need to proceed further. If not, go to step 2.
2- Test α = 0. If the trend is significant, keep using (7.8) to test for the UR (step 1). If the trend is not significant, use (7.7) to test for the UR; if H0 is rejected conclude there is no unit root, if not, go to step 3.
3- Test β0 = 0. If the intercept is significant, go back to step 2. If not, use ΔYt = δYt−1 + Σ_{j=1}^{p} γj ΔYt−j to test for the UR.
7.8 Nonstationarity II: breaks
A second type of nonstationarity arises when the population regression function changes over the course of the sample.
A break can arise either from a discrete change in the population regression coefficients at a distinct date (a policy change) or from a gradual evolution of the coefficients over a longer period of time (a change in the structure of the economy).
If the break is not noticed, estimates will be based on the average behaviour of the series over the period and not on the true relationship at the end of the period, so forecasts will be poor.
7.8.1 Testing for breaks at a known date
To keep it simple, let's consider the ADL(1,1) model, and denote by τ the period at which the break is supposed to have happened.
Create a dummy variable Dt taking the value 0 before τ and 1 after τ; D is also interacted with Yt−1 and Xt−1:

Yt = β0 + β1 Yt−1 + δ1 Xt−1 + γ0 Dt + γ1 (Dt × Yt−1) + γ2 (Dt × Xt−1) + ut

Under the hypothesis of no break, γ0 = γ1 = γ2 = 0, which can be tested using an F-test. Under the alternative of a break, at least one of these coefficients will be different from 0. This is usually referred to as a Chow test.
This approach can be modified to check for a break in a subset of the coefficients by including only the binary variable interactions for the subset of regressors of interest.
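A Chow test of this kind can be sketched in Stata as follows; the break date 1980q1 and the variables y, x and tq are purely illustrative:

gen D    = (tq >= tq(1980q1))     // 0 before the candidate break date, 1 after
gen Dy_1 = D*L.y
gen Dx_1 = D*L.x
reg y L.y L.x D Dy_1 Dx_1, robust
testparm D Dy_1 Dx_1              // F-test of no break: gamma0 = gamma1 = gamma2 = 0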
7.8.2 Testing for a break at an unknown date
Often the date of a possible break is unknown, but you may suspect the range during which the break took place, say between τ0 and τ1. The Chow test is then computed for every date between τ0 and τ1, and the largest of the resulting F-statistics is used to test for a break at an unknown date. This is often referred to as the Quandt Likelihood Ratio (QLR) statistic. Since the QLR is the largest of a series of F-statistics, its distribution is special and depends on the number of restrictions tested, q (the number of coefficients, including the intercept, allowed to break), and on τ0 and τ1 expressed as fractions of the total sample size. For the large-sample approximation to the distribution of the QLR to be a good one, τ0 and τ1 cannot be too close to the ends of the sample. For this reason, the QLR is computed over a trimmed range, so that τ0 ≥ 0.15T and τ1 ≤ 0.85T.
The QLR test can detect a single discrete break, multiple discrete breaks and/or a slow evolution of the regression function. If there is a distinct break in the regression function, the date at which the largest Chow statistic occurs is an estimator of the break date.
Say we want to check the stability of our estimates of the determinants of inflation in the US over the 1962:I to 1999:IV period. More specifically, we are concerned that the intercept and the unemployment coefficients may have changed over time. The first period at which we can check for a structural break (0.15T into the sample) is 1967:4. So we create a dummy variable for observations after 1967:4 and interact it with the unemployment variables:
Source: Model SS = 184.330595 (df 13, MS 14.1792765), Residual SS = 283.045198 (df 138, MS 1.91246756), Total SS = 467.375793 (df 151, MS 2.90295524)
Number of obs = 152, F(13, 138) = 7.41, Prob > F = 0.0000, R-squared = 0.3944, Adj R-squared = 0.3412, Root MSE = 1.3829

        dinf |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |  -.4009554    .0824812    -4.86     0.000     -.5639484   -.2379623
      dinf_2 |  -.3433158    .0892349    -3.85     0.000     -.5196549   -.1669767
      dinf_3 |   .0545284    .0850863     0.64     0.523     -.1136126    .2226693
      dinf_4 |   -.038809    .0754606    -0.51     0.608     -.1879284    .1103105
     unemp_1 |  -1.719641    1.254766    -1.37     0.173     -4.199214    .7599307
     unemp_2 |    3.46834    2.364168     1.47     0.144     -1.203546    8.140225
     unemp_3 |  -3.370699    2.164944    -1.56     0.122     -7.648893    .9074963
     unemp_4 |   1.666702    1.155521     1.44     0.151     -.6167486    3.950152
           D |   1.775541    1.839904     0.97     0.336     -1.860335    5.411417
   D_unemp_1 |  -1.225527    1.351754    -0.91     0.366     -3.896758    1.445703
   D_unemp_2 |   .2032217    2.560099     0.08     0.937     -4.855847     5.26229
   D_unemp_3 |   2.394236    2.370403     1.01     0.314      -2.28997    7.078442
   D_unemp_4 |  -1.668078    1.255425    -1.33     0.186     -4.148952    .8127955
       _cons |  -.2276938    1.757672    -0.13     0.897     -3.701068    3.245681

. testparm D-D_unemp_4
 ( 1) D = 0.0   ( 2) D_unemp_1 = 0.0   ( 3) D_unemp_2 = 0.0   ( 4) D_unemp_3 = 0.0   ( 5) D_unemp_4 = 0.0
       F(5, 148) = 0.85,  Prob > F = 0.5135
F = 0.85, so there is no evidence of a break at 1967:4. We now re-estimate this model for each candidate break date, setting D = 1 if t ≥ τ, for τ running from 1968:1 until 1993:I.
For example, a break at 1981:4 leads to
Regression with robust standard errors      Number of obs = 152, F(13, 138) = 8.42, Prob > F = 0.0000, R-squared = 0.4223, Root MSE = 1.367

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |  -.4075559          .0932063    -4.37     0.000      -.591853   -.2232587
      dinf_2 |  -.3777853          .0977229    -3.87     0.000     -.5710131   -.1845574
      dinf_3 |   .0515292          .0798247     0.65     0.520     -.1063085    .2093669
      dinf_4 |  -.0260024          .0826179    -0.31     0.753     -.1893631    .1373584
     unemp_1 |  -2.705181          .6911244    -3.91     0.000     -4.071744   -1.338618
     unemp_2 |    3.54704          1.300035     2.73     0.007      .9764752    6.117605
     unemp_3 |  -2.025859          1.188034    -1.71     0.090     -4.374964    .3232453
     unemp_4 |   .9846463          .5641419     1.75     0.083     -.1308334    2.100126
           D |  -.0729984          .9544203    -0.08     0.939     -1.960177     1.81418
   D_unemp_1 |  -.5718067          .8773241    -0.65     0.516     -2.306543    1.162929
   D_unemp_2 |   .1754026          1.576346     0.11     0.912     -2.941512    3.292317
   D_unemp_3 |    2.79729          1.599601     1.75     0.083     -.3656069    5.960186
   D_unemp_4 |  -2.432152          .8388761    -2.90     0.004     -4.090865   -.7734395
       _cons |   1.350888           .733964     1.84     0.068      -.100382    2.802157

. testparm D-D_unemp_4
 ( 1) D = 0.0   ( 2) D_unemp_1 = 0.0   ( 3) D_unemp_2 = 0.0   ( 4) D_unemp_3 = 0.0   ( 5) D_unemp_4 = 0.0
       F(5, 138) = 3.31,  Prob > F = 0.0074
With 5 restrictions, the critical values of the QLR statistic are 3.26, 3.66 and 4.53 at the 10%, 5% and 1% significance levels respectively.
So for 1981:4 we reject, at the 10% level, the null hypothesis that the coefficients on the dummy and interacted terms are all zero, and we conclude that there is a break in the series at that point for at least one of the 5 coefficients.
7.8.3 Pseudo out-of-sample forecasts
1) Choose the number of observations P for which you will generate pseudo out-of-sample forecasts, say P = 10% of the sample, and define s = T − P.
2) Estimate the regression on the shortened sample t = 1, ..., s.
3) Compute the forecast for the first period beyond the shortened sample: Ỹ_{s+1|s}.
4) Compute the forecast error: ũ_{s+1} = Y_{s+1} − Ỹ_{s+1|s}.
5) Repeat steps 2-4 for each date from T − P + 1 to T − 1 (re-estimating the regression each time).
6) The pseudo forecast errors can be examined to see if they are consistent with a stationary relationship.
For example, going back to our prediction of inflation: using data up to 1993:4 we can predict inflation for 1994:1; doing so up until 1999:4, we have 24 pseudo forecasts.
Regression with robust standard errors      Number of obs = 128, F(13, 114) = 7.37, Prob > F = 0.0000, R-squared = 0.4210, Root MSE = 1.4729

        dinf |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
      dinf_1 |  -.4190169          .0998416    -4.20     0.000     -.6168024   -.2212315
      dinf_2 |  -.3961329          .1031673    -3.84     0.000     -.6005065   -.1917593
      dinf_3 |    .039491          .0844715     0.47     0.641     -.1278463    .2068283
      dinf_4 |  -.0449508          .0860523    -0.52     0.602     -.2154198    .1255181
     unemp_1 |  -2.679112          .6980463    -3.84     0.000     -4.061936   -1.296288
     unemp_2 |   3.465039          1.325757     2.61     0.010      .8387247    6.091353
     unemp_3 |  -1.987951           1.22184    -1.63     0.106     -4.408407    .4325056
     unemp_4 |   .9924426          .5769953     1.72     0.088     -.1505805    2.135466
           D |   .4808356          1.389741     0.35     0.730      -2.27223    3.233901
   D_unemp_1 |  -.9707623          .9465191    -1.03     0.307     -2.845809    .9042847
   D_unemp_2 |   .6794326          1.700203     0.40     0.690     -2.688656    4.047521
   D_unemp_3 |   2.716406          1.821819     1.49     0.139     -.8926028    6.325415
   D_unemp_4 |  -2.525234          .9671997    -2.61     0.010     -4.441249   -.6092183
       _cons |   1.414308          .7407146     1.91     0.059     -.0530417    2.881658
The inflation rate is predicted to rise by 1.9 percentage points. But the true value is 0.9, so our forecast error is −1 percentage point.
Doing this 24 times, we find that the average forecast error is −0.37, which is significantly different from 0 (t = −2.71). This suggests that the forecasts were biased over the period, systematically forecasting higher inflation than occurred, which in turn suggests that the model has been unstable (a break).
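A minimal sketch of the pseudo out-of-sample loop in Stata, here for a simple AR(1) in dinf with an observation counter t (the window 140 to 151 is illustrative):

gen fe = .
forvalues s = 140/151 {
    quietly reg dinf dinf_1 if t <= `s'
    replace fe = dinf - (_b[_cons] + _b[dinf_1]*dinf_1) if t == `s' + 1
}
summarize fe                      // mean and spread of the pseudo forecast errors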
7.9 Cointegration
Sometimes, 2 or more series have the same stochastic trend in common. In this special case,
regression analysis can reveal long-run relationships among time series variables.
7.9.1 Cointegration and error correction
Series can move together so closely over the long run that they appear to have the same trend component; for example, the 3-month and 12-month US interest rates.

[Figure: the two interest rate series (FYFF and FYGM3), quarterly, 1959:1 to 2000:4.]

Moreover, the spread between the two series does not appear to have a trend.

[Figure: the spread between the two interest rates, 1960 to 2000.]

The two series have a common stochastic trend; they are said to be cointegrated.
Suppose Xt and Yt are integrated of order 1. If there exists a coefficient θ such that Yt − θXt is integrated of order 0 (stationary), then the 2 series are said to be cointegrated with cointegrating coefficient θ. If the 2 series are not integrated of the same order to start with, they cannot be cointegrated.
Unit root testing can be extended to test for cointegration. If Xt and Yt are cointegrated, then Yt − θXt is I(0) (the null hypothesis of a unit root is rejected); otherwise Yt − θXt is I(1).
* Testing for cointegration when θ is known
In some cases, economic theory suggests a value of θ. In this case a DF test on the series zt = Yt − θXt is conducted.
In our example, let's assume that theory suggests θ = 1. There is no trend in the spread, so we simply estimate:
. reg dspread spread_1 dspread_1 dspread_2 dspread_3 dspread_4
Source: Model SS = 20.0646226 (df 5, MS 4.01292452), Residual SS = 53.706531 (df 157, MS .342079815), Total SS = 73.7711536 (df 162, MS .455377491)
Number of obs = 163, F(5, 157) = 11.73, Prob > F = 0.0000, R-squared = 0.2720, Adj R-squared = 0.2488, Root MSE = .58488

     dspread |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
    spread_1 |  -.2506278    .0719562    -3.48     0.001     -.3927548   -.1085007
   dspread_1 |   -.283247     .091436    -3.10     0.002     -.4638504   -.1026437
   dspread_2 |   .0230289    .0910197     0.25     0.801     -.1567521      .20281
   dspread_3 |  -.0599991    .0895151    -0.67     0.504     -.2368085    .1168102
   dspread_4 |    .048277    .0791148     0.61     0.543     -.1079897    .2045436
       _cons |   .1548892     .063015     2.46     0.015      .0304227    .2793557
Lags    AIC
4      -1.049
3      -1.059
2      -1.063
1      -1.072

So our preferred model is the one with 4 lagged values of dspread. The t-stat on spread_1 is −3.48, which exceeds (in absolute value) the 1% critical value of the ADF, so we reject the null hypothesis that δ = 0: the spread does not have a unit root and is therefore I(0). The 2 interest rate series are cointegrated.
* Testing for cointegration when θ is unknown
In general θ is unknown, and the cointegrating coefficient must be estimated prior to testing for a unit root. This preliminary step makes it necessary to use different critical values for the subsequent unit root test.
Step 1: estimate Yt = α + θXt + νt      (7.12)
Step 2: a Dickey-Fuller t-test is used to test for a unit root in the residuals ν̂t from (7.12).
This procedure is called the Engle-Granger Augmented Dickey-Fuller (EGADF) test. Critical values for the EGADF are:
Number of X's in (7.12)     10%      5%      1%
1                          -3.12    -3.41   -3.96
2                          -3.52    -3.80   -4.36
3                          -3.84    -4.16   -4.73
4                          -4.20    -4.49   -5.07
. reg dnu nu_1 dnu_1 dnu_2 dnu_3 dnu_4
Source: Model SS = 31.2052888 (df 5, MS 6.24105775), Residual SS = 45.5212134 (df 157, MS .289944035), Total SS = 76.7265022 (df 162, MS .473620384)
Number of obs = 163, F(5, 157) = 21.53, Prob > F = 0.0000, R-squared = 0.4067, Adj R-squared = 0.3878, Root MSE = .53846

         dnu |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
        nu_1 |  -.5739985    .1150186    -4.99     0.000     -.8011821   -.3468149
       dnu_1 |  -.1574595    .1139771    -1.38     0.169     -.3825858    .0676667
       dnu_2 |   .0752181    .1052652     0.71     0.476     -.1327006    .2831369
       dnu_3 |   .0053021    .0974368     0.05     0.957     -.1871541    .1977583
       dnu_4 |   .1237554    .0782992     1.58     0.116     -.0309003     .278411
       _cons |   .0016953    .0421806     0.04     0.968     -.0816193    .0850099
We reject the null hypothesis of a unit root in the residual series (t = −4.99 on nu_1, beyond the EGADF critical values), so the two series are cointegrated.
* Error correction model
If 2 series are cointegrated, their time paths are influenced by any deviation from the long-run equilibrium: if the system is to return to the long-run equilibrium, some of the variables must respond to the magnitude of the disequilibrium. For example, if the gap between short- and long-term interest rates is large, arbitrageurs will intervene in the market, so that the disequilibrium gap disappears. If 2 series are cointegrated, then the forecasts of ΔYt and ΔXt can be improved by including an error correction term.
If Xt and Yt are cointegrated, one way to eliminate the stochastic trend is to compute the series Yt − θXt, which is stationary and can be used for analysis. The term Yt − θXt is called the error correction term:

ΔYt = β0 + β1 ΔYt−1 + ... + βp ΔYt−p + γ1 ΔXt−1 + ... + γq ΔXt−q + α1 (Yt−1 − θXt−1) + ut

Similarly, we also have:

ΔXt = β0' + β1' ΔYt−1 + ... + βp' ΔYt−p + γ1' ΔXt−1 + ... + γq' ΔXt−q + α2 (Yt−1 − θXt−1) + vt

If θ is unknown, then the error correction models can be estimated using the estimated residual ν̂t−1.
Interest rates change according to stochastic shocks and to the previous period's deviation from the long-term equilibrium Yt − θXt = 0. The alphas can be interpreted as speeds of adjustment.
The absence of Granger causality for cointegrated variables requires that the speed of adjustment is 0 as well as all the gammas (respectively, all the betas); of course at least one of the alphas has to be non-zero for the 2 series to be cointegrated.
For ΔYt to be I(0), Yt−1 − θXt−1 needs to be I(0), since the error term and all the first-difference terms are I(0); hence the 2 series are cointegrated CI(1,1).
The Engle-Granger cointegration procedure (a minimal Stata sketch follows this list):
1- Test the integration order of both series using DF tests; both series must be integrated of the same order to stand a chance of being cointegrated.
2- Estimate the long-run relationship Yt = α + θXt + νt and test ν̂ for stationarity; if it is I(0), then Y and X are cointegrated.
3- Estimate the error correction model; since all terms are stationary, the usual test statistics apply.
4- Assess model adequacy.
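A sketch with hypothetical I(1) series y and x (the residual-based unit root test must use the EGADF critical values above, not the standard DF ones):

reg y x                                     // step 2: long-run relationship
predict ehat, resid
reg D.ehat L.ehat L(1/4)D.ehat              // ADF-type regression on the residuals
reg D.y L(1/4)D.y L(1/4)D.x L.ehat          // step 3: error correction model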
Back to our example: we have shown that, assuming θ = 1, the 2 series are cointegrated.
. reg dfyff dfyff_1-dfyff_4 dfy3m_1-dfy3m_4 spread_1
Source: Model SS = 69.5220297 (df 9, MS 7.72466996), Residual SS = 249.54597 (df 153, MS 1.63101941), Total SS = 319.067999 (df 162, MS 1.96955555)
Number of obs = 163, F(9, 153) = 4.74, Prob > F = 0.0000, R-squared = 0.2179, Adj R-squared = 0.1719, Root MSE = 1.2771

       dfyff |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
     dfyff_1 |  -.0014132    .2136881    -0.01     0.995     -.4235733    .4207468
     dfyff_2 |  -.0264828    .2208415    -0.12     0.905     -.4627751    .4098095
     dfyff_3 |   .1002626    .2129522     0.47     0.638     -.3204438    .5209689
     dfyff_4 |   .1444413    .1802188     0.80     0.424     -.2115972    .5004798
     dfy3m_1 |   .0068489    .2541142     0.03     0.979     -.4951767    .5088745
     dfy3m_2 |  -.1758844     .275382    -0.64     0.524     -.7199263    .3681576
     dfy3m_3 |   .2220654    .2653096     0.84     0.404     -.3020777    .7462086
     dfy3m_4 |  -.3159166    .2272404    -1.39     0.166     -.7648506    .1330174
    spread_1 |  -.4598352    .1585354    -2.90     0.004     -.7730361   -.1466342
       _cons |   .2955998    .1381308     2.14     0.034      .0227098    .5684897
The lagged spread does help to predict the change in the interest rate (the coefficient on spread_1 is significant, t = −2.90).
Chapter 9: Limited dependent variables
Limited dependent variables are variables that:
- only take 2 values (working / not working),
- have no specific order (black / white; membership / no membership),
- take a limited number of discrete values (number of children),
- are not numerical (mode of transport: walk, bus, cycle, car).
9.1 The linear probability model
We are interested in the determinants of staying in post-compulsory education among children aged 16-17. We observe whether they are currently receiving some schooling, but not the total amount of schooling they will receive. This outcome is clearly binary, and we believe it to be a function of paternal pay.
. reg stilled lndadpay, robust
Regression with robust standard errors      Number of obs = 15688, F(1, 15686) = 398.77, Prob > F = 0.0000, R-squared = 0.0267, Root MSE = .45686

     stilled |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
    lndadpay |   .1250107          .0062602    19.97     0.000        .11274    .1372814
       _cons |   -.070091          .0388235    -1.81     0.071     -.1461895    .0060075
So when dad's log pay = 5, the predicted value of stilled is 0.55. How can we interpret this coefficient? If the father has pay of 5 log points, the probability of the child still being in education is estimated to be 55%.
[Figure: still in education (0/1) and fitted values plotted against lndadpay.]
The population regression function is the expected value of Y given the regressors X:

E(Y | X1, X2, ..., Xk) = Pr(Y = 1 | X1, X2, ..., Xk)

For a binary variable, the predicted value from the population regression is the probability that Y = 1 given X. In the context of a binary dependent variable, this model is called the linear probability model.
β1 is the change in the probability that Y = 1 associated with a unit change in X1, and Ŷ is the predicted probability that the dependent variable equals 1.
So if father's log income increases from 5 to 6, the probability of staying in post-compulsory education increases by 12.5 percentage points (statistically significant).
This model has all the characteristics of OLS models; since the errors are always heteroskedastic, it is essential to use heteroskedasticity-robust standard errors in the regression.
[Figure: OLS residuals plotted against lndadpay; the residuals range from −1.03 to 0.91.]
Note: for the linear probability model, the R² is meaningless. In the continuous OLS model, it is possible to imagine a situation where all the points are exactly on the regression line and therefore R² equals one. In the linear probability model this is not possible, since Y can only take the values 0 or 1.
What is the effect of gender on staying on, holding father's income constant?
Regression with robust standard errors      Number of obs = 15688, F(2, 15685) = 251.92, Prob > F = 0.0000, R-squared = 0.0329, Root MSE = .45542

     stilled |      Coef.   Robust Std. Err.       t     P>|t|     [95% Conf. Interval]
    lndadpay |   .1245059          .0062399    19.95     0.000      .1122751    .1367368
         sex |  -.0729526          .0072599   -10.05     0.000     -.0871829   -.0587224
       _cons |  -.0288025          .0388944    -0.74     0.459     -.1050401     .047435
Males are 7 percentage points less likely to remain past the compulsory schooling age (statistically significant).
Shortcomings of the linear probability model:
Looking back at our first figure, the linear model predicts values of the probability lower than 0 and greater than 1, which is nonsensical. Thus specific models have been developed to deal with limited dependent variables.
9.2 Probit and logit regression
Probit and logit are non-linear regressions. Because the object we are modelling, a probability, must lie between 0 and 1, it makes sense to adopt a non-linear formulation forcing the predicted values to be between 0 and 1. Cumulative probability distribution functions (cdfs) have this property:
- probit is based on the normal cdf;
- logit is based on the logistic cdf.
Say that we have an underlying specification

Y' = β0 + β1 X + ε

where Y' is not observed; all we observe is the dichotomous variable Y such that

Y = 0 if Y' ≤ Y'c
Y = 1 otherwise

9.2.1 Probit regression
The probit model has the following form:

Pr(Y = 1 | X) = Φ(β0 + β1 X)      (9.1)

where Φ is the cumulative normal distribution function.
[Figure: the probit (normal) and logit (logistic) cdfs plotted against the index.]

This function is clearly non-linear, which makes the interpretation of the estimated coefficients quite difficult.
Coefficients of probit and logit models are obtained by maximum likelihood (see below); for the moment just assume that we have estimated the following output.
. probit stilled lndadpay
Iteration 0:  log likelihood = -9729.7596
Iteration 1:  log likelihood = -9517.2802
Iteration 2:  log likelihood = -9517.12

Probit estimates      Number of obs = 15688, LR chi2(1) = 425.28, Prob > chi2 = 0.0000, Pseudo R2 = 0.0219, Log likelihood = -9517.12

     stilled |      Coef.   Std. Err.       z     P>|z|     [95% Conf. Interval]
    lndadpay |   .3601607    .0175923    20.47    0.000      .3256805    .3946409
       _cons |  -1.681938     .106433   -15.80    0.000     -1.890543   -1.473333
All we can safely say here is that lndadpay has a significant positive effect on the probability of staying on in education.
To interpret the coefficients, it is usually advocated to calculate the estimated (change in) probabilities:
1) calculate z = β̂0 + β̂1 X;
2) look up the value of Φ(z) in the normal distribution table;
3) redo steps 1 and 2 with X + 1.
In this example we find that, when log dad wage = 5: z = −1.68 + .361*5 = 0.125, so P(Y=1|X) = Φ(0.125) = 0.54.
If dad's pay is now 6, we find z = 0.486, so P = 0.68.
As in the OLS case, probit estimates will be biased if determinants of Y that are correlated with X are not included in the regression, so in general you will estimate a multivariate probit. The difficulty again comes with the interpretation of the coefficients:
1) calculate z = β̂0 + β̂1 X1 + β̂2 X2. Say you are interested in the effect of X1 on Y; you then fix X2 at a specific value (usually its sample mean);
2) look up the value of Φ(z) in the normal distribution table;
3) change X1 to X1 + 1, but keep X2 at the fixed value used in step 1;
4) look up the new value of Φ(z).
Example:
Iteration 0:  log likelihood = -9729.7596
Iteration 1:  log likelihood = -9467.9643
Iteration 2:  log likelihood = -9467.6864

Probit estimates      Number of obs = 15688, LR chi2(2) = 524.15, Prob > chi2 = 0.0000, Pseudo R2 = 0.0269, Log likelihood = -9467.6864

     stilled |      Coef.   Std. Err.       z     P>|z|     [95% Conf. Interval]
    lndadpay |   .3599531     .017623    20.43    0.000      .3254127    .3944935
         sex |  -.2109057    .0212435    -9.93    0.000     -.2525422   -.1692692
       _cons |  -1.567503    .1071921   -14.62    0.000     -1.777596   -1.357411
We are interested in the effect of gender on the probability of staying on. All we can say for the moment is that boys are significantly less likely to stay than girls. But by how much?
Traditionally, we estimate this effect for an individual with the mean characteristics on the other variables:
su lndadpay if e(sample)

    Variable |     Obs       Mean    Std. Dev.        Min        Max
    lndadpay |   15688   6.069131    .6056018   1.278584   9.497148
so for girls we have:
z=-1.567+.3599*6.069-.211*0 = 0.617
P(Y=1/girl)=73%
whilst for boys:
z=-1.567+.3599*6.069-.211*1 = 0.406
P(Y=1/boy)=66%
At the mean of dad's log pay, the difference in the probability of staying on between boys and girls is thus 7 percentage points.
If we had estimated these probabilities for poorer families (say log dad pay = 3), we would have found:
z = -1.567 + .3599×3 - .211×0 = -0.487   P(Y=1/girl) = 31%
z = -1.567 + .3599×3 - .211×1 = -0.698   P(Y=1/boy) = 24%
The difference is again 7 percentage points, but from a lower base: the gender dummy shifts the probability of staying on in post-compulsory education.
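These hand calculations can be reproduced directly after the probit. A minimal sketch, again assuming Stata's normal() function and the coefficient estimates reported above:

* gender gap in the staying-on probability, evaluated at the mean of lndadpay
display normal(-1.567503 + .3599531*6.069131)               // girls: ≈ .73
display normal(-1.567503 + .3599531*6.069131 - .2109057)    // boys:  ≈ .66

* the same gap for poorer families (log dad pay = 3)
display normal(-1.567503 + .3599531*3)                      // girls: ≈ .31
display normal(-1.567503 + .3599531*3 - .2109057)           // boys:  ≈ .24

Stata commands such as dprobit or mfx automate this kind of marginal-effect calculation, but doing it by hand makes clear exactly what is being held fixed.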
9.2.2 Logit regression
The logit model is similar to the probit model, except that the cumulative normal distribution is replaced by the cumulative logistic distribution:
Pr(Y  1 / X 1 , X 2 ,.., X k )  F (  0   1 X 1  ...   k X k )

1
1  exp[ (  0   1 X 1  ...   k X k )
As for the probit, the coefficients of the logit model are estimated by maximum likelihood. The
ML estimator is consistent and normally distributed in large samples, so that the t-statistics and
confidence intervals can be constructed in the usual way. Once again, the coefficients have no easy interpretation, and changes in the probabilities must be calculated to be informative.
Probit and logit models give almost identical predicted probabilities, except in the tails of the distribution. It can be shown that the coefficients of probit and logit models are related approximately as follows:
 l  1.66  p
. logit stilled lndadpay sex

Logit estimates                              Number of obs   =      15688
                                             LR chi2(2)      =     516.86
                                             Prob > chi2     =     0.0000
Log likelihood = -9471.3283                  Pseudo R2       =     0.0266

------------------------------------------------------------------------------
     stilled |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    lndadpay |   .5911763   .0300632    19.66   0.000     .5322536     .650099
         sex |  -.3515223    .035303    -9.96   0.000     -.420715   -.2823297
       _cons |  -2.584355   .1822928   -14.18   0.000    -2.941642   -2.227068
------------------------------------------------------------------------------
so the estimated staying-on probability for a girl whose dad is at the average log pay is:
Pr(Y=1/girl) = F(-2.584 + .591×6.07) = F(1.003)
Pr(Y=1/girl) = 1/(1 + exp(-1.003)) ≈ 73%
as was found using the probit model.
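The logit arithmetic can be checked the same way. A minimal sketch, assuming the estimates above and Stata's invlogit() function, which returns exp(z)/(1+exp(z)):

* probability of staying on for a girl at the mean of lndadpay, from the logit
display invlogit(-2.584355 + .5911763*6.069131)    // ≈ .73, as with the probit

* rough check of the rule of thumb that logit coefficients ≈ 1.66 × probit coefficients
display .3599531*1.66     // ≈ .60, against the logit coefficient of .59 on lndadpay
display -.2109057*1.66    // ≈ -.35, against the logit coefficient of -.35 on sex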
While probits are easier to understand since they rely on the normal distribution, logit models were historically quicker to compute, which explains their popularity.
9.3 Estimation of logit and probit models
Probit and logit models are non-linear and therefore cannot be estimated by OLS. Instead they are estimated by maximum likelihood.
The likelihood function is the joint probability distribution of the data, treated as a function of the unknown coefficients. The maximum likelihood estimator (MLE) consists of the values of the coefficients that maximize the likelihood function: the MLE chooses the parameter values that maximize the probability of drawing the data that are actually observed, i.e. the parameter values most likely to have generated the data.
9.3.1 MLE for n iid Bernoulli random variables
Say that we have n observations on a Bernoulli random variable. Because the observations are independently distributed, the joint distribution is the product of the individual distributions.
Thus: Pr(Y1 = y1, Y2 = y2, ..., Yn = yn) = Pr(Y1 = y1) × ... × Pr(Yn = yn)
The Bernoulli distribution is such that: Pr(Y = y) = p^y (1-p)^(1-y)
Hence:
Pr(Y1 = y1, ..., Yn = yn) = [p^y1 (1-p)^(1-y1)] × [p^y2 (1-p)^(1-y2)] × ... × [p^yn (1-p)^(1-yn)]
                          = p^(y1+...+yn) (1-p)^(n-(y1+...+yn))          (9.3)
The maximum likelihood estimator of p is the value of p that maximizes the likelihood function in (9.3).
Let S = Σi Yi (the number of 1s in the sample); the likelihood function is then:
f_Bernoulli(p; Y1, .., Yn) = p^S (1-p)^(n-S)
It is usually easier to maximize the log likelihood, which has the same maximum since log is a monotonic function. The log likelihood is:
log f(p; Y1, .., Yn) = S ln(p) + (n-S) ln(1-p)
And its derivative with respect to p is:
d ln(f)/dp = S/p - (n-S)/(1-p)
A maximum is obtained when this derivative equals 0, which gives the MLE p̂ = S/n.
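As a quick numerical illustration (the numbers here are purely hypothetical, not from the dataset used above), the Bernoulli log likelihood can be evaluated at a few candidate values of p to see that it peaks at S/n:

* hypothetical example: S = 7 successes out of n = 10 observations,
* so the MLE should be p = 7/10 = .7
local S = 7
local n = 10
foreach p in .5 .6 .7 .8 .9 {
    display "p = `p'   log likelihood = " `S'*ln(`p') + (`n'-`S')*ln(1-`p')
}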
9.3.2 MLE probit and logit
Assuming the n observations are independent, the probit log likelihood function is given by:
ln(f_probit) = Σi Yi ln[Φ(β0 + β1 X1i + ... + βk Xki)] + Σi (1 - Yi) ln[1 - Φ(β0 + β1 X1i + ... + βk Xki)]
where the sums run over i = 1, ..., n.
There is no easy way to solve this equation, so the probit likelihood function must be maximized
using a numerical algorithm.
Similarly, the logit likelihood function must be maximized numerically.
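The iteration log in the Stata output above is exactly this numerical maximization at work. As a rough check (a sketch, assuming the estimation data are still in memory and that the second probit above was run as probit stilled lndadpay sex), the reported log likelihood can be recomputed from the fitted probabilities:

* recompute the probit log likelihood from the fitted probabilities;
* it should match the value reported by Stata (-9467.6864 in the output above)
quietly probit stilled lndadpay sex
predict double phat                  // predicted Pr(Y=1/X), the default after probit
generate double lnL_i = stilled*ln(phat) + (1 - stilled)*ln(1 - phat)
quietly summarize lnL_i
display "log likelihood = " r(sum)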
The goodness of fit of an MLE model is measured by the pseudo R²:
pseudo R² = 1 - ln(f_probit^max) / ln(f_Bernoulli^max)
where f_probit^max is the value of the maximized probit likelihood and f_Bernoulli^max is the value of the maximized Bernoulli likelihood (a probit model excluding all the Xs).
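As a check against the first probit output above, Iteration 0 corresponds to the constant-only (Bernoulli) fit, with log likelihood -9729.7596, while the final log likelihood is -9517.12. So pseudo R² = 1 - (-9517.12)/(-9729.7596) ≈ 0.0219, which is exactly the value Stata reports; similarly, LR chi2(1) = 2 × (9729.7596 - 9517.12) = 425.28.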