Mysterious Endogeneity
Haiyan Wang
Zach Andersen
11/18/2014
• Introduction to path analysis
• Review of PLS
• Model justification
• Review of the lavaan package
• Example of the matrix form
• Simulation
– Nonrecursive models (causal loops)
• OLS results are greatly biased
– Recursive models (one direction) (recall PLS path models)
• OLS and path analysis both work fine
• We will focus on the nonrecursive case, since no one else has covered it so far
– Methods
• Instrumental variables
• Implied covariance matrix
• Both approaches work equally well
– We will cover the implied covariance matrix since this is a multivariate course
• PLS goals
• "uncover a common structure among blocks of variables" [2]
• No covariance structure: does not assume a ground truth (focuses on what the data tell you)
• Does not seek causal relationships, only relationships
• What does PLS do?
• "obtain score values of latent variables for prediction purposes" [2]
• [1] From Tim and Jennifer's slides
• [2] Gaston, "Partial Least Squares with R"
• SEM goal
– "Test and estimate the (causal) relationships among observable measures and non-observable theoretical (or latent) variables" [1]
• What does SEM do?
– Seeks to approximate a ground truth by fitting a covariance model to observed covariances
• [1] Jiyoon and Kiran's SEM presentation
• [2] Gaston, "Partial Least Squares with R"
• Goal: "determines whether your theoretical model successfully accounts for the actual relationships in the sample data" [1]
– Like SEM, unlike PLS path analysis
• What does path analysis (manifest) do?
– Fits a covariance model: seeks an approximation of a ground truth [2]
• Like SEM, unlike PLS path analysis
– Uses manifest variables
• Unlike either SEM or PLS path analysis
• [1] "A Step-by-Step Approach to Using SAS for Factor Analysis and Structural Equation Modeling" by Larry Hatcher
• Uses the implied covariance matrix as the link between your data and your model [3]
– The implied covariance matrix relates the model to your data's observed variances and covariances
• The estimated parameters are those that make the observed variances and covariances match as closely as possible those implied by the model
• [3] Source: http://www.sagepub.com/upm-data/39916_Chapter2.pdf
[Path diagram (Hatcher, Figure 4.3) relating Intelligence, Supervisory Support, Work Place Norms, Motivation, and Work Performance]
1: "A Step-by-Step Approach to Using SAS for Factor Analysis and Structural Equation Modeling" by Larry Hatcher, Figure 4.3
• Uses only manifest variables (no latent variables)
• Allows the user to specify effects of exogenous variables on endogenous variables (single arrows)
• Allows the user to specify covariances between antecedent variables (double arrows)
• Allows recursive (one direction) and nonrecursive (more than one direction) models
• "A Step-by-Step Approach to Using SAS for Factor Analysis and Structural Equation Modeling" by Larry Hatcher
• The causal model must have enough equations to solve for the unknown parameters
– Otherwise there are an infinite number of solutions
• Sufficient observations: a rough rule of thumb is 5 observations for every parameter to be estimated
• "A Step-by-Step Approach to Using SAS for Factor Analysis and Structural Equation Modeling" by Larry Hatcher
• The model is written as simultaneous equations from the path diagram
– One equation per endogenous variable
• Over-parameterized model
– # of parameters > # of equations: no unique solution
• Just-identified model
– # of parameters = # of equations: a unique solution
• Under-parameterized model
– # of parameters < # of equations
• Use the weighted least squares or ML method to find a solution that makes the two sides of the equations as close as possible. http://www.sagepub.com/upm-data/39916_Chapter2.pdf
• Model 1
– X + Y = 2 (2 parameters, 1 equation)
– Over-parameterized model: infinitely many solutions
• Model 2
– X + Y = 2; X − Y = 10 (2 parameters, 2 equations)
– Just-identified model: one solution
• Model 3
– X + Y = 2; X − Y = 10; 2X + Y = 5 (2 parameters, 3 equations)
– Under-parameterized model: can only approximate a solution (see the R sketch below)
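A minimal R sketch (not from the original slides) of these cases, using base R only; the point is that the just-identified system has one exact solution while the under-parameterized system can only be solved approximately, e.g. by least squares:

# Model 2: just-identified -- two equations, two unknowns, one exact solution
A2 <- matrix(c(1,  1,
               1, -1), nrow = 2, byrow = TRUE)
b2 <- c(2, 10)
solve(A2, b2)      # X = 6, Y = -4

# Model 3: more equations than parameters -- no exact solution;
# least squares picks the (X, Y) that makes the two sides as close as possible
A3 <- matrix(c(1,  1,
               1, -1,
               2,  1), nrow = 3, byrow = TRUE)
b3 <- c(2, 10, 5)
qr.solve(A3, b3)   # least-squares solution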
Jiyoon and Kiran's SEM presentation
SAS/STAT 13.1 User's Guide: the CALIS procedure
• Our model specification
Variables: DispInc, FoodCons, FoodCostRatio, RatioPrecYear, Year

library(lavaan)
Q = FoodCons
P = FoodCostRatio
D = DispInc
F = RatioPrecYear
Y = Year
data.k = data.frame(Q, P, D, F, Y)
econ.mod = 'Q ~ P + D
            P ~ Q + F + Y
            Q ~~ P'
fit <- sem(econ.mod, data = data.k)
• Summary function output discussion (see the sketch below)
– Degrees of freedom
– Regression estimates (direct and indirect effects)
– Variances
– R squared
• Chi-square likelihood ratio test
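A hedged sketch of how these quantities can be read off the lavaan fit object created above; summary() and parameterEstimates() are standard lavaan functions, though the exact options the authors used are not shown in the slides:

# assumes `fit` is the sem() object from the model specification slide
summary(fit, fit.measures = TRUE, rsquare = TRUE)  # df, estimates, variances, R-squared, chi-square test
parameterEstimates(fit)                            # direct effects, variances, and covariances as a data frame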
• Supply-and-demand model example (SEM)
• Simulation
When a researcher has a model in mind, he always asks himself: Is my model good enough? How can I test whether my model is good or not?
As Dr. Westfall said in the ISQS 5347 class, a model produces data. A good model should produce data that are close to the real data. This implies that we can test the null hypothesis:
Σ = Σ(λ)
The chi-square likelihood ratio test is one of the methods we can use for this test. (Dr. Westfall, ISQS 6348 class, 10/14/2014)
• Our goal is to test the null hypothesis Σ = Σ(λ), where Σ is the observed covariance matrix (unrestricted model), λ is the vector of parameters to be estimated, and Σ(λ) is the covariance matrix implied by our model (restricted model).
The null distribution of the test statistic can be approximated by a chi-square distribution with (df1 − df2) degrees of freedom, where df1 and df2 are the degrees of freedom of the unrestricted model Σ and the restricted model Σ(λ), respectively.
• In other words, the number of degrees of freedom of the unrestricted model is the number of equations we have.
• The number of degrees of freedom of the restricted model is the number of parameters in our model.
(The fit function and test statistic are sketched below.)
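For reference, a sketch of the usual maximum-likelihood discrepancy function and likelihood-ratio statistic behind this test (standard SEM results stated from general background, not taken from these slides):

$$
F_{ML}(\lambda) = \ln\lvert\Sigma(\lambda)\rvert + \mathrm{tr}\!\left(\Sigma\,\Sigma(\lambda)^{-1}\right) - \ln\lvert\Sigma\rvert - p,
\qquad
T = (N-1)\,F_{ML}(\hat\lambda) \approx \chi^2_{df_1 - df_2} \text{ under } H_0,
$$

where Σ is the observed covariance matrix, p is the number of manifest variables, and N is the sample size.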
Example: Simultaneous Equations with Mean Structures and Reciprocal Path

The supply-and-demand food example of Kmenta (1971, pp. 565, 582):

$$Q_t^{demand} = \alpha_1 + \beta_1 P_t + \gamma_1 D_t + E_1 \qquad (1)$$
$$Q_t^{supply} = \alpha_2 + \beta_2 P_t + \gamma_2 F_t + \gamma_3 Y_t + E_2 \qquad (2)$$

for t = 1, ..., 20, with the equilibrium condition $Q_t^{demand} = Q_t^{supply}$.

The model is specified by two simultaneous equations containing two endogenous variables, Q and P, and three exogenous variables, D, F, and Y.
https://support.sas.com/documentation/onlinedoc/stat/131/calis.pdf
To estimate this model, each endogenous variable must appear on the left-hand side of exactly one equation. Rewrite the second equation as a function of $P_t$:

$$P_t = -\frac{\alpha_2}{\beta_2} + \frac{1}{\beta_2} Q_t - \frac{\gamma_2}{\beta_2} F_t - \frac{\gamma_3}{\beta_2} Y_t + \frac{1}{\beta_2} E_2$$

or, equivalently, reparameterized as:

$$P_t = \theta_1 + \theta_2 Q_t + \theta_3 F_t + \theta_4 Y_t + E_2 \qquad (3)$$
[Path diagram for the supply-and-demand model: the exogenous variables D_t, F_t, and Y_t (with freely estimated variances and covariances) point to the endogenous variables, with D_t → Q_t carrying γ1, F_t → P_t carrying θ3, and Y_t → P_t carrying θ4; the reciprocal paths are P_t → Q_t with coefficient β1 and Q_t → P_t with coefficient θ2; E_1 and E_2 are the error terms of Q_t and P_t.]
We will have 8 parameters to be estimated in our restricted model: β1, γ1, θ2, θ3, θ4, Var(E_1), Var(E_2), and Cov(E_1, E_2).
Number of equations = p(p + 1)/2, where p is the number of manifest variables. (Dr. Westfall's notes)
• In our supply-and-demand example we have 5 manifest variables (Q, P, D, F, Y), so the number of equations is 5·6/2 = 15.
• But 6 of those equations involve only variances and covariances among exogenous manifest variables, which are not explained by any model, so the total number of equations is 15 − 6 = 9.
Observed covariance matrix (from the unrestricted model):

$$
\Sigma =
\begin{pmatrix} \Sigma_{YY} & \Sigma_{YX} \\ \Sigma_{XY} & \Sigma_{XX} \end{pmatrix}
=
\begin{pmatrix}
\sigma_{QQ} & \sigma_{QP} & \sigma_{QD} & \sigma_{QF} & \sigma_{QY} \\
\sigma_{PQ} & \sigma_{PP} & \sigma_{PD} & \sigma_{PF} & \sigma_{PY} \\
\sigma_{DQ} & \sigma_{DP} & \sigma_{DD} & \sigma_{DF} & \sigma_{DY} \\
\sigma_{FQ} & \sigma_{FP} & \sigma_{FD} & \sigma_{FF} & \sigma_{FY} \\
\sigma_{YQ} & \sigma_{YP} & \sigma_{YD} & \sigma_{YF} & \sigma_{YY}
\end{pmatrix}
$$

The 6 distinct elements of the exogenous block Σ_XX (the variances and covariances among D, F, and Y) are the 6 equations not included when counting the total number of equations.
Now you know we have 8 parameters and 9 equations, and you have probably already figured out that this example is the under-parameterized case.
But you may be curious about what those "equations" look like and what the mystery is behind Σ = Σ(λ).
The general matrix representation of simultaneous equation models is:
Y = ΒY + ΓX + E
http://www.sagepub.com/upm-data/39916_Chapter2.pdf
Table 1. Notation for Simultaneous Equation Models

Vector/Matrix | Definition | Dimensions
Y | Endogenous variables | p×1
X | Exogenous variables | q×1
E | Disturbance (error) terms | p×1
Γ | Coefficient matrix for exogenous variables; direct effects of X on Y | p×q
Β | Coefficient matrix for endogenous variables; direct effects of Y on Y | p×p
Φ | Covariance matrix of X | q×q
Ψ | Covariance matrix of E | p×p
Rewrite equations (1) and (3) in matrix form:

$$
\underbrace{\begin{pmatrix} Q_t \\ P_t \end{pmatrix}}_{Y}
=
\underbrace{\begin{pmatrix} 0 & \beta_1 \\ \theta_2 & 0 \end{pmatrix}}_{\mathrm{B}}
\underbrace{\begin{pmatrix} Q_t \\ P_t \end{pmatrix}}_{Y}
+
\underbrace{\begin{pmatrix} \gamma_1 & 0 & 0 \\ 0 & \theta_3 & \theta_4 \end{pmatrix}}_{\Gamma}
\underbrace{\begin{pmatrix} D_t \\ F_t \\ Y_t \end{pmatrix}}_{X}
+
\underbrace{\begin{pmatrix} E_1 \\ E_2 \end{pmatrix}}_{E}
$$
Y = ΒY + ΓX + E

Step 1: (I − Β)Y = ΓX + E
Step 2: Let C = (I − Β); then CY = ΓX + E
Step 3: C⁻¹CY = C⁻¹ΓX + C⁻¹E, so Y = C⁻¹ΓX + C⁻¹E (reduced form)

Step 4:
$$
\begin{pmatrix} Y \\ X \end{pmatrix}
=
\begin{pmatrix} C^{-1}\Gamma & C^{-1} \\ I & \mathbf{0} \end{pmatrix}
\begin{pmatrix} X \\ E \end{pmatrix}
$$

Step 5:
$$
\Sigma(\lambda) = \mathrm{Cov}\begin{pmatrix} Y \\ X \end{pmatrix}
=
\begin{pmatrix} C^{-1}\Gamma & C^{-1} \\ I & \mathbf{0} \end{pmatrix}
\mathrm{Cov}\begin{pmatrix} X \\ E \end{pmatrix}
\begin{pmatrix} C^{-1}\Gamma & C^{-1} \\ I & \mathbf{0} \end{pmatrix}'
$$

(Looks familiar?)
$$
\mathrm{Cov}\begin{pmatrix} Y \\ X \end{pmatrix}
=
\begin{pmatrix} C^{-1}\Gamma & C^{-1} \\ I & \mathbf{0} \end{pmatrix}
\mathrm{Cov}\begin{pmatrix} X \\ E \end{pmatrix}
\begin{pmatrix} C^{-1}\Gamma & C^{-1} \\ I & \mathbf{0} \end{pmatrix}'
=
\begin{pmatrix} C^{-1}\Gamma & C^{-1} \\ I & \mathbf{0} \end{pmatrix}
\begin{pmatrix} \Sigma_{xx} & \mathbf{0} \\ \mathbf{0} & \Sigma_{EE} \end{pmatrix}
\begin{pmatrix} C^{-1}\Gamma & C^{-1} \\ I & \mathbf{0} \end{pmatrix}'
$$

$$
=
\begin{pmatrix} C^{-1}\Gamma\Sigma_{xx} & C^{-1}\Sigma_{EE} \\ \Sigma_{xx} & \mathbf{0} \end{pmatrix}
\begin{pmatrix} \Gamma'(C^{-1})' & I \\ (C^{-1})' & \mathbf{0} \end{pmatrix}
=
\begin{pmatrix}
\underbrace{C^{-1}\Gamma\Sigma_{xx}\Gamma'(C^{-1})' + C^{-1}\Sigma_{EE}(C^{-1})'}_{\text{①}} &
\underbrace{C^{-1}\Gamma\Sigma_{xx}}_{\text{②}} \\
\underbrace{\Sigma_{xx}\Gamma'(C^{-1})'}_{\text{③}} &
\underbrace{\Sigma_{xx}}_{\text{④}}
\end{pmatrix}
$$
Structures Behind the Fitted Covariance Matrix

$$
\mathrm{Cov}\begin{pmatrix} Y \\ X \end{pmatrix}
=
\begin{pmatrix}
\underbrace{C^{-1}\Gamma\Sigma_{xx}\Gamma'(C^{-1})' + C^{-1}\Sigma_{EE}(C^{-1})'}_{\text{①}\ (2\times 2)} &
\underbrace{C^{-1}\Gamma\Sigma_{xx}}_{\text{②}\ (2\times 3)} \\
\underbrace{\Sigma_{xx}\Gamma'(C^{-1})'}_{\text{③}\ (3\times 2)} &
\underbrace{\Sigma_{xx}}_{\text{④}\ (3\times 3)}
\end{pmatrix}
=
\begin{pmatrix}
\sigma_{QQ} & \sigma_{QP} & \sigma_{QD} & \sigma_{QF} & \sigma_{QY} \\
\sigma_{PQ} & \sigma_{PP} & \sigma_{PD} & \sigma_{PF} & \sigma_{PY} \\
\sigma_{DQ} & \sigma_{DP} & \sigma_{DD} & \sigma_{DF} & \sigma_{DY} \\
\sigma_{FQ} & \sigma_{FP} & \sigma_{FD} & \sigma_{FF} & \sigma_{FY} \\
\sigma_{YQ} & \sigma_{YP} & \sigma_{YD} & \sigma_{YF} & \sigma_{YY}
\end{pmatrix}
$$

There are three equations hidden in ① and six equations hidden in ②.
For instance:

$$
C^{-1}\Gamma\Sigma_{xx}\Gamma'(C^{-1})' + C^{-1}\Sigma_{EE}(C^{-1})'
=
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
$$

$$
B_{11} =
\frac{\gamma_1^2\sigma_{DD} + \beta_1^2\theta_3^2\sigma_{FF} + \beta_1^2\theta_4^2\sigma_{YY}
+ 2\gamma_1\beta_1\theta_3\sigma_{DF} + 2\gamma_1\beta_1\theta_4\sigma_{DY} + 2\beta_1^2\theta_3\theta_4\sigma_{FY}
+ \sigma_{E_1 E_1} + 2\beta_1\sigma_{E_1 E_2} + \beta_1^2\sigma_{E_2 E_2}}
{(1 - \beta_1\theta_2)^2}
= \sigma_{QQ}
\qquad \text{(Equation 1)}
$$

Note: θ2 = 1/β2, θ3 = −γ2/β2, θ4 = −γ3/β2.

If you are interested in the other 8 equations, they are on my scratch paper; I would be glad to share them with you after class.
Degrees of freedom = (# of equations) − (# of parameters).
In the supply-and-demand example, the degrees of freedom = 9 − 8 = 1.
Let's use the R output to check whether we calculated the df correctly.
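The original slides showed the R output here; a hedged sketch of a lavaan call that reports the same quantities (assuming `fit` is the lavaan object fitted earlier):

fitMeasures(fit, c("chisq", "df", "pvalue"))  # df should be 1: 9 equations - 8 parameters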
• Check Σ(λ)= Σ
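A hedged sketch of this check in lavaan (both accessors are standard lavaan functions; the slides showed screenshots of the corresponding output):

fitted(fit)$cov                   # Sigma(lambda): model-implied (fitted) covariance matrix
lavInspect(fit, "sampstat")$cov   # Sigma: observed sample covariance matrix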
• In this supply-and-demand example, since we don't know the true model, it seems hard to say whether our estimated model is the best model.
• What is the best way to check? Simulation!
True model:
y2 = 1.0 · y1 + e2
y1 = 0.5 · y2 + 1.0 · x + e1
http://courses.ttu.edu/isqs6348-westfall/images/6348/simeqnbias.htm
[Path diagram for the simulation model: x → y1 with coefficient 1.0, y2 → y1 with coefficient 0.5, y1 → y2 with coefficient 1.0; e1 and e2 are the error terms of y1 and y2, with variances to be estimated.]
In this simulation example we have 5 parameters to be estimated.
Rewrite the simultaneous equations in matrix form, Y = ΒY + ΓX + E:

$$
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
=
\begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
+
\begin{pmatrix} 1 \\ 0 \end{pmatrix} x
+
\begin{pmatrix} e_1 \\ e_2 \end{pmatrix}
$$
Y = C⁻¹ΓX + C⁻¹E (reduced form):

$$
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
=
\begin{pmatrix} 2 \\ 2 \end{pmatrix} x
+
\begin{pmatrix} 2 & 1 \\ 2 & 2 \end{pmatrix}
\begin{pmatrix} e_1 \\ e_2 \end{pmatrix}
$$

That is,
y1 = 2.0 · x + 2.0 · e1 + e2
y2 = 2.0 · x + 2.0 · e1 + 2.0 · e2

We will use the reduced form to simulate our data (see the R code).
R code for the simulation:

## Simulate x and the residuals
e1 = rnorm(10000, 0, 1)
e2 = rnorm(10000, 0, 1)
x  = rnorm(10000, 0, 1)
## Plug the simulated x and residuals into the reduced-form model to get y1 and y2
y1 = 2*x + 2*e1 + e2
y2 = 2*x + 2*e1 + 2*e2
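A sketch (using the same simulated data) of how the OLS and SEM columns of the comparison table at the end of the slides can be produced; the lavaan model string is our assumption about how the nonrecursive model was specified:

library(lavaan)

sim.dat <- data.frame(y1, y2, x)

# OLS: ignores the simultaneity, so the estimates of the reciprocal paths are biased
coef(lm(y2 ~ y1, data = sim.dat))      # true coefficient on y1 is 1.0
coef(lm(y1 ~ y2 + x, data = sim.dat))  # true coefficients are 0.5 and 1.0

# SEM: estimates the two equations jointly (5 free parameters, df = 0)
sim.mod <- 'y1 ~ y2 + x
            y2 ~ y1'
sim.fit <- sem(sim.mod, data = sim.dat)
summary(sim.fit)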
$$
\Sigma(\lambda) = \mathrm{Cov}\begin{pmatrix} Y \\ X \end{pmatrix}
=
\begin{pmatrix} C^{-1}\Gamma & C^{-1} \\ I & \mathbf{0} \end{pmatrix}
\mathrm{Cov}\begin{pmatrix} X \\ E \end{pmatrix}
\begin{pmatrix} C^{-1}\Gamma & C^{-1} \\ I & \mathbf{0} \end{pmatrix}'
=
\begin{pmatrix}
\underbrace{C^{-1}\Gamma\Sigma_{xx}\Gamma'(C^{-1})' + C^{-1}\Sigma_{EE}(C^{-1})'}_{\text{①}\ (2\times 2)} &
\underbrace{C^{-1}\Gamma\Sigma_{xx}}_{\text{②}\ (2\times 1)} \\
\underbrace{\Sigma_{xx}\Gamma'(C^{-1})'}_{\text{③}\ (1\times 2)} &
\underbrace{\Sigma_{xx}}_{\text{④}\ (1\times 1)}
\end{pmatrix}
$$

$$
=
\begin{pmatrix} \Sigma_{YY} & \Sigma_{YX} \\ \Sigma_{XY} & \Sigma_{XX} \end{pmatrix}
=
\begin{pmatrix}
\sigma_{y_1 y_1} & \sigma_{y_1 y_2} & \sigma_{x y_1} \\
\sigma_{y_2 y_1} & \sigma_{y_2 y_2} & \sigma_{x y_2} \\
\sigma_{x y_1} & \sigma_{x y_2} & \sigma_{xx}
\end{pmatrix}
= \Sigma
$$

(Notice that df = 5 − 5 = 0: the model is just-identified.)
$$
C^{-1}\Gamma\Sigma_{xx}\Gamma'(C^{-1})' + C^{-1}\Sigma_{EE}(C^{-1})'
=
\begin{pmatrix} 2 & 1 \\ 2 & 2 \end{pmatrix}
\begin{pmatrix} 1 \\ 0 \end{pmatrix}
\sigma_{XX}
\begin{pmatrix} 1 & 0 \end{pmatrix}
\begin{pmatrix} 2 & 2 \\ 1 & 2 \end{pmatrix}
+
\begin{pmatrix} 2 & 1 \\ 2 & 2 \end{pmatrix}
\begin{pmatrix} \sigma_{e_1 e_1} & \sigma_{e_1 e_2} \\ \sigma_{e_2 e_1} & \sigma_{e_2 e_2} \end{pmatrix}
\begin{pmatrix} 2 & 2 \\ 1 & 2 \end{pmatrix}
$$

$$
=
\begin{pmatrix} 2 & 1 \\ 2 & 2 \end{pmatrix}
\begin{pmatrix} 1 \\ 0 \end{pmatrix}
\cdot 1 \cdot
\begin{pmatrix} 1 & 0 \end{pmatrix}
\begin{pmatrix} 2 & 2 \\ 1 & 2 \end{pmatrix}
+
\begin{pmatrix} 2 & 1 \\ 2 & 2 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} 2 & 2 \\ 1 & 2 \end{pmatrix}
=
\begin{pmatrix} 9 & 10 \\ 10 & 12 \end{pmatrix}
$$

$$
C^{-1}\Gamma\Sigma_{xx}
=
\begin{pmatrix} 2 & 1 \\ 2 & 2 \end{pmatrix}
\begin{pmatrix} 1 \\ 0 \end{pmatrix}
\cdot 1
=
\begin{pmatrix} 2 \\ 2 \end{pmatrix}
$$

Here σ_XX = 1 and Cov(e1, e2) = I (based on the simulation assumptions).
Plugging in the numbers:

$$
\Sigma(\lambda) = \mathrm{Cov}\begin{pmatrix} Y \\ X \end{pmatrix}
=
\begin{pmatrix} 9 & 10 & 2 \\ 10 & 12 & 2 \\ 2 & 2 & 1 \end{pmatrix}
=
\begin{pmatrix}
\sigma_{y_1 y_1} & \sigma_{y_1 y_2} & \sigma_{x y_1} \\
\sigma_{y_2 y_1} & \sigma_{y_2 y_2} & \sigma_{x y_2} \\
\sigma_{x y_1} & \sigma_{x y_2} & \sigma_{xx}
\end{pmatrix}
= \Sigma \quad \text{(Is this true?)}
$$
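A hedged sketch that reproduces this calculation numerically with plain matrix algebra in R and compares it with the observed covariance matrix of the simulated data (y1, y2, x from the simulation code above):

B     <- matrix(c(0, 0.5,
                  1, 0), nrow = 2, byrow = TRUE)   # coefficients of Y on Y
Gamma <- matrix(c(1, 0), nrow = 2)                 # coefficients of Y on X
Cinv  <- solve(diag(2) - B)                        # C^{-1} = [2 1; 2 2]
Sxx   <- matrix(1)                                 # Var(x)
See   <- diag(2)                                   # Cov(e1, e2) = I

Syy <- Cinv %*% Gamma %*% Sxx %*% t(Gamma) %*% t(Cinv) + Cinv %*% See %*% t(Cinv)
Syx <- Cinv %*% Gamma %*% Sxx
Sigma.lambda <- rbind(cbind(Syy, Syx), cbind(t(Syx), Sxx))

Sigma.lambda              # implied covariance: 9 10 2 / 10 12 2 / 2 2 1
cov(cbind(y1, y2, x))     # observed covariance from the simulated data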
[R output: fitted covariance matrix vs. observed covariance matrix]
The null hypothesis is Σ(λ) = Σ; the difference between the fitted and observed covariance matrices can be explained by chance alone, and increasing the number of simulated observations makes the difference smaller and smaller (by the Law of Large Numbers).
However, one important point in this simulation example is that df = 0, so we cannot use the chi-square test to test the model.
When the model is just-identified, even if it is wrong, there is no way to test it. (Dr. Westfall)
Table 2. Comparison of OLS, SEM, and the true model (standard errors in parentheses)

Dependent variable | OLS | SEM | True
y2 = | 1.112·y1 (0.003) | 0.997·y1 (0.005) | 1.0·y1
y1 = | 0.746·y2 + 0.512·x (0.002) (0.009) | 0.507·y2 + 0.990·x (0.007) (0.017) | 0.5·y2 + 1.0·x

OLS is widely regarded as giving biased estimates for simultaneous equations, which suffer from an endogeneity problem. The comparison above clearly shows why the OLS estimates are biased and why SEM is the better model.
You may also be curious about what happens if we have more parameters than equations.