Lecture 8: Spurious effects, causal modeling

SPS 580 Lecture 8 Dummy vars 2.0 Spurious effects Causal models
I. CONSTRUCTED VARIABLES 1.0 -- BRANCH QUESTIONS

LSTRAIL (BRANCH question): In the past year, has any adult in your household gone to a trail for walking, hiking or bicycling?

Code  Label         N
1     Yes       5,334
2     No        3,795
8     DK           26
      Total     9,155
ASKED: 1992, 1993, 1996

LSTRAILX [ASK: IF YES]: How many times in the past year did any adult from your household use a trail for walking, hiking or bicycling?

Code  Label               N
1     Once              165
2     2-6 Times       1,236
3     7-11 Times        398
4     About 1/month     243
5     13-40 Times       642
6     Almost 1/wk       169
7     About 1/wk        118
8     53+ Times         467
98    DK                 69
      Total           3,507
ASKED: 1993, 1996

- Data set doesn't include 0 times
- Codes are not interval
TRAILuseSCALE -- part way done vs. interval

Label            Part-way code   Interval code        N
zero                   0               0          3,795
Once                   1               1            165
2-6 Times              2               4          1,236
7-11 Times             3               9            398
About 1/month          4              12            243
13-40 Times            5              20            642
Almost 1/wk            6              20            169
About 1/wk             7              20            118
53+ Times              8              20            467
DK                    98               M             69
Total                                             7,303

- Part way done: create a new variable with 0 included
- Interval: recode to make it interval
- Collapse extreme values to avoid outlier issues
- DK is missing (M), not zero
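[Figure: Number of Trail Uses / year. Percent of households at each TRAILuseSCALE value: 0 = 52%, 1 = 2%, 4 = 17%, 9 = 6%, 12 = 3%, 20 = 19%. Mean = 5.46]
[Figure: Average Trail Use by household income quarter. Lowest quarter $ = 3.1, 2nd Qtr $ = 5.0, 3rd Qtr $ = 6.9, Top Qtr $ = 7.5. Slope = 1.51]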
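The recode steps for TRAILuseSCALE can be sketched in a few lines of Python. This is only an illustration, not the SPSS syntax used for the course data set: the function name and the dictionary are hypothetical, while the code values and case counts are the ones listed in the tables above.

```python
# Hypothetical names; code values and counts come from the LSTRAIL / LSTRAILX tables.
LSTRAILX_TO_INTERVAL = {1: 1, 2: 4, 3: 9, 4: 12, 5: 20, 6: 20, 7: 20, 8: 20}

def trail_use_scale(lstrail, lstrailx):
    """Return trail uses per year on the interval scale, or None if missing."""
    if lstrail == 2:
        # "No" on the branch question means zero uses, not missing.
        return 0
    if lstrail == 1:
        # DK (98) or an unasked follow-up is not in the mapping, so it stays missing.
        return LSTRAILX_TO_INTERVAL.get(lstrailx)
    # DK (8) on the branch question is missing, not zero.
    return None

# Case counts per interval value (codes 5-8 are collapsed to 20).
COUNTS = {0: 3795, 1: 165, 4: 1236, 9: 398, 12: 243, 20: 642 + 169 + 118 + 467}

valid_n = sum(COUNTS.values())
mean_use = sum(value * n for value, n in COUNTS.items()) / valid_n
print(round(mean_use, 2))
```

Computed this way, the mean agrees with the Mean = 5.46 reported with the trail-use chart, which is a quick check that the recode and the collapsing of the top categories were applied as described.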
II. DUMMY VARIABLES 2.0 -- All 2+ nominal, ordinal variables

Marital Status
Code  Label                   %         N
1     Single                23%     8,371
2     Other adults in HH    17%     6,284
3     Partnership            4%     1,491
4     Married               56%    20,189
      Total                100%    36,335

- Policy-relevant nominal variable
- Added to data set, in syntax file
- LOOK AT INSTRUCTIONS to resolve inconsistencies; SPSS executes instructions in order

Marital status -> Household Income

[Figure: Average HH income percentile by marital status. Single = 33.2, Other adults in HH = 43.9, Partnership = 54.7, Married = 59.3]

REGRESSION ANALYSIS . . .
Rules for choosing a reference category
o Large case base
o Intuitive comparison

Dummy Var (Ref = Married)    Slope        T
Single                       -26.1    -72.1
Other adults in HH           -15.3    -38.0
Partnership                   -4.6     -6.3

- (K-1) dummies: Married is the reference category, so it gets no dummy
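A minimal sketch of how the (K-1) dummies work, in Python. The slopes are the ones in the table above; the intercept is assumed to be the Married group mean (59.3), which is what a regression on K-1 dummies returns for the reference category. The function names are hypothetical.

```python
# Slopes from the dummy variable regression (Ref = Married).
SLOPES = {"Single": -26.1, "Other adults in HH": -15.3, "Partnership": -4.6}
# Assumption: the intercept equals the mean income percentile of the reference group.
INTERCEPT = 59.3

def dummy_code(status):
    """One 0/1 dummy per non-reference category; Married is all zeros."""
    return {name: int(status == name) for name in SLOPES}

def predicted_income_percentile(status):
    """Intercept plus the slope of whichever dummy is switched on."""
    dummies = dummy_code(status)
    return INTERCEPT + sum(SLOPES[name] * dummies[name] for name in SLOPES)

for status in ["Single", "Other adults in HH", "Partnership", "Married"]:
    print(status, round(predicted_income_percentile(status), 1))
```

Each slope is the difference between that group's mean and the Married mean, so the predictions reproduce the group averages in the chart up to rounding (44.0 here versus 43.9 charted for Other adults in HH).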
MULTIPLE REGRESSION RESULTS (three-way regression)

                                  Zero order   Partial Slope   Explained by Education
Marital Status (Ref = Married)
  Single                            -26.1        -24.6 **              6%
  Other adults in HH                -15.3        -12.8 **             16%
  Partnership                        -4.6         -3.9 **             16%
Education (Ref = 0-11)
  HSG                                             14.2 **
  Some college                                    22.1 **
  College grad+                                   34.5 **
** p < .05

- Education rows: dummy variable slopes for education
- ** p < .05: how to report significance tests
- Explained by Education: amount of zero order effect due to education
Use these procedures for more complicated typologies, e.g.
o Single Men
o Men married to an employed woman
o Men married to a non-employed woman
III. CAUSAL MODELING -- A logical system for diagramming causal order and explaining causal impact

A. ZERO ORDER. Start with THEORY 1: Higher income people are more likely to be in excellent health.
Income -> Health Status (self-report, Excellent v. Good/Fair/Poor)
X1 -> Y
Proportion "Excellent" = .229 + .004*(Income percentile), T(slope) = 28
[Figure: Proportion "Excellent" Health (.00 to .70) by Income Percentile (5, 18, 29, 43, 58, 71, 83, 95)]
B. Two variable system graph

X1 --B1--> Y

- In a two variable system, squares are variables, arrows are paths, B's are slopes
- B1 is estimated with the zero order, bivariate regression of Y on X1
C. INTERVENING. THEORY 1a: The income difference is due to healthier life styles (walking, hiking, bicycling).
Income -> Outdoor Activity -> Health

DATA MINING CHECK . . . X2 -> Y
Significantly related, not curvilinear
Hiking -> health: B = .009, T = 8
[Figure: Proportion "Excellent" Health (.00 to .60) by Frequency of outdoor activity (0, 1, 4, 9, 12, 20)]

DATA MINING CHECK . . . X1 -> X2
Significantly related, not curvilinear
Outdoor activity score = 2.59 + .0613 * (Income pctile), T(slope) = 19
[Figure: Frequency of outdoor activity (0.0 to 10.0) by Income Percentile (5, 18, 29, 43, 58, 71, 83, 95)]
D. Three variable system graph, X2 intervening

X1 --B1--> Y
X1 --B2--> X2
X2 --B3--> Y

- B1, B3 are estimated with the stepwise multiple regression, X1 first, then X2
- Analyze % of zero order explained by X2
- B2 is the zero order, bivariate regression of X2 on X1

MULTIPLE REGRESSION RESULTS
                      Zero order   Partial Slope   Explained by Outdoor Activity
Income Percentile       0.0037       0.0033 **               10%
Outdoor Activity                     0.0058 **
** p < .05

- 10% of the income effect is explained by life style choices
X2 is called an intervening variable because it is caused by X1 and goes on to cause Y.
The impact of controlling the intervening variable shows how much of the causal impact of X1 (income) on Y (health) goes down this causal pathway (lifestyle choice): 10%.
o At present we derive the causal impact by subtraction (ZERO - PARTIAL)
o But in a three-variable system you can also estimate it directly by multiplying B2*B3
o ZERO ORDER B1 = PARTIAL B1 + B2 * B3
          .0037 = .0033      + .0613 * .0058
The graph tells you the regression equation to run and how to report the results.
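The identity ZERO ORDER B1 = PARTIAL B1 + B2*B3 can be checked with the slopes reported above. The Python sketch below uses only those four published (rounded) numbers; the variable names are hypothetical, and small differences from the table come from that rounding.

```python
# Slopes as reported in the lecture tables (rounded to 4 decimals).
zero_order_b1 = 0.0037   # Y on X1 alone
partial_b1 = 0.0033      # Y on X1, controlling X2
b2 = 0.0613              # X2 on X1 (outdoor activity on income percentile)
b3 = 0.0058              # Y on X2, controlling X1

# Two ways to get the part of the income effect that runs through outdoor activity.
indirect_by_subtraction = zero_order_b1 - partial_b1
indirect_by_product = b2 * b3

# The zero order slope rebuilt from the partial slope and the indirect path.
reconstructed_zero_order = partial_b1 + indirect_by_product
share_explained = indirect_by_product / zero_order_b1

print(round(indirect_by_subtraction, 4), round(indirect_by_product, 4))
print(round(reconstructed_zero_order, 4), round(100 * share_explained))
```

Both routes give an indirect effect of about .0004, roughly 10% of the zero order slope, which is the figure in the "Explained by Outdoor Activity" column.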

E. CAUSALLY PRIOR
THEORY 1b: The income difference in health status is explained by the fact that higher income people are more likely to have had a higher education, and higher educated people know more about how to live a healthy life style.
1. When X2 is causally prior, the explanation is that both X1 and Y depend on some third factor.

X2 --B2--> X1
X2 --B3--> Y
X1 --B1--> Y

- B1, B3 are estimated with the stepwise multiple regression, X1 first, then X2
- Analyze % of zero order explained by X2, using stepwise output
- B2 is the zero order, bivariate regression of X1 on X2
2. When X2 is causally prior, the amount of explanation of the ZERO ORDER relationship is referred to as a SPURIOUS COMPONENT

MULTIPLE REGRESSION RESULTS
                          Zero order   Partial Slope   Spurious result of Education
Income Percentile           0.0037       0.0025 **               34%
Education (Ref = 0-11)
  HSG                                    0.08 **
  Some college                           0.15 **
  College grad+                          0.24 **
** p < .05

- 34%: lots of explanation
- K-1 dummies comprise X2
- If any are significant then you leave them all in the equation and the report

- It is a SPURIOUS COMPONENT because it is a part of the apparent relationship between income and health that is due to the fact that both depend on a third variable -- education
- You find the spurious component by controlling for the causally prior variable (education) and looking at the remaining (PARTIAL) relationship between income and health
- You have to get the spurious component by subtraction; you can't multiply B1*B2, because that now means something else
IV. DETERMINING CAUSAL ORDER AMONG VARIABLES
If you don't know the causal order between X1 and Y, there is very little research you can do.

[Diagram: X1 and Y with unknown causal order, depicted as a 2-headed arrow (X1 <-> Y) or as 2 arrows]
o Can't use slopes, since they are measures of 1-way causal impact
o Can do correlational analysis (Chi square): there is a difference, we don't know which way it goes

- There is no statistical test for determining causal order; it has to come from your knowledge of how the world works. It is arbitrary. It is approximate.
A. Rules for assigning causal order X -> Y
a. X happened first, earlier in life ... adolescent experiences -> adult experiences
b. Y starts after X stops ... education -> earnings
c. Change in X precedes change in Y ... divorced -> happiness
d. X never changes, Y sometimes changes ... gender -> employment status
e. X doesn't change much, Y changes more often ... income -> TV usage, opinions
B. Specifying the causal order for control variables determines whether you are
o elaborating the reasons for how a causal chain works (Income -> health)
o or showing that the causal impact is not as great as people think, because some of the apparent ZERO ORDER relationship is spurious
V. A FULLY SPECIFIED SYSTEM . . . Prior and Intervening
THEORY 1c: The income difference in health status is explained partly by healthier life style and partly by the fact that higher income people are more likely to have had a higher education, and higher educated people know more about how to live a healthy life style.
A. Four variable system, X2 prior, X3 intervening
[Diagram: X2 (prior), X1, X3 (intervening) and Y, with paths labeled B1 through B6]
- Zero order, Direct, Indirect and Spurious effects to be measured
B. Calculate a three step regression model, listwise deletion. . .

Model 1: X1 ZERO ORDER
                                                  B        t       p
(Constant)                                      .1930    16.3    .000
incomeINTERVAL Percentile of HH income          .0042    20.3    .000

Model 2: X1 + prior vars (X2)
                                                  B        t       p
(Constant)                                      .1168     7.1    .000
incomeINTERVAL Percentile of HH income          .0028    12.0    .000
educHSG dummy var HSG vs 0-11                   .0996     5.1    .000
educANYCOLL dummy any coll vs 0-11              .1380     6.9    .000
educCOLLGRAD dummy coll grad vs 0-11            .2612    12.6    .000

Model 3: X1 + priors (X2) + intervening vars (X3)
                                                  B        t       p
(Constant)                                      .1099     6.7    .000
incomeINTERVAL Percentile of HH income          .0026    11.2    .000
educHSG dummy var HSG vs 0-11                   .0958     4.9    .000
educANYCOLL dummy any coll vs 0-11              .1271     6.3    .000
educCOLLGRAD dummy coll grad vs 0-11            .2436    11.6    .000
TRAILuseSCALE                                   .0045     5.6    .000
C. Analyze the full regression equation from model 3
Proportion "Excellent Health" = .11 + .0026*Income Percentile + .10*High School Grad + .13*Any College + .24*College Grad + .005*Lifestyle Choice

REGRESSION RESULTS
                           Slope        T
Income Percentile          0.0026     11.2
Education (Ref = 0-11)
  HSG                      0.0958      4.9
  Some college             0.1271      6.3
  College grad+            0.2436     11.6
Outdoor Activity           0.0045      5.6

- Summarize the slopes: comment on significance, pattern of dummies, etc.
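To see how the model 3 equation is read, the sketch below plugs values into it in Python. It uses the unrounded model 3 coefficients from the regression output and assumes the three education dummies are mutually exclusive, as the "Some college" label in the results table suggests. The function and dictionary names are hypothetical, and the example respondents are invented for illustration.

```python
# Model 3 coefficients as reported in the regression output.
CONSTANT = 0.1099
INCOME_SLOPE = 0.0026        # per income percentile point
TRAIL_SLOPE = 0.0045         # per unit of TRAILuseSCALE
# Assumption: mutually exclusive education dummies; 0-11 is the reference (no dummy).
EDUCATION_SLOPES = {"0-11": 0.0, "HSG": 0.0958, "Some college": 0.1271, "College grad+": 0.2436}

def predicted_excellent(income_percentile, education, trail_use):
    """Predicted proportion reporting excellent health under model 3."""
    if education not in EDUCATION_SLOPES:
        raise ValueError(f"unknown education level: {education}")
    return (CONSTANT
            + INCOME_SLOPE * income_percentile
            + EDUCATION_SLOPES[education]
            + TRAIL_SLOPE * trail_use)

# Two invented example respondents.
low = predicted_excellent(20, "0-11", 0)
high = predicted_excellent(80, "College grad+", 12)
print(round(low, 4), round(high, 4))
```

Because this is a linear model of a proportion, predictions are not bounded between 0 and 1; the values above are only meaningful within the range of the data.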
D. Analyze the causal connections in the system
Key Accounting equation . . .
Zero Order = Direct Effect + Sum of all spurious components + Sum of all indirect effects

CAUSAL ANALYSIS
                          Slope    % of zero order   How to get it
Zero Order Effect         0.0042                     1. X1 slope from model 1 (zero order)
Causal Effect
  Direct                  0.0026        62%          3. X1 slope from model 3 (all vars)
  Indirect                0.0002         4%          4. Zero order - spurious - direct
  Total Causal Effect     0.0028        66%          5. Direct + Indirect
Spurious Effect           0.0014        34%          2. X1 slope from model 1 minus X1 slope from model 2 (prior vars)

VI. CAVEATS, OBSERVATIONS
A. Direct effect = amount not explained so far. What is the meaning of causality? Unmeasured process variables are OK; unmeasured prior variables are not OK.
B. The slope summary allows you to say what is significant and what is not; it does not allow you to say which variable is more important, because the units are different. To compare magnitudes of slopes, you need a standardized unit. We will cover this next week.
C. The income slope changes slightly between examples due to listwise deletion.
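Returning to the key accounting equation in section V.D, the five steps of the causal analysis can be sketched in Python from the three income slopes alone. The function name and dictionary keys are hypothetical; the inputs are the rounded income slopes from models 1, 2 and 3, so percentages computed from them can differ by a point or so from the table, which was presumably built from unrounded output.

```python
def decompose(zero_order, slope_with_priors, slope_with_all):
    """Split a zero order slope into spurious, direct and indirect parts.

    zero_order        : X1 slope from model 1
    slope_with_priors : X1 slope from model 2 (prior variables controlled)
    slope_with_all    : X1 slope from model 3 (priors and intervening controlled)
    """
    spurious = zero_order - slope_with_priors      # step 2
    direct = slope_with_all                        # step 3
    indirect = zero_order - spurious - direct      # step 4
    total_causal = direct + indirect               # step 5
    return {
        "zero_order": zero_order,
        "spurious": spurious,
        "direct": direct,
        "indirect": indirect,
        "total_causal": total_causal,
    }

# Income slopes from the three step regression model.
effects = decompose(0.0042, 0.0028, 0.0026)
for name, value in effects.items():
    print(name, round(value, 4), f"{100 * value / effects['zero_order']:.0f}%")
```

By construction the direct, indirect and spurious parts add back to the zero order slope, which is the accounting equation. The direct effect is simply what is left after the measured prior and intervening variables are controlled, in line with caveat A.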
Assignment 8: