EJH92B analysis

Elaboration, explanation and specification in graphical models
Svend Kreiner
Dept. of Biostatistics, University of Copenhagen
Part I: The elaboration paradigm and graphical models
Part II: A bootstrap evaluation of the confidence of graphical models provided by model
search procedures and the estimates of associations between variables based on these models
1
Association between intelligence measured in 1968 and income
reported 25 years later
I = Intelligence (rows), D = Income (columns)

I      |  <100.  100-1  150-1  200-2  250-3  350-4  500.0 | TOTAL |
-------+--------------------------------------------------+-------+
<26    |     22     42     82     56     20      9      5 |   236 |
   row%|    9.3   17.8   34.7   23.7    8.5    3.8    2.1 | 100.0 |
26-30  |     23     55     99     81     26     21      1 |   306 |
   row%|    7.5   18.0   32.4   26.5    8.5    6.9    0.3 | 100.0 |
31-35  |     37    100    147    113     40     28     14 |   479 |
   row%|    7.7   20.9   30.7   23.6    8.4    5.8    2.9 | 100.0 |
36-40  |     35    108    148    137     44     52     16 |   540 |
   row%|    6.5   20.0   27.4   25.4    8.1    9.6    3.0 | 100.0 |
41+    |     43     92    160    213    109     96     68 |   781 |
   row%|    5.5   11.8   20.5   27.3   14.0   12.3    8.7 | 100.0 |
-------+--------------------------------------------------+-------+
TOTAL  |    160    397    636    600    239    206    104 |  2342 |
   row%|    6.8   17.0   27.2   25.6   10.2    8.8    4.4 | 100.0 |
-------+--------------------------------------------------+-------+
γ = 0.20, p < 0.0005
2
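The γ reported under the table is the Goodman-Kruskal gamma (the coefficient named in the Davis 1967 reference). As a minimal sketch, it can be recomputed from the printed counts by counting concordant and discordant pairs. The function names are illustrative and are not taken from the software that produced the table. With these counts the result comes out close to the reported 0.20.

```python
# Counts copied from the table: rows are the intelligence groups
# (<26, 26-30, 31-35, 36-40, 41+), columns are the seven income groups.
TABLE = [
    [22, 42, 82, 56, 20, 9, 5],
    [23, 55, 99, 81, 26, 21, 1],
    [37, 100, 147, 113, 40, 28, 14],
    [35, 108, 148, 137, 44, 52, 16],
    [43, 92, 160, 213, 109, 96, 68],
]


def concordant_discordant(table):
    """Count concordant and discordant pairs of cases in an ordinal table."""
    concordant = discordant = 0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            for lower_row in table[i + 1:]:
                # Partner is higher on both variables.
                concordant += count * sum(lower_row[j + 1:])
                # Partner is higher on the row variable, lower on the column variable.
                discordant += count * sum(lower_row[:j])
    return concordant, discordant


def gamma(table):
    """Goodman-Kruskal gamma = (C - D) / (C + D)."""
    c, d = concordant_discordant(table)
    return (c - d) / (c + d)


print(f"gamma = {gamma(TABLE):.2f}")
```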
The category collapsed table
                           Intelligence
Income                  <35     36-40       41+
                          %         %         %
-199.000               59.4      53.9      37.8
200.000 - 249.000      24.5      25.4      27.3
250.000 - 499.000      14.1      17.8      26.2
500.000+                2.0       3.0       8.7
n                      1021       540       781
3
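The collapsed table can be checked against the full table by summing cells over merged categories and taking percentages within each intelligence group. The sketch below assumes which original categories were merged. That grouping is inferred from the truncated column labels and the reported group sizes, not stated on the slides. The resulting percentages agree with the collapsed table to within rounding in the last digit.

```python
# Counts from the full Intelligence x Income table.
TABLE = [
    [22, 42, 82, 56, 20, 9, 5],
    [23, 55, 99, 81, 26, 21, 1],
    [37, 100, 147, 113, 40, 28, 14],
    [35, 108, 148, 137, 44, 52, 16],
    [43, 92, 160, 213, 109, 96, 68],
]

# ASSUMPTION: the merges below are inferred, not given in the source.
ROW_GROUPS = {"<35": [0, 1, 2], "36-40": [3], "41+": [4]}
COL_GROUPS = {
    "-199.000": [0, 1, 2],
    "200.000 - 249.000": [3],
    "250.000 - 499.000": [4, 5],
    "500.000+": [6],
}


def collapse(table, row_groups, col_groups):
    """Sum the cell counts over merged row and column categories."""
    return {
        r: {c: sum(table[i][j] for i in rows for j in cols)
            for c, cols in col_groups.items()}
        for r, rows in row_groups.items()
    }


def percentages(collapsed):
    """Income distribution (in percent) within each intelligence group."""
    result = {}
    for group, cells in collapsed.items():
        n = sum(cells.values())
        result[group] = {c: 100 * v / n for c, v in cells.items()}
    return result


collapsed = collapse(TABLE, ROW_GROUPS, COL_GROUPS)
for group, cells in percentages(collapsed).items():
    n = sum(collapsed[group].values())
    print(group, n, {c: round(p, 1) for c, p in cells.items()})
```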
The graphical model underlying category collapsibility
Income ⊥⊥ Intelligence | Collapsed income
Income ⊥⊥ Intelligence | Collapsed intelligence
4
Elaboration, explanation and specification
A paradigm for quantitative sociological research
Lazarsfeld, PF (1946): Interpretation of Statistical Relation as a Research Operation
Lazarsfeld, PF & Kendall, PL (1950): Problems of Survey Research
Lazarsfeld, PF & Rosenberg, M (eds) (1955): The Language of Social Research
Rosenberg, M (1962): Test factor Standardization as a Method of Interpretation
Davis, JA (1967): A partial coefficient for Goodman and Kruskal's Gamma
Rosenberg, M (1968): The Logic of Survey Analysis
Davis, JA (1975): Analyzing contingency tables with linear flow graphs
Davis, JA (1980): Contingency table analysis: proportions and flow graphs
Davis, JA (1984): Extending Rosenberg's Technique for Standardizing Percentage Tables
5
Elaboration, explanation and specification
Elaboration - Analysis of conditional association
Explanation - Z explains the association between X and Y if X ⊥⊥ Y | Z
Specification - Description of conditional associations that cannot be explained by other relevant variables. Is the strength of the X-Y association constant across levels of Z?
6
Graphical models and elaboration
Graphical models are models defined by explanations obtained during attempts to
elaborate associations in a multivariate set of variables.
Graphical models provide the solution to the problem that killed the elaboration
paradigm:
The problem - which variables are needed in order to explain or specify the association between two variables?
The solution - is given by the global Markov properties and decompositions of the Markov graph
7
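The global Markov property says that if a set of variables Z separates X and Y in the Markov graph, then X and Y are conditionally independent given Z. A minimal sketch of the separation check follows. The graph and the vertex names are hypothetical and are not the model from this talk. It assumes x and y are not themselves in the separating set.

```python
from collections import deque


def separates(edges, x, y, z):
    """Return True if every path from x to y passes through a vertex in z."""
    blocked = set(z)
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    # Breadth-first search from x that never enters a blocked vertex.
    seen, queue = {x}, deque([x])
    while queue:
        v = queue.popleft()
        for w in neighbours.get(v, ()):
            if w == y:
                return False  # reached y without crossing z
            if w not in seen and w not in blocked:
                seen.add(w)
                queue.append(w)
    return True


# Hypothetical undirected graph with two paths from X to Y: via Z and via W.
EDGES = [("X", "Z"), ("Z", "Y"), ("X", "W"), ("W", "Y")]

print(separates(EDGES, "X", "Y", {"Z"}))       # the path through W is still open
print(separates(EDGES, "X", "Y", {"Z", "W"}))  # both paths are blocked
```

In this toy graph, conditioning on Z alone would not be enough to explain an X-Y association. Both Z and W are needed.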
Describing associations in graphical models
Both collapse onto for analysis of the AD association
Decomposition implies parametric and inference collapsibility. The loglinear model: ACD. Specification requires analysis.
Separation implies parametric collapsibility. The loglinear model: AD,AC,CD. Specification by the model.
8
The graphical model
The coefficients on edges are partial γ coefficients
9
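A partial γ in the sense of Davis (1967) pools the concordant and discordant pairs counted within each level of the control variables before forming the ratio. The sketch below illustrates that pooling. The two strata are hypothetical counts chosen only to make the arithmetic easy to follow.

```python
def concordant_discordant(table):
    """Count concordant and discordant pairs in one ordinal two-way table."""
    concordant = discordant = 0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            for lower_row in table[i + 1:]:
                concordant += count * sum(lower_row[j + 1:])
                discordant += count * sum(lower_row[:j])
    return concordant, discordant


def partial_gamma(strata):
    """Pool C and D over the strata, then compute (C - D) / (C + D)."""
    c_total = d_total = 0
    for table in strata:
        c, d = concordant_discordant(table)
        c_total += c
        d_total += d
    return (c_total - d_total) / (c_total + d_total)


# HYPOTHETICAL data: X by Y tables, one per level of a control variable Z.
STRATA = [
    [[30, 10], [10, 30]],  # C = 900, D = 100
    [[20, 20], [20, 20]],  # C = 400, D = 400
]

print(round(partial_gamma(STRATA), 3))  # (1300 - 500) / (1300 + 500)
```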
Specification of the intelligence – Income association
The model collapses onto the 5-dimensional table containing Income, Education, School, Intelligence and Sex with respect to the Income – Intelligence association.
Partial γ = 0.08 - a very weak positive association
Nothing more can be said about the association between Intelligence and Income within the inference frame defined by graphical models?
10
Loglinear modelling on top of the graphical model
The graphical model is loglinear, but higher order interactions are always included.
The graphical model therefore assumes that the Income-Intelligence association is
modified by Education, School and Sex
If we, during specification of associations, conclude/assume that the association is
constant across levels defined by other variables, then the model has to be replaced by
a loglinear model.
Collapsibility properties of graphical models guarantee that all parameters relating to
the Intelligence-Income association are included in the marginal table including
Education, School and Sex.
Specification of the Intelligence-Income association therefore only requires analysis of
the 5-way table with these variables.
11
The marginal loglinear model
The marginal model is saturated.
The Income-Educ-School and Educ-School-Intel-Sex interactions are fixed in the
model. Attempts to examine these interactions in the 5-way table will tell us nothing
about the associations between these variables in the full model.
The results of analyses of the other interaction parameters in the 5-way table also apply
to the full model.
At the end of the day:
No evidence against
Income-Intelligence,
Income-Educ-Sex, Income-School-Sex,
Income-Educ-School,
Educ-School-Intel-Sex
The Income-Intelligence association is constant over all levels of the other variables of
the model. (Observed partial γ = 0.079; fitted partial γ = 0.050 under the loglinear
model).
12
The estimation problem
The properties of estimates are well-known under the model.
The model itself, being a result of a model search procedure, is however in itself an
estimate.
Very little is known about the properties of both
1) the estimate of the model
2) the estimates of unknown parameters based on the model estimate
Non-parametric bootstrapping is one way to examine these properties.
13
The model search procedure in this example
(much better strategies are available – but not discussed here)
1) Initial screening (Kreiner, 1986) of 2- and 3-way tables defining a starting
point for a proper model search procedure
2) Stepwise naïve p-value driven search (backwards – and forwards). P-values
are Monte Carlo estimates (Kreiner, 1987) based on 400 random tables for
each hypothesis. Significance is evaluated at a 1 % critical level
The screening will – apart from statistical errors – identify the parts of the models
defined as strings and trees defined by cliques sharing only one edge.
Naïve p-value driven model search procedures are not consistent.
Type II errors may disappear as n increases, but type I errors (spurious edges) will
continue to turn up.
14
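The search step relies on Monte Carlo estimates of p-values based on 400 random tables per hypothesis. One generic way to obtain such an estimate for a two-way table is to generate random tables with the observed margins by permuting one variable across cases, then count how often the test statistic is at least as extreme as the observed one. The sketch below does this with γ as the statistic. It is an illustration under stated assumptions, not the procedure of Kreiner (1987): the two-sided comparison and the example counts are hypothetical choices.

```python
import random


def gamma(table):
    """Goodman-Kruskal gamma; returns 0.0 if there are no untied pairs."""
    concordant = discordant = 0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            for lower_row in table[i + 1:]:
                concordant += count * sum(lower_row[j + 1:])
                discordant += count * sum(lower_row[:j])
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


def monte_carlo_p(table, n_tables=400, seed=0):
    """Estimate a two-sided p-value from random tables with fixed margins."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    # One row label and one column label per case, in matching order.
    row_labels = [i for i, row in enumerate(table) for c in row for _ in range(c)]
    col_labels = [j for row in table for j, c in enumerate(row) for _ in range(c)]
    observed = abs(gamma(table))
    extreme = 0
    for _ in range(n_tables):
        rng.shuffle(col_labels)  # permuting keeps both margins fixed
        random_table = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(row_labels, col_labels):
            random_table[i][j] += 1
        if abs(gamma(random_table)) >= observed:
            extreme += 1
    return extreme / n_tables


# HYPOTHETICAL 3x3 ordinal table with a clear positive association.
EXAMPLE = [[20, 10, 5], [10, 20, 10], [5, 10, 20]]
print(monte_carlo_p(EXAMPLE))
```

With 400 random tables the estimate has a resolution of 1/400, so it can support a 1 % critical level but not much finer distinctions.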
How reliable is the estimate of the association between
intelligence and income?
G = the "true" graphical model
The partial γ is estimated relative to G.
If G is known then $\hat\gamma(G)$ has nice asymptotic properties.
If Intelligence and Income are connected in G then $P(\hat\gamma(G) \mid G) \approx \mathrm{Norm}(\gamma, \sigma^2)$
If Intelligence and Income are disconnected in G then $\hat\gamma(G) = 0$
G is not known. γ must therefore be estimated relative to the estimate of the model, $\hat G$.
Let $\tilde\gamma = \hat\gamma(\hat G)$ be the estimate under $\hat G$.
The distribution of $\tilde\gamma = \hat\gamma(\hat G)$ is not known and (probably) not nice.
The properties of $\hat G$ and $\tilde\gamma = \hat\gamma(\hat G)$ can be examined by naïve nonparametric bootstrapping.
15
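The bootstrap idea can be sketched on the two-way Intelligence x Income table: resample the 2342 cases with replacement, rerun the model selection on each resample, and record the estimate, set to zero when the edge is not selected. The selection rule `edge_selected` below is a deliberately crude, hypothetical stand-in for the screening and stepwise search described in the talk, and the marginal γ stands in for the partial γ. Only the resampling logic is the point.

```python
import random
from collections import Counter

# Counts from the Intelligence x Income table.
TABLE = [
    [22, 42, 82, 56, 20, 9, 5],
    [23, 55, 99, 81, 26, 21, 1],
    [37, 100, 147, 113, 40, 28, 14],
    [35, 108, 148, 137, 44, 52, 16],
    [43, 92, 160, 213, 109, 96, 68],
]


def gamma(table):
    concordant = discordant = 0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            for lower_row in table[i + 1:]:
                concordant += count * sum(lower_row[j + 1:])
                discordant += count * sum(lower_row[:j])
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


def resample(table, rng):
    """Draw n cases with replacement from the observed cell distribution."""
    cells = [(i, j) for i in range(len(table)) for j in range(len(table[0]))]
    weights = [table[i][j] for i, j in cells]
    draws = Counter(rng.choices(cells, weights=weights, k=sum(weights)))
    return [[draws[(i, j)] for j in range(len(table[0]))]
            for i in range(len(table))]


def edge_selected(table, threshold=0.05):
    # HYPOTHETICAL stand-in for the model search: keep the edge when
    # |gamma| exceeds an arbitrary threshold.
    return abs(gamma(table)) > threshold


def bootstrap_gamma(table, n_boot=200, seed=1):
    """Bootstrap estimates of gamma, with 0.0 when the edge is dropped."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        boot = resample(table, rng)
        estimates.append(gamma(boot) if edge_selected(boot) else 0.0)
    return estimates


estimates = bootstrap_gamma(TABLE)
print(sum(estimates) / len(estimates))
```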
How stable is the model estimate?
Solid lines: 95+ % confidence
Normal lines: 80-94.9 % confidence
Dashed lines: 20-79.9 % confidence
Mean edge entropy = 0.317 (0.904/0.096 distribution)
Mean number of departures from data model = 6.5 (18.0 %)
16
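The link between the mean edge entropy and the quoted 0.904/0.096 distribution can be checked with the binary entropy function. The sketch assumes natural logarithms, which is an inference from the numbers rather than something the slides state: a 0.904/0.096 split has an entropy of about 0.316, close to the reported 0.317.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a present/absent indicator with P(present) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


print(round(binary_entropy(0.904), 3))
```

An edge that appears in 90.4 % of the bootstrap models (or is absent in 90.4 % of them) is thus about as uncertain as the average edge in this analysis.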
Estimating partial γ coefficients
The model found in the original data collapses on a table with Income, Intelligence, Education, School, Sex for estimation of the partial γ coefficient.
21.8 % of the bootstrapped models collapse on the same table even though none of the bootstrapped models are equal to the data model.
17
Bootstrap distribution of collapsibility properties
            Frequency   Percent   Valid Percent   Cumulative Percent
FGM               110      22,0            22,0                 22,0
BCFGLM             86      17,2            17,2                 39,1
FGLM               65      13,0            13,0                 52,1
BCFGM              52      10,4            10,4                 62,5
BCFGKM             28       5,6             5,6                 68,1
BCFGKLM            24       4,8             4,8                 72,9
CFGLM              22       4,4             4,4                 77,2
BFGLM              19       3,8             3,8                 81,0
BFGM               17       3,4             3,4                 84,4
FLM                17       3,4             3,4                 87,8
BCFLM              11       2,2             2,2                 90,0
FGKM                9       1,8             1,8                 91,8
FKM                 9       1,8             1,8                 93,6
BCFKLM              6       1,2             1,2                 94,8
CFGM                4        ,8              ,8                 95,6
CFKLM               4        ,8              ,8                 96,4
BFLM                3        ,6              ,6                 97,0
CFGKM               3        ,6              ,6                 97,6
BFGKM               2        ,4              ,4                 98,0
CFLM                2        ,4              ,4                 98,4
FGKLM               2        ,4              ,4                 98,8
FKLM                2        ,4              ,4                 99,2
FM                  2        ,4              ,4                 99,6
BFGKLM              1        ,2              ,2                 99,8
CFGKLM              1        ,2              ,2                100,0
Total             501     100,0           100,0
18
Partial γ coefficient estimated in tables defined by decomposition
[Histogram, reduchyp: decomposition. x-axis: gamma (-0,1000 to 0,2000); y-axis: Frequency. Mean = 0,071078, Std. Dev. = 0,0403192, N = 501]
19
Estimated partial γ coefficients under different collapsibility assumptions
[Plot. x-axis: Included variables; y-axis: Means of partial gamma coefficients (axis values 0,0500, 0,0750, 0,0878, 0,1000, 0,1250)]
20
γ may sometimes be estimated in collapsed tables defined both by separation and by decomposition.
Collapsed tables defined by separation are smaller and the loglinear parameters of interest are the same, but separation does not guarantee collapsibility of partial γ coefficients.
These may therefore be systematically different from those estimated in collapsed tables defined by decomposition.
Are there any apparent differences when we compare the bootstrap estimates?
21
The association between γ coefficients estimated under different collapsibility conditions
[Scatter plot, reduchyp: decomposition. x-axis: gamma (-0,1000 to 0,2000); y-axis: Gamma estimated under separation (-0,10 to 0,20). R Sq Linear = 2,124E-4]
Uncorrelated estimates. Estimates from tables defined by separation are unbiased
22
No significant difference between estimates in tables defined by
separation and decomposition
Table defined by     Mean        Std. Deviation
separation           ,074336     ,0372415
decomposition        ,071078     ,0403192
Total                ,072522     ,0389971
23
The distribution of the gamma coefficients when edges are present in the model. Mean = 0.10, s.d. = 0.0216
[Histogram. x-axis: DI (0,00 to 0,20); y-axis: Frequency. Mean = 0,1015, Std. Dev. = 0,0216, N = 243]
24
The distribution of gamma coefficients including zeros implied by missing edges. Mean = 0.049, s.d. = 0.053
[Histogram. x-axis: DI (0,00 to 0,20); y-axis: Frequency. Mean = 0,0492, Std. Dev. = 0,05296, N = 501]
25
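The two histogram summaries are arithmetically linked: the second distribution is the first one (243 models where the edge is present) padded with zeros for the remaining bootstrap models. The sketch below recovers the overall mean and standard deviation from the edge-present summary. It assumes that the reported standard deviations use the n - 1 denominator, which is not stated on the slides.

```python
import math


def pooled_with_zeros(n_edge, mean_edge, sd_edge, n_total):
    """Mean and s.d. after adding (n_total - n_edge) exact zeros."""
    n_zero = n_total - n_edge
    mean_all = n_edge * mean_edge / n_total
    # Total sum of squares about the overall mean:
    # within the edge-present group, plus the shift of each group mean.
    ss = ((n_edge - 1) * sd_edge ** 2
          + n_edge * (mean_edge - mean_all) ** 2
          + n_zero * (0.0 - mean_all) ** 2)
    # ASSUMPTION: sample standard deviation with n - 1 in the denominator.
    return mean_all, math.sqrt(ss / (n_total - 1))


mean_all, sd_all = pooled_with_zeros(243, 0.1015, 0.0216, 501)
print(round(mean_all, 4), round(sd_all, 4))
```

Both values land close to the reported 0,0492 and 0,05296, so the drop from 0.10 to 0.049 is entirely due to the 258 models in which the edge is missing.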