Elaboration, explanation and specification in graphical models

Svend Kreiner
Dept. of Biostatistics, University of Copenhagen

Part I: The elaboration paradigm and graphical models
Part II: A bootstrap evaluation of the confidence of graphical models provided by model search procedures and the estimates of associations between variables based on these models


Association between intelligence measured in 1968 and income reported 25 years later

Intelligence (I) by Income (D)

  I     |  <100.  100-1  150-1  200-2  250-3  350-4  500.0 | TOTAL
 -------+--------------------------------------------------+-------
  <26   |     22     42     82     56     20      9      5 |   236
   row% |    9.3   17.8   34.7   23.7    8.5    3.8    2.1 | 100.0
  26-30 |     23     55     99     81     26     21      1 |   306
   row% |    7.5   18.0   32.4   26.5    8.5    6.9    0.3 | 100.0
  31-35 |     37    100    147    113     40     28     14 |   479
   row% |    7.7   20.9   30.7   23.6    8.4    5.8    2.9 | 100.0
  36-40 |     35    108    148    137     44     52     16 |   540
   row% |    6.5   20.0   27.4   25.4    8.1    9.6    3.0 | 100.0
  41+   |     43     92    160    213    109     96     68 |   781
   row% |    5.5   11.8   20.5   27.3   14.0   12.3    8.7 | 100.0
 -------+--------------------------------------------------+-------
  TOTAL |    160    397    636    600    239    206    104 |  2342
   row% |    6.8   17.0   27.2   25.6   10.2    8.8    4.4 | 100.0

γ = 0.20, p < 0.0005


The category collapsed table

                      Intelligence
  Income              |  ≤35 %   36-40 %   41+ %
 ---------------------+--------------------------
  -199.000            |  59.4     53.9     37.8
  200.000 - 249.000   |  24.5     25.4     27.3
  250.000 - 499.000   |  14.1     17.8     26.2
  500.000+            |   2.0      3.0      8.7
 ---------------------+--------------------------
  n                   |  1021      540      781


The graphical model underlying category collapsibility

Income ⊥⊥ Intelligence | Collapsed income
Income ⊥⊥ Intelligence | Collapsed intelligence


Elaboration, explanation and specification
A paradigm for quantitative sociological research

Lazarsfeld, PF (1946): Interpretation of Statistical Relation as a Research Operation
Lazarsfeld, PF & Kendall, PL (1950): Problems of Survey Research
Lazarsfeld, PF & Rosenberg, M (eds) (1955): The Language of Social Research
Rosenberg, M (1962): Test Factor Standardization as a Method of Interpretation
Davis, JA (1967): A partial coefficient for Goodman and Kruskal's Gamma
Rosenberg, M (1968): The Logic of Survey Analysis
Davis, JA (1975): Analyzing contingency tables with linear flow graphs
Davis, JA (1980): Contingency table analysis: proportions and flow graphs
Davis, JA (1984): Extending Rosenberg's Technique for Standardizing Percentage Tables


Elaboration, explanation and specification

Elaboration - Analysis of conditional association.
Explanation - Z explains the association between X and Y if X ⊥⊥ Y | Z.
Specification - Description of conditional associations that cannot be explained by other relevant variables. Is the strength of the X-Y association constant across levels of Z?


Graphical models and elaboration

Graphical models are models defined by explanations obtained during attempts to elaborate associations in a multivariate set of variables.
Graphical models provide the solution to the problem that killed the elaboration paradigm:
The problem - which variables are needed in order to explain or specify the association between two variables?
The solution - is given by the global Markov properties and decompositions of the Markov graph.


Describing associations in graphical models

Both models collapse onto the same marginal table for analysis of the AD association.
Decomposition implies parametric and inference collapsibility.
Separation implies parametric collapsibility.
The loglinear model ACD: specification requires analysis.
The loglinear model AD,AC,CD: specification by the model.


The graphical model

The coefficients on edges are partial γ coefficients.


Specification of the Intelligence-Income association

The model collapses onto the 5-dimensional table containing Income, Education, School, Intelligence and Sex with respect to the Income-Intelligence association.
Partial γ = 0.08 - a very weak positive association.
Can nothing more be said about the association between Intelligence and Income within the inference frame defined by graphical models?
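
As a reference point for the partial coefficient of 0.08, the marginal coefficient of 0.20 reported under the intelligence by income table can be recomputed from the printed counts. The sketch below assumes that the coefficient is Goodman and Kruskal's gamma, (C - D) / (C + D), with C and D the numbers of concordant and discordant pairs. The function and variable names are illustrative and do not come from the slides.

```python
# Goodman and Kruskal's gamma for a two-way table of ordinal variables.
# Rows: intelligence (<26, 26-30, 31-35, 36-40, 41+); columns: the seven
# income categories, in the order printed in the table.
INTELLIGENCE_BY_INCOME = [
    [22, 42, 82, 56, 20, 9, 5],
    [23, 55, 99, 81, 26, 21, 1],
    [37, 100, 147, 113, 40, 28, 14],
    [35, 108, 148, 137, 44, 52, 16],
    [43, 92, 160, 213, 109, 96, 68],
]


def goodman_kruskal_gamma(table):
    """Return (C - D) / (C + D) for a table of counts with ordered categories."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair cell (i, j) with every cell in a strictly lower row.
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:      # higher on both variables: concordant
                        concordant += table[i][j] * table[k][l]
                    elif l < j:    # higher on one, lower on the other: discordant
                        discordant += table[i][j] * table[k][l]
    return (concordant - discordant) / (concordant + discordant)


marginal_gamma = goodman_kruskal_gamma(INTELLIGENCE_BY_INCOME)
print(round(marginal_gamma, 3))
```

On these counts the value comes out just under 0.20, in line with the reported figure. A partial gamma in the sense of Davis (1967) would pool the concordant and discordant counts over the strata of the conditioning variables before forming the ratio; that needs the 5-way table, which the slides do not print.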


Loglinear modelling on top of the graphical model

The graphical model is loglinear, but higher order interactions are always included. The graphical model therefore assumes that the Income-Intelligence association is modified by Education, School and Sex.
If we, during specification of associations, conclude or assume that the association is constant across levels defined by other variables, then the model has to be replaced by a loglinear model.
Collapsibility properties of graphical models guarantee that all parameters relating to the Intelligence-Income association are included in the marginal table including Education, School and Sex. Specification of the Intelligence-Income association therefore only requires analysis of the 5-way table with these variables.


The marginal loglinear model

The marginal model is saturated.
The Income-Educ-School and Educ-School-Intel-Sex interactions are fixed in the model. Attempts to examine these interactions in the 5-way table will tell us nothing about the associations between these variables in the full model.
The results of analyses of the other interaction parameters in the 5-way table also apply to the full model.
At the end of the day: no evidence against the loglinear model with the interactions Income-Intelligence, Income-Educ-Sex, Income-School-Sex, Income-Educ-School, Educ-School-Intel-Sex.
The Income-Intelligence association is constant over all levels of the other variables of the model. (Observed partial γ = 0.079; fitted partial γ = 0.050 under the loglinear model.)


The estimation problem

The properties of estimates are well known under the model.
The model itself, being the result of a model search procedure, is however in itself an estimate.
Very little is known about the properties of both
1) the estimates of the model
2) the estimates of unknown parameters based on the model estimates
Non-parametric bootstrapping is one way to examine these properties.


The model search procedure in this example
(much better strategies are available, but are not discussed here)

1) Initial screening (Kreiner, 1986) of 2- and 3-way tables, defining a starting point for a proper model search procedure.
2) Stepwise naïve p-value driven search (backwards and forwards). P-values are Monte Carlo estimates (Kreiner, 1987) based on 400 random tables for each hypothesis. Significance is evaluated at a 1 % critical level.
The screening will, apart from statistical errors, identify the parts of the models defined as strings and trees defined by cliques sharing only one edge.
Naïve p-value driven model search procedures are not consistent. Type II errors may disappear, but type I errors (spurious edges) will continue to turn up as n increases.


How reliable is the estimate of the association between Intelligence and Income?

G = the "true" graphical model.
The partial $\gamma$ is estimated relative to G.
If G is known then $\hat{\gamma}(G)$ has nice asymptotic properties:
If Intelligence and Income are connected in G then $P(\hat{\gamma}(G) \mid G) \approx \mathrm{Norm}(\gamma, \sigma^2)$.
If Intelligence and Income are disconnected in G then $\hat{\gamma}(G) = 0$.
G is not known. $\gamma$ must therefore be estimated relative to the estimate of the model, $\hat{G}$.
Let $\tilde{\gamma} = \hat{\gamma}(\hat{G})$ be the estimate under $\hat{G}$.
The distribution of $\tilde{\gamma} = \hat{\gamma}(\hat{G})$ is not known and (probably) not nice.
The properties of $\hat{G}$ and $\tilde{\gamma} = \hat{\gamma}(\hat{G})$ can be examined by naïve nonparametric bootstrapping.
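
The resampling step of a naïve nonparametric bootstrap can be sketched on the marginal intelligence by income table. This is a minimal sketch under stated assumptions: each bootstrap sample draws n = 2342 observations with replacement from the observed cells, and the statistic is the marginal gamma. The model search that the slides rerun on every bootstrap sample is not reproduced, so the spread below describes a gamma computed in a fixed table, not the estimate obtained under a re-estimated model. All names, the seed and the 200 replicates are illustrative choices.

```python
import random
import statistics

# Intelligence (rows) by income (columns) counts from the first table.
OBSERVED = [
    [22, 42, 82, 56, 20, 9, 5],
    [23, 55, 99, 81, 26, 21, 1],
    [37, 100, 147, 113, 40, 28, 14],
    [35, 108, 148, 137, 44, 52, 16],
    [43, 92, 160, 213, 109, 96, 68],
]


def goodman_kruskal_gamma(table):
    """(C - D) / (C + D) from concordant and discordant pair counts."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:
                        concordant += table[i][j] * table[k][l]
                    elif l < j:
                        discordant += table[i][j] * table[k][l]
    return (concordant - discordant) / (concordant + discordant)


def bootstrap_table(table, rng):
    """Resample n observations with replacement from the observed cells."""
    n_cols = len(table[0])
    cells = [(i, j) for i in range(len(table)) for j in range(n_cols)]
    weights = [table[i][j] for i, j in cells]
    resampled = [[0] * n_cols for _ in table]
    for i, j in rng.choices(cells, weights=weights, k=sum(weights)):
        resampled[i][j] += 1
    return resampled


def bootstrap_gammas(table, n_boot=200, seed=1968):
    """Gamma in each bootstrap table (model search step deliberately omitted)."""
    rng = random.Random(seed)
    return [goodman_kruskal_gamma(bootstrap_table(table, rng))
            for _ in range(n_boot)]


gammas = bootstrap_gammas(OBSERVED)
print(statistics.mean(gammas), statistics.stdev(gammas))
```

To approximate the procedure described in the slides, the body of `bootstrap_gammas` would have to run the screening and stepwise search on each resampled data set and then estimate the partial gamma in the table that the selected model collapses onto.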


How stable is the model estimate?

[Figure: the graphical model, with line styles showing bootstrap confidence in each edge]
Solid lines: 95+ % confidence
Normal lines: 80-94.9 % confidence
Dashed lines: 20-79.9 % confidence
Mean edge entropy = 0.317 (0.904/0.096 distribution)
Mean number of departures from data model = 6.5 (18.0 %)


Estimating partial coefficients

The model found in the original data collapses onto a table with Income, Intelligence, Education, School and Sex for estimation of the partial coefficient.
21.8 % of the bootstrapped models collapse onto the same table, even though none of the bootstrapped models is equal to the data model.


Bootstrap distribution of collapsibility properties

  Included variables   Frequency   Percent   Valid Percent   Cumulative Percent
  FGM                     110        22.0        22.0               22.0
  BCFGLM                   86        17.2        17.2               39.1
  FGLM                     65        13.0        13.0               52.1
  BCFGM                    52        10.4        10.4               62.5
  BCFGKM                   28         5.6         5.6               68.1
  BCFGKLM                  24         4.8         4.8               72.9
  CFGLM                    22         4.4         4.4               77.2
  BFGLM                    19         3.8         3.8               81.0
  BFGM                     17         3.4         3.4               84.4
  FLM                      17         3.4         3.4               87.8
  BCFLM                    11         2.2         2.2               90.0
  FGKM                      9         1.8         1.8               91.8
  FKM                       9         1.8         1.8               93.6
  BCFKLM                    6         1.2         1.2               94.8
  CFGM                      4         0.8         0.8               95.6
  CFKLM                     4         0.8         0.8               96.4
  BFLM                      3         0.6         0.6               97.0
  CFGKM                     3         0.6         0.6               97.6
  BFGKM                     2         0.4         0.4               98.0
  CFLM                      2         0.4         0.4               98.4
  FGKLM                     2         0.4         0.4               98.8
  FKLM                      2         0.4         0.4               99.2
  FM                        2         0.4         0.4               99.6
  BFGKLM                    1         0.2         0.2               99.8
  CFGKLM                    1         0.2         0.2              100.0
  Total                   501       100.0       100.0


Partial coefficient estimated in tables defined by decomposition

[Histogram of gamma, reduchyp: decomposition. Mean = 0.071078, Std. Dev. = 0.0403192, N = 501]


Estimated partial coefficients under different collapsibility assumptions

[Figure: means of partial gamma coefficients by included variables; vertical axis ticks from 0.0500 to 0.1250, with 0.0878 marked]


γ may sometimes be estimated in collapsed tables defined both by separation and by decomposition.
Collapsed tables defined by separation are smaller and the loglinear parameters of interest are the same, but separation does not guarantee collapsibility of partial γ coefficients. These may therefore be systematically different from those estimated in collapsed tables defined by decomposition.
Are there any apparent differences when we compare the bootstrap estimates?


The association between coefficients estimated under different collapsibility conditions

[Scatter plot, reduchyp: decomposition. Vertical axis: gamma estimated under separation; horizontal axis: gamma. R Sq Linear = 2.124E-4]
Uncorrelated estimates. Estimates from tables defined by separation are unbiased.


No significant difference between estimates in tables defined by separation and decomposition

  Table defined by    Mean        Std. Deviation
  separation          0.074336    0.0372415
  decomposition       0.071078    0.0403192
  Total               0.072522    0.0389971


The distribution of the gamma coefficients when edges are present in the model.
Mean = 0.10, s.d. = 0.0216

[Histogram of DI. Mean = 0.1015, Std. Dev. = 0.0216, N = 243]


The distribution of gamma coefficients including zeros implied by missing edges.
Mean = 0.049, s.d. = 0.053

[Histogram of DI. Mean = 0.0492, Std. Dev. = 0.05296, N = 501]
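
The two histograms are linked by simple mixture arithmetic: the edge is present in 243 of the 501 models, and the remaining 258 models contribute a gamma of exactly zero. The sketch below, with an illustrative function name, recovers the summary statistics of the second histogram from those of the first. It assumes population (divide by N) variances, so a small gap to the SPSS standard deviation is expected.

```python
import math


def mixture_with_zeros(mean_present, sd_present, n_present, n_total):
    """Mean and s.d. when (n_total - n_present) exact zeros are added."""
    p = n_present / n_total                      # share of models with the edge
    mean_all = p * mean_present                  # the zeros contribute nothing
    second_moment = p * (sd_present ** 2 + mean_present ** 2)
    sd_all = math.sqrt(second_moment - mean_all ** 2)
    return mean_all, sd_all


# Values read from the histogram of models in which the edge is present.
mean_all, sd_all = mixture_with_zeros(0.1015, 0.0216, 243, 501)
print(round(mean_all, 4), round(sd_all, 4))
```

Both values agree with the reported 0.0492 and 0.05296 to about three decimal places. The larger spread of the second distribution therefore comes almost entirely from the edge being absent in roughly half of the bootstrapped models, not from variation in the estimate when the edge is present.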