Bayesian Causalities, Mappings, and Phylogenies: A Social Science Gateway for Modeling Ethnographic, Archaeological, Historical Ecological and Biological Variables CS-DC’15 Panel on Synthesis of Ecological, Biological and Ethnographic Data 9-13 Paul Rodriguez (SDSC), Eric Blau, Stuart Martin, Lukasz Lacinski, Thomas Uram (Argonne), Feng Ren (Xiamen), Wesley Roberts (Carnegie), Tolga Oztan, Douglas White (UCI) Abstract 186BayesianCausalities_8-30.docx Extending the innovative “Def Wy” procedures for modeling evolutionary network effects (Dow, 2007; Dow and Eff’s, 2009a, 2009b), a Complex Social Science http://intersci.ss.uci.edu (CoSSci) Gateway was developed to provide complex analyses of ethnographic, archaeological, historical, ecological, and biological datasets with easy open access. Analysis begins with dependent variable y with n observations and X independent and other variables, and imputes missing data for all variates. Several (n x n) W* matrices measure evolutionary network effects such as diffusion or phylogenetic ancestries. W* is row-normalized to sum to 1 and combined to obtain a W, multiplied by X as WX, and allowing X and y multiplication by W: Wy measures the evolutionary autocorrelation portion of y discounting evolutionary effects of propinquity and phylogenetics. Tested for exogeneity (error terms uncorrelated with Wy or independent variables) the two-stage Ordinary Least Squares (OLS) results include measures of independent variable and deep evolutionary autocorrelation predictors. We show how these methods apply to a wide variety of problems in the social sciences to which ecological and biological variables will apply once contributed. Introduction. We use DEf Wy procedures to begin to integrate analysis of ethnographic, archaeological, ecological, historical empires and biological variables into complex system analyses of human evolutionary (deep structure and processes) and causal analysis of networks of variables at a given set or series of time periods (surface structure). At this time our datasets of societal samples and variables relate to specific places and times, and the focus of these variables has been largely sociocultural but it is imperative – given the role of biology in evolutionary processes – that our data are collated to variables (and models) from bioecological sciences pinpointed or encompassing where possible the spatiotemporal sites of our datasets. This is an ambitious undertaking but one successful example of this is a result of Botero et al. (2014) who predict the High Moral God variable from multiplicative relationships between bioecological stability and abundance variables (Map Figures 1 and 2) when added to the Ethnographic Atlas dataset of 1265 societies (White 2015). The idea that multiplying features of two subcontinents can get predictive results is dubious.1 Figure 1: Ethno Atlas Stability Biovariable Figure 2: Ethno Atlas Abundance Biovariable 1 Specifically, when the lower-values (small yellow datapoints) of the bioecological variables are multiplied the resultant variable predicts “Moral High God” religions such as Christianity (Map 1) and Islam (Map 2) which together constitute the majority of “Moral High God” religions, as shown by White (2015). The deficiency of the Botero et al. model is that their results are invalidated by endogeny due to failure to apply appropriate tests of whether their predictors of the deep evolutionary background effects of language phylogeny and spatial diffusion alongside predictors of specific independent variables are correlated with the error terms of their regression. Methods: When we use DEf Wy R software implementation with appropriate diagnostics (Dow, personal communication), we are able to solve diagnostics for exogeneity by use of two-stage OLS to separate the evolutionary network effects from independent variable effects. By producing Wy variables explicitly, as well as imputing missing values, their method enables further analysis and comparisons, such as using a Bayesian Network to examine the network of variables. Dow and Eff (2013:218-219) provide a succinct description of their method: Dow (2007) recently proposed that Galton’s Problem be formulated as a network autocorrelation effects regression model, where the autocorrelated dependent variable is included as an endogenous predictor variable. It is straightforward to extend the usual regression model to include an additional variable that incorporates the effects of trait transmission on the distribution of percentage of married females in monogamous marriages, the dependent variable (y) in our proposed model: y = α + λWy + Xβ+ ε (1) where ε is a n x 1 vector of normally distributed error terms with zero mean, X is a n x k matrix of exogenous variables, β is a k x 1 vector of regression coefficients, α is the intercept, and λ is the scalar network autocorrelation effect coefficient. The first independent variable on the right of the equals sign is the product of the square n x n trait transmission matrix W and the n x 1 vector of scores on the dependent y variable. OLS regression requires that all of the independent variables be uncorrelated with the errors, ε, otherwise all of the estimated regression coefficients will be biased. Two-stage least squares (2SLS) estimation procedures are commonly used to deal with endogenous predictor variables. The first step in the 2SLS estimation of equation (1) is to generate an estimate of Wy that is independent of the ε term. This can be done by regressing Wy on one or more “instrumental” variables, which are independent variables that predict Wy but are uncorrelated with ε. Kelejian and Prucha (1998) suggest WX as a suitable set of instrumental variables,2 so that the following equation is estimated at stage 1 using OLS: Wy = a + WXc + e (2) Which allows computation of ŷ, an exogenous predictor for Wy: ŷ = â+WXĉ (3) The vector of predicted scores is then entered into equation (1) and the following stage 2 equation is estimated, again using OLS regression: y=α+λŷ +Xβ+ε (4) The estimates from this second stage model permit valid inferences about the effect of trait diffusion net of the functional associations (assessed by the estimated λ and its associated significance level), and the functional associations net of diffusion (the estimated β coefficients and their significance levels). It is straightforward to extend the 2SLS approach to handle multiple W matrices simultaneously. However, network matrices representing commonly posited diffusion processes can be highly correlated with each other in any particular study, thus the use of multiple matrices may result in problems of multicollinearity and disentangling their 2 separate contributions may be difficult. Dow and Eff (2009a) suggest one approach to handling this problem: combine multiple matrices into a single W matrix. Results: Table 1 gives a result of DEf Wy applied to a “Stages of Religious Evolution” (SRE) variable v2013 (Sanderson and Roberts 2008), using the 186 society Standard CrossCultural Sample (Murdock and White 1969) where the number of variables contributed by many other researchers has grown to 2109. Table 1: CoSSci DEf01 Wy Result for an SCCS model with a Box-Cox power coefficient for the Dependent Variable v2013, Evolution of God (Sanderson and Roberts 2008), where lambda (λ) is the Box-Cox coefficient of predictability. More recent analyses lead to inclusion of more independent variables with exogeneity in Diagnostics and higher R2=0.72. The analysis employs a Box-Cox dependent variable with coefficient y(λ)=(yλ-1)/λ=9.656 that calculates an optimal power threshold for the model (R2=0.5478, modified to R2=0.72 with the additional variables. Group significance tests may be used to select among these candidate variables those that pass the Holm-Bonferroni (H-B) test using the OLS derived p-values. We then further analyzed the network of variables using Bayesian networks learning procedures with bootstrapping to produce the analysis shown in Figure 3, which includes the SRE v2013 dependent and independent variables and a set of candidate model variables. In this step, all variables are analyzed for potential inter-relations using Bayesian networks with structure learning methods as implemented in the R library(bnlearn). Structure learning finds likely edges by combining a search for likely conditional dependencies and global network scores (Scutari and Denis, 2014). Choosing edges are made more robust by taking random bootstrap samples of the data, learning a network for each sample, and then taking statistics over the samples of networks. HPC (High Performance Computing) resources at San Diego Supercomputer Center was used to enable a very large number bootstrap samples. 3 Figure 3: Edge selection over bootstrapped samples of Bayes network shows separate graph clusters after thresholding. Variables are: v2013 EvoGod (the dependent variable of the DEf01 Wy model) | Superjurisdictional HierarchyXWriting=SuperjhWriting | v149 Writing | v155 Money | Wy=Evolutionary effects of DEf01 Wy diffusion and language phylogeny | v232 Intensity of Cultivation | v207 Dependence on Agriculture | AnimXbwealth | AnimXbwealthLoRain2003. Some of the latter variables are connected to the EvoGod DEf01 Table 1 Wy variable but not intrinsically. This graph includes dependent and independent variables in Table 1, and additional set of “unrestricted variables” (to be explored). Additional clusters show a set of // climate variables bio.1 …bio.11 | v2005 Water | 2003 Rain | No_rain_Dry // and two pairs of variables: // DepOnHuntGath | v203 Dependence on Gathering // 2125 Wage labor | FxCmntyWages //. Wy is correlated with the EvoGod variable as shown in the Figure, which is also a feature of Table 1. It is worth noting that the Wy variables from two-stage OLS provides a kind of latent variable for the Bayesian network structure learning. This helps alleviate a potential concern with hidden causes among variables in Bayesian networks. One of the deficiencies of comparing DEf01 Wy results with Bayesian structure learning is that some variables need to be discretized for Bayesian learning of conditional probability tables. The Bnlearn package contains discretization functions to best transform variables to maintain correlation between independent and dependent variables. Discussion - Deep (DEf01 Wy) and Surface (Bnlearn) Causality: DEf01 Wy regression often renders, in two stages, a deep model of causality where past history is shown to have affected, as in Table 1: diffusion, and linguistic ancestry, characteristic of deep evolutionary effects on human societies contrasted with surface causalities as in Figure 3. This approach is potentially applicable to all analyses where indicators of historical or evolutionary processes are significant. Endogeneity and Exogeneity: Generic and Specific 4 Endogeneity “includes omitted variables, omitted selection, simultaneity, commonmethod variance, and measurement errors – renders estimates causally uninterpretable” (Antonakis et al. 2010:1086). This includes regression results in which the independent variables are correlated with residual errors. DEf01 Wy results are potentially replete with measures of significant endogeneity that destroy attempts to obtain results that represent potential causality, as noted in Antonakis et al. (2010) for, e.g., the Ramsey RESET test (p18), Hausman test (p19), Breusch-Pagan Heteroskedascity (p. 28), shown as Diagnostics in Table 1 but that also include the Shapiro-Wilks, and Breusch–Godfrey autocorrelation LM tests for distance, language and ecology. Exogeneity, where dependent and independent variables are uncorrelated with regression errors, is intrinsic to validity in Bayesian network learning, although omitted variables, omitted selection, simultaneity, common-method variance, and measurement errors may still produce endogeneity. If properly specified on all these dimensions, library(bnlearn) results, recently enhanced in numerous publications for implementation (Højsgaard, Edwards, Lauritzen 2012, Scutari and Nagarajan 2013, and Nagarajan, Scutari and Lèbre 2013), provide determinant results that we call surface causality, as above: causality that omits autocorrelation. These are distinct from the effects of common histories of the units in a sample, or what has been called Galton’s problem in the social sciences. The only other alternative approaches to causal estimates in observational studies that provide results comparable to experiments are those of Cook, Stadish and Wong (2008), Stadish and Cook (2009), and Thistlethwaite and Campbell (1960), but these approaches (see Antonakis et al. 2010), like controlled experimental studies that don’t measure deep histories as does DEf01 Wy, measure surface rather than deep causalities. Summary The ongoing goal of this project has been articulated as possibilities of collaboration between bioecological scientists such as Botero et al. (2014), and exemplified by James Brown’s (1995) innovation in developing an early framework in macroecology that has been continued by many others, including Lomolino, Brown, Whittaker, and Riddle (2010), and Harcourt (2012), a member of our panel, and other members of our panel. Acknowledgements: Douglas R. White thanks the Santa Fe Institute for hosting multiple one to two week Causality Working Groups engaging Tolga Oztan, Peter Turchin, Amber Johnson and many others on this topic in 2010-2014 and to Jü rgen Jost and the MPI for Mathematics in the Sciences for hosting of our working group in June 2011. We thank Anthon Eff for his immense work in building the R code prior to and as used in CoSSci. References Antonakis, John. Samuel Bendahan, Philippe Jacquart, Rafael Lalive. 2010. On making causal claims: A review and recommendations. The Leadership Quarterly 21(6). 1086-1120. Antonakis, John. Samuel Bendahan, Philippe Jacquart, Rafael Lalive. 2014. Causality and endogeneity: Problems and solutions. In D.V. Day (Ed.), The Oxford Handbook of Leadership and Organizations pp. 93-117. New York: Oxford University Press. Botero, Carlos, Beth Gardner, Kathryn Kirby, Joseph Bulbulia, Michael Gavin, Russell D. Gray. 2014. The ecology of religious beliefs. Proceedings National Academy of Sciences, Vol 111, No. 47, 16784-89. Brown, James H. 1995. Macroecology. University of Chicago Press. 5 Cook, T. D., Shadish, W. R., and Wong, V. C. 2008. Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from withinstudy comparisons. Journal of Policy Analysis and Management, 27(4), 724-750. Dow, M. M. (2007). Galton’s Problem as multiple network autocorrelation effects. CrossCultural Research, 41, 336-363. Dow, Malcolm M, and E. Anthon Eff. 2009a. Cultural Trait Transmission and Missing Data as Sources of Bias in Cross-Cultural Survey Research: Explanations of Polygyny Re-examined. Cross-Cultural Research 43(2):134-151. Dow, Malcolm M, and E. Anthon Eff. 2009b. Multiple Imputation of Missing Data in CrossCultural Samples. Cross-Cultural Research 43(3):206-229. Dow, Malcolm, and Anthon Eff. 2013. Determinants of monogamy. Journal of Social, Evolutionary, and Cultural Psychology 7(3):211-238. Højsgaard, Søren, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. Harcourt, Alexander. 2012. Human Biogeography. Chicago: University of Chicago Press. Holm, Sture. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6(2): 65–70. Kelejian, Harry H., Prucha, Ingmar R. 1998 A Generalized Spatial Two-Stage Least Squares Procedure for Estimating a Spatial Autoregressive Model with Autoregressive Disturbances. Journal of Real Estate Finance and Economics, Vol. 17:1, 99-121. Lomolino, Mark V., James H. Brown, Robert Whittaker, Brett R. Riddle. 2010. Biogeography 4th edition. Sunderland, Mass. Sinauer Associates, Inc. Murdock, George P. 1967. Ethnographic Atlas. Pittsburgh: Pittsburgh University Press. Murdock, George P., and Douglas R. White. 1969. Standard Cross-Cultural Sample. Ethnology 8(4):329-369. Nagarajan, R., M. Scutari and S. Lèbre. 2013. Bayesian Networks in R with Applications in Systems Biology. Use R!, Vol. 48, Springer (US). Sanderson, Stephen K., Wesley W. Roberts. 2008. The Evolutionary Forms of Religious Life: A Cross-Cultural, Quantitative Analysis. American Anthropologist 110(4): 454-466. Scutari, Marco, Radhakrishnan Nagarajan. 2013. On Identifying Significant Edges in Graphical Models of Molecular Networks. Artificial Intelligence in Medicine 57(3):207-217. http://www.aiimjournal.com/article/S0933-3657(12)00154-6/abstract?cc=y= Scutari, Marco, and J.-B. Denis. 2014. Bayesian Networks, with examples in R. Texts in Statistical Science, Chapman & Hall/CRC (US). 6 Shadish, W. R., and Cook, T. D. 2009. The renaissance of field experimentation in evaluating interventions. Annual Review of Psychology 60, 607−629. Thistlethwaite, D. L., and Campbell, D. T. 1960. Regression-discontinuity analysis: An alternative to the ex post facto experiment. Journal of Educational Psychology 51(6), 309– 317. White, Douglas R. 2015. Oscillatory Complexity in Human History: Earth’s asymmetric biogeography and ethnographic data. Conference Paper CS-DC’15, Session on “Synthesis of Ecological, Biological and Ethnographic Data.” ASU, Arizona. This was a feature of a 2014 publication of the PNAS predicting moral god religion using the data of the Ethnographic Atlas (Murdock 1967). The direction of this article is to explore the use of Bayesian learning graphs, in an early stage of analysis, using the more extensive dataset of the Standard Cross-Cultural Sample (Murdock and White 1967). Our variables SuperjhWriting and AnimXbwealth substitute for the PNAS variables; in doing so they do not multiply but rather distinguish a dominant ecological region of pastoralism, associated with Islam as a major Moral God religion from the large states and empires present in both the Islamic and the Christian regions of West Eurasia and North Africa. 2 The Wy variable is thus intended to be endogenous by definition. 1 7