Introduction to and Overview of DEF2
An R software package for cross-cultural research

E. Anthon Eff, Malcolm M. Dow, Wes Routon
Anthropological Sciences Conference, Albuquerque, March 18, 2014

The two major problems with cross-cultural data analysis addressed by DEF2 are:

Missing Data
All of the major cross-cultural data sets have substantial missing data. Single imputation methods (mean substitution, regression-predicted scores, hot deck, etc.) yield coefficient variance estimates that are biased downward. Data editing procedures, e.g. listwise deletion, generally result in small samples (loss of power) and also require very strong assumptions about why data are missing. These assumptions are very unlikely to hold. Single imputation methods are no longer recommended. DEF2 employs the Multiple Imputation by Chained Equations (mice) approach to handling missing data.

Non-Independence of Sample Units
Sample cases in cross-cultural and cross-national data are frequently not independent of one another, owing to various inter-societal network processes: cultural trait borrowing, conquest, emulation, inheritance from ancestral populations, etc. This is the classic Galton’s Problem in anthropology, understood more generally as the problem of cultural trait transmission. DEF2 addresses this issue by incorporating networks of relations into regression models and employing instrumental variables procedures to generate consistent and relatively efficient estimates.

Step 1 of the DEF2 Approach to Multiple Imputation of Missing Data: finding auxiliary variables
The mice procedure imputes values for missing observations on the variables specified in the structural regression model of interest, using these variables themselves plus a set of auxiliary variables. Ideal auxiliary variables are usually a subset of those with no missing values in the full data set.
Auxiliary variables must be correlated with the variables in the structural regression model that have missing values, since the imputation procedure is designed to “borrow” information from them to help impute the missing values. DEF2 will employ auxiliary variables provided by the user. Alternatively, DEF2 will identify suitable auxiliary variables as follows:

1. Identify all categorical, ordinal, and interval variables with no missing values in the complete data set.
2. Identify the variables that one wants to impute and, one at a time, treat each as a dependent variable:
   i) regress the dependent variable (using binary/ordinal logit, multinomial logit, or OLS) on the covariate that provides the highest correlation, and save the residual
   ii) add to the regression model the covariate that correlates highest with the residual, and calculate the new residual
   iii) repeat the above steps 8 times (or more)
   iv) calculate the relative importance of the predictors, drop variables that fall below a given threshold, and recalculate the residual
   v) repeat steps ii to iv.

Step 2: Create m complete data sets
• The mice procedure is repeated m times to create m copies of the data set, each containing a different set of imputed values.
• Since each data set is now complete, each can be analyzed using any of the usual statistical models that require complete data.
• m = 10 to 100 is currently suggested, depending on sample size and the amount of missing data.

Step 3: Analyzing the data and pooling the results: Rubin’s Rules
Separate analyses of the m multiply imputed samples generate m estimates of any statistic of interest. In the general case, for any statistic Q, an analysis of the m data sets yields estimates $\hat{Q}^{(j)}$ and $U^{(j)}$ of the statistic and its variance for the jth data set (j = 1, 2, …, m). The multiple imputation point estimate of each parameter Q is simply the mean of the m estimates:

$$\bar{Q} = \frac{1}{m} \sum_{j=1}^{m} \hat{Q}^{(j)}$$
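Rubin's pooling rules for Step 3 can be sketched in a few lines of Python. This is only an illustration, not DEF2's own code: the function name `pool_rubin` and the five coefficient estimates and variances are hypothetical. The quantities W, B, T and df are the within-imputation variance, between-imputation variance, total variance and degrees of freedom that Rubin's rules use for the variance of the pooled estimate.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)        # pooled point estimate (mean of the m estimates)
    w = mean(variances)            # within-imputation variance W
    b = variance(estimates)        # between-imputation variance B (divisor m - 1)
    t = w + (m + 1) / m * b        # total variance T = W + ((m + 1) / m) * B
    # Degrees of freedom of the t reference distribution; infinite when B = 0.
    df = (m - 1) * (1 + m * w / ((m + 1) * b)) ** 2 if b > 0 else math.inf
    return {"estimate": q_bar, "within": w, "between": b, "total": t, "df": df}

# Hypothetical estimates and squared standard errors from m = 5 imputed data sets.
q_hat = [0.30, 0.40, 0.35, 0.45, 0.50]
u_hat = [0.010, 0.012, 0.011, 0.009, 0.008]

pooled = pool_rubin(q_hat, u_hat)
print(pooled)
```

With these made-up inputs the pooled estimate is 0.40 and the total variance is about 0.0175, noticeably larger than the average within-imputation variance of 0.010. The difference is the uncertainty due to the missing data, which single imputation ignores.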
To calculate the variance of this estimate, both the m estimated variances $U^{(j)}$ and the variance of the $\hat{Q}^{(j)}$ across the m estimations must be combined. First, the mean of the m estimated variances for each parameter is obtained as a simple average:

$$W = \frac{1}{m} \sum_{j=1}^{m} U^{(j)}$$

This quantity is known as the within-imputation variance.

Next, the variance in the m estimated values $\hat{Q}^{(j)}$ is calculated as:

$$B = \frac{1}{m-1} \sum_{j=1}^{m} \left( \hat{Q}^{(j)} - \bar{Q} \right)^{2}$$

This quantity is known as the between-imputation variance. These two variances are then combined to get the total variance in the combined estimate of Q:

$$T = W + \frac{m+1}{m} B$$

Rubin (1987: 79) shows that $(\bar{Q} - Q)/\sqrt{T}$ is approximately distributed as a t-distribution, where the degrees of freedom, df, is given by

$$df = (m-1) \left[ 1 + \frac{mW}{(m+1)B} \right]^{2}$$

Rubin’s pooling procedures can be applied to any statistic generated by the statistical method employed to analyze the m imputed data sets.

Galton’s Problem
Incorporating inter-societal networks into network autocorrelation effects regression models

What processes might be inducing non-independence?
• Spatial diffusion: societies in close proximity have more opportunity to emulate, conform to, adopt, or borrow their neighbors’ behaviors, beliefs, customs, rituals, … (horizontal diffusion).
• Language similarity: similarity due to populations splitting off from the same ancestral population (vertical diffusion).
• Religion: marriage practices spread worldwide through the colonization of large swaths of the world by European Christian nations.
• Equivalence: units “similarly situated” in a network and not necessarily proximate.
E.g., economic similarity, core/periphery position in the world system, colonial status, ecological setting, …

Assessing non-independence: Tobler’s First Law of Geography
“Everything is related to everything else, but near things are more related than distant things.”

This “law” suggests that the scores on variable y for the ith society should be similar to the scores of those societies with which it has the closest relationships. Call these societies i’s “neighborhood set.” If so, $y_i$ should be similar to the weighted average of the y scores for i’s neighborhood set, where the weights indicate relative closeness. If the N scores on y are significantly correlated with the N weighted average scores, conclude that the y variable is auto(self)-correlated.

Weighting sample units
First, construct an N×N connectivity matrix C of pairwise relatedness scores among sample units, and then row-normalize C to unity to get the required weights matrix W. That is, $w_{ij} = c_{ij} / \sum_{j} c_{ij}$.

Raw connectivity matrix C, weights matrix W, variable y, and network lag Wy:

C                  W                                   y    Wy
0 1 1 1 0 0 0      0    1/3  1/3  1/3  0    0    0     6    7
1 0 0 1 0 0 0      1/2  0    0    1/2  0    0    0     5    7
1 0 0 1 0 0 0      1/2  0    0    1/2  0    0    0     8    7
0 1 1 0 1 0 0      0    1/3  1/3  0    1/3  0    0     8    5.3
0 0 0 1 0 1 1      0    0    0    1/3  0    1/3  1/3   3    3.3
0 0 0 0 1 0 1      0    0    0    0    1/2  0    1/2   1    2
0 0 0 0 1 1 0      0    0    0    0    1/2  1/2  0     1    2

(If a variable y is premultiplied by W, i.e. Wy, the product is an N×1 vector of weighted averages that are on the same scale as y.)

Incorporating autocorrelated variables into multiple regression
Cross-cultural researchers are usually interested in testing whether hypothesized predictor variables are acting on a dependent variable, as well as in what processes are inducing autocorrelation in it. The Network Autocorrelation Regression Effects Models in DEF2 do just that.
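The row-normalization and the network lag Wy from the seven-unit example can be reproduced with a short Python sketch. The matrix C and the vector y are taken from the slide. The helper names `row_normalize`, `network_lag` and `pearson` are illustrative and are not DEF2 functions. The final correlation is descriptive only: no significance test is attempted here.

```python
import math

# Raw connectivity matrix C from the slide (1 = related, 0 = not related).
C = [
    [0, 1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 0],
]
y = [6, 5, 8, 8, 3, 1, 1]

def row_normalize(c):
    """Divide each row by its sum so that every row of W sums to 1."""
    w = []
    for row in c:
        total = sum(row)
        if total == 0:
            raise ValueError("isolated unit: row sums to zero")
        w.append([value / total for value in row])
    return w

def network_lag(w, values):
    """Return Wy: for each unit, the weighted average over its neighborhood set."""
    return [sum(w_ij * v for w_ij, v in zip(row, values)) for row in w]

def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

W = row_normalize(C)
Wy = network_lag(W, y)
r = pearson(y, Wy)

print([round(v, 1) for v in Wy])  # [7.0, 7.0, 7.0, 5.3, 3.3, 2.0, 2.0]
print(round(r, 2))
```

The printed Wy values match the Wy column of the slide, and the correlation between y and Wy is strongly positive for this toy network, which is the pattern Tobler's law leads one to expect.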
The most commonly used network autocorrelation regression model is the Network Autocorrelation Effects model:

y = α + ρWy + Xβ + ε

where:
• W is a row-normalized N×N weighting matrix with $w_{ij} > 0$ if i and j are related, 0 otherwise, and $w_{ii} = 0$ for all i;
• ρ is the network autocorrelation coefficient;
• y is an N×1 vector;
• Wy is an N×1 vector where each element i is a weighted average of the y values for i’s neighborhood set;
• X is an N×k matrix of exogenous variables;
• β is a k×1 vector of coefficients;
• ε is an N×1 vector of error terms.

It is also called the Network “Lag” model, by analogy to time series, since W acts similarly to the lag operator in time series models, except that W lags the y variable in other kinds of social and physical “spaces.” This is the model currently implemented in DEF2.

Estimating the network autocorrelation effects regression model y = α + ρWy + Xβ + ε

MLE: Maximum Likelihood Estimation. This is usually the method of choice. But the log-likelihood function contains the term ln|A|, where A = (I – ρW). Since A is asymmetric and usually not sparse, finding the eigenvalues needed to evaluate this term is computationally burdensome for large N. And, for more than two endogenous Wy variables, the likelihood function is intractable.

OLS: Ordinary Least Squares. A basic assumption of OLS is that all right-hand-side variables are independent of (uncorrelated with) the error term ε. If not, all coefficient estimates (ρ and β) are biased and inconsistent. Here, y is by definition a function of ε, so Wy is also a function of ε. That is, Cov(Wy, ε) ≠ 0. Wy is thus an endogenous regressor.

IV: Instrumental Variables. IV estimation provides a way to obtain consistent parameter estimates for models with endogenous variables. 2SLS is an IV estimation procedure. It can deal with large samples and multiple endogenous variables. DEF2 uses IV estimation procedures.
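The claim that Wy is a function of ε can be made explicit by solving the model for y. The short derivation below is a sketch under assumptions that the slides do not state: (I − ρW) is invertible, |ρ| < 1, and the errors have mean zero and common variance σ², are mutually uncorrelated, and are independent of X.

```latex
\begin{align*}
y &= \alpha\mathbf{1} + \rho W y + X\beta + \varepsilon \\
(I - \rho W)\,y &= \alpha\mathbf{1} + X\beta + \varepsilon \\
y &= (I - \rho W)^{-1}\left(\alpha\mathbf{1} + X\beta + \varepsilon\right) \\
Wy &= W(I - \rho W)^{-1}\left(\alpha\mathbf{1} + X\beta\right) + W(I - \rho W)^{-1}\varepsilon \\
\operatorname{Cov}(Wy,\ \varepsilon) &= \sigma^{2}\, W(I - \rho W)^{-1}
  = \sigma^{2}\left(W + \rho W^{2} + \rho^{2} W^{3} + \cdots\right)
\end{align*}
```

Because $w_{ii} = 0$, the first term contributes nothing to the diagonal, but $W^{2}$ has positive diagonal entries whenever a unit's neighbors are connected back to it. The covariance between the ith element of Wy and $\varepsilon_i$ is therefore nonzero in general when ρ ≠ 0, which is why OLS is inconsistent for this model.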
An “intuitive” view of the IV regression approach
OLS model: y = α + ρWy + ε

[Figure: path diagram relating Z, Wy, y and ε]

Z is an instrument for Wy if Cov(Z, ε) = 0 (Z is valid) and Cov(Z, Wy) ≠ 0 (Z is relevant). So we need to find one or more additional variables Z that are correlated with Wy but uncorrelated with ε to serve as instruments for Wy.

An “intuitive” view of the 2SLS IV estimation procedure
Consider again the network effects model

y = α + ρWy + Xβ + ε

Suppose we use WX, the lagged values of X, as an instrument for Wy.

Step 1. Using OLS, estimate
Wy = a + WXc + υ
and save the predicted scores
ŷw = â + WXĉ

Step 2. Again using OLS, estimate
y = α + ρŷw + Xβ + ε

(Note: the standard errors reported by step 2 are incorrect. This is not an issue for the 1-step procedures used in all the usual software packages.)

2SLS estimation of the network autocorrelation effects regression model with IVs: general case
y = α + Xβ + ε

Where to get appropriate instruments?
• Usually, it is hard to find additional variables that meet the required conditions. Variables that affect the endogenous variable(s) are often also likely to affect the dependent variable.
• Kelejian and Prucha (1998) show that the variables {WX, W²X, W³X, …} are optimal as instruments for Wy, where W², W³, … are the 2-step and 3-step connections between sample units.
• In practice, the WX variables or some subset of them will usually be sufficient.

Evaluating the quality of the instrumental variables
The quality of 2SLS estimators depends on the quality of the IVs.
• IVs must be valid: we require that Cov(Z, ε) = 0. IV estimation is vulnerable on this point. Tests are available only if there are more instruments than endogenous variables (overidentification).
• IVs also need to be relevant, i.e., they should predict the endogenous variables independently of the other exogenous variables. Shea (1997) proposed a partial R² measure of instrument relevance for models with multiple endogenous variables.
Only marginal association between the endogenous variable(s) and Z is known as the “weak” instruments problem. Some diagnostics are available.
• There must be no perfect collinearity among the exogenous variables.

Overidentification tests
If more than one instrumental variable is available for Wy, we can test the null hypothesis that the instruments are uncorrelated with the errors, against the alternative that at least one of them is correlated with the errors. Sargan (1958) is the best-known test:

$$T_S = N R_u^2 \sim \chi^2 \quad (\text{df} = \#\text{IVs} - \#\text{endogenous variables})$$

where $R_u^2$ is the R² of the OLS regression of the 2SLS residuals on the IVs. Basmann (1960) provides an alternative, though similar, test. Kirby and Bollen (2009) discuss additional variants of the Sargan and Basmann tests in the context of SEM.

“Weak” Instruments
Bound et al (1995) show that when the instruments are only weakly correlated with the endogenous variables, IV estimates are biased in the same direction as OLS estimates, and may be more biased than OLS. In addition, weak IV regression estimates may not be consistent.

Staiger and Stock (1997) suggest that the partial F-statistic for the increase in the regression R² after adding the auxiliary instruments to the exogenous variables in the first-stage regression should be greater than 10. Stock and Yogo (2005) provide tables that give some guidance as to how much greater than 10 the F-statistic may have to be.

Example: Monogamy in the Pre-industrial World
Multiple proposed determinants of the long-term historical shift in marriage preference from polygynous to monogamous unions are tested using data from the Standard Cross-Cultural Sample.
Determinants of Monogamy (adapted from Dow and Eff 2013)

Theoretical perspective | Primary sources | Determinants (expected sign)

Agent-level perspectives
Males provide essential resources | Orians 1969; Borgerhoff-Mulder et al 1990; Marlowe 2000; Low 2003; Alexander et al 1979 | male resource inequality (-), female economic contribution (-), beneficial natural environment (-)
Female intra-sexual aggression | Gowaty 1996; Reichard 2003 | endemic violence (-)
Male intra-sexual aggression | Emlen & Oring 1977; Hawkes et al 1995; Marlowe 2000; Borgerhoff-Mulder 1990; Quinlan and Quinlan 2007; van Schaik and Dunbar 1990; Wrangham et al 1999; Ember & Ember 1992 | endemic violence (-), social control (-)
Extrinsic risk | Quinlan and Quinlan 2007; Del Giudice 2009; Low 1988, 1990, 2003, 2007 | pathogen stress (-)

Group-level processes
Collective action in small-scale societies | Olson 1971; Alexander et al 1979; Price 1999; Betzig 1986 | (inverse of) societal scale (-)
Socially Imposed Monogamy (SIM) | Alexander et al 1979; Betzig 1986 | societal scale (+)
Cultural trait transmission | Divale and Seda 2001; Dow and Eff 2009; Herlihy 2005; Price 1999 | distance (+), language (+), modernization (+)

W matrices employed
• Geographical distance: the $W_D$ matrix is described in Dow and Eff (2009), where $c_{ij} = (1/d_{ij})^2$. Only the nearest 20 societies are used.
• Language similarity: the $W_L$ matrix is described in Eff (2008), where $c_{ij} = e^{-\mathrm{score}(ij)}$.

If the Ws are collinear, they can be combined into a single matrix:

$$W_{DL} = \pi_D W_D + \pi_L W_L, \quad 0 \le \pi_D, \pi_L \le 1, \quad \pi_D + \pi_L = 1$$

Then run all combinations of $W_{DL}$ and select as “best” the matrix that maximizes $R^2_{iv}$. This also yields information on the weights that produce the “best” combined W.

2SLS estimation of network autocorrelation regression model using composite distance/language W matrix.
Dependent variable is a Box-Cox transform of the percentage of married females in monogamous marriages, (monofem^λ – 1)/λ.

Unrestricted model
Variable | Description | Std coef | p-value | VIF
Wy | network lag term | 0.354 | 0.000 | 1.463
modern | modernization | 0.166 | 0.026 | 1.113
pathstr | pathogen stress | -0.253 | 0.006 | 1.721
violence | intra-societal violence | -0.165 | 0.034 | 1.155
environ | beneficial environment | 0.217 | 0.007 | 1.330
femecon | female economic contribution | -0.216 | 0.002 | 1.097
techlev | technological level | 0.141 | 0.063 | 1.606
socont | social control over sexual relations | 0.093 | 0.286 | 1.232
resineq | resource inequality | 0.054 | 0.631 | 2.040
socscale | scale of society | -0.008 | 0.929 | 2.074
R² = 0.466

Restricted model
Variable | Description | Std coef | p-value | Partitioned R²
Wy | network lag term | 0.361 | 0.000 | 0.178
modern | modernization | 0.173 | 0.020 | 0.038
pathstr | pathogen stress | -0.257 | 0.003 | 0.099
violence | intra-societal violence | -0.155 | 0.044 | 0.043
femecon | female economic contribution | -0.208 | 0.003 | 0.052
environ | beneficial environment | 0.188 | 0.011 | 0.018
techlev | technological level | 0.179 | 0.008 | 0.026
R² = 0.453

Restricted model: F-statistics and p-values on diagnostic tests
Test | Null hypothesis | F-stat | p-value
Hausman | H0: Wy exogenous | 4.242 | 0.040
Ramsey RESET | H0: model has correct functional form | 0.269 | 0.604
Breusch-Pagan | H0: residuals homoskedastic | 2.678 | 0.102
Wald restrictions | H0: dropped variables have coef = 0 | 0.494 | 0.483
Shapiro-Wilk | H0: residuals normally distributed | 1.826 | 0.177
LM error (geographic) | H0: residuals not autocorrelated | 2.353 | 0.125
LM error (language) | H0: residuals not autocorrelated | 0.043 | 0.836
LM error (ecological) | H0: residuals not autocorrelated | 0.690 | 0.406
Sargan test | H0: residuals uncorrelated with IVs | 0.459 | 0.498
Staiger-Stock weak IVs | F = 15.00

Notes: Dependent variable is (monofem^λ – 1)/λ (Box-Cox transformation), where λ = 4.157. Coefficient p-values are from bootstrap standard errors (1,000 replications).
All estimations from multiply imputed (m=15) data; only observations non-missing for the dependent variable (N=143) are used in the m regressions. Composite matrix weights: distance=0.78, language=0.22.

Summary:
• DEF2 is a new statistical package designed for cross-cultural and cross-national data sets.
• Given the ubiquity of missing data in such data sets, DEF2 includes a suite of programs for multiple imputation of missing data.
• Given that sample units in comparative data sets are non-independent due to various processes of cultural trait diffusion, DEF2 includes a suite of programs to implement network autocorrelation effects models.
• Available ??? Where and How, Anthon and Doug.