Introduction to and Overview of DEf
An R software package for cross-cultural research
E. Anthon Eff
Malcolm M. Dow
Wes Routon
Anthropological Sciences Conference, Albuquerque, March 18, 2014
The two major problems with cross-cultural
data analysis addressed by DEf are:
Missing Data
All of the major cross-cultural data sets have substantial missing data. Single imputation methods
(mean substitution, regression predicted scores, hot deck, etc.) result in coefficient variance
estimates that are biased downward. Data editing procedures, e.g. listwise deletion,
generally result in small samples (loss of power) and also require very strong assumptions about
why data are missing. These assumptions are very unlikely to hold. Single imputation methods
are no longer recommended.
DEf employs the Multiple Imputation by Chained Equations (mice) approach to handling missing
data.
Non-Independence of Sample Units
Sample cases in cross-cultural and cross-national data are frequently not independent of one
another due to various inter-societal network processes: cultural trait borrowing, conquest,
emulation, inheritance from ancestral populations, etc. This is the classic Galton’s Problem in
anthropology, understood more generally as the problem of cultural trait transmission.
DEf addresses this issue by incorporating networks of relations into regression models, and
employing instrumental variables procedures to generate consistent and relatively efficient
estimates.
First problem: Missing Data
Society          markin  markout  money  commland  sharefood
Nama Hottentot   NA      NA       1      NA        NA
Kung Bushmen     1       4        1      3         6
Thonga           4       4        3      3         6
Lozi             3       3        1      3         NA
Mbundu           NA      NA       4      NA        NA
Suku             2       2        4      2         2
Two solutions:
1. Listwise deletion
2. Multiple imputation
Listwise deletion
Society          markin  markout  money  commland  sharefood
Nama Hottentot   NA      NA       1      NA        NA
Kung Bushmen     1       4        1      3         6
Thonga           4       4        3      3         6
Lozi             3       3        1      3         NA
Mbundu           NA      NA       4      NA        NA
Suku             2       2        4      2         2
• Lose three observations (Nama Hottentot, Lozi, Mbundu). Lose all of the observed
information in those rows (the cells marked in red on the original slide).
• Of 186 societies, 156 would have been dropped using listwise deletion.
No longer testing against the full range of human societies. Losing the big
advantage of the SCCS. Probable sample selection bias.
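A minimal Python sketch of listwise deletion on the six-society example may make the loss concrete. It assumes the example table is read row by row (one row per society), uses None for NA, and the names DATA and listwise_delete are illustrative only; this is not DEf code.

```python
# Example table, one row per society; None stands for NA.
# Column order: markin, markout, money, commland, sharefood.
DATA = {
    "Nama Hottentot": [None, None, 1, None, None],
    "Kung Bushmen":   [1, 4, 1, 3, 6],
    "Thonga":         [4, 4, 3, 3, 6],
    "Lozi":           [3, 3, 1, 3, None],
    "Mbundu":         [None, None, 4, None, None],
    "Suku":           [2, 2, 4, 2, 2],
}


def listwise_delete(data):
    """Keep only the societies with no missing value on any variable."""
    return {name: row for name, row in data.items() if None not in row}


complete = listwise_delete(DATA)
dropped = [name for name in DATA if name not in complete]

# Observed (non-missing) cells that are thrown away with the dropped rows.
discarded_cells = sum(v is not None for name in dropped for v in DATA[name])

print("kept:", list(complete))
print("dropped:", dropped)
print("observed cells discarded:", discarded_cells)
```

Half of the societies disappear, along with the values that were actually observed for them.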
Multiple imputation
Replace missing values with imputed values, drawn from the conditional distribution. Create
several (5 to 10) new data sets with imputed values.

Imputed data set 1
Society          markin  markout  money  commland  sharefood
Nama Hottentot   3       4        1      2         3
Kung Bushmen     1       4        1      3         6
Thonga           4       4        3      3         6
Lozi             3       3        1      3         6
Mbundu           4       5        4      3         1
Suku             2       2        4      2         2

Imputed data set 2
Society          markin  markout  money  commland  sharefood
Nama Hottentot   2       3        1      1         2
Kung Bushmen     1       4        1      3         6
Thonga           4       4        3      3         6
Lozi             3       3        1      3         4
Mbundu           3       5        4      5         3
Suku             2       2        4      2         2

Imputed data set 3
Society          markin  markout  money  commland  sharefood
Nama Hottentot   3       5        1      2         3
Kung Bushmen     1       4        1      3         6
Thonga           4       4        3      3         6
Lozi             3       3        1      3         5
Mbundu           2       6        4      4         2
Suku             2       2        4      2         2
Step 1 of the DEf Approach to Multiple Imputation of Missing
Data: finding auxiliary variables.
The mice procedure imputes values for missing observations on the variables specified in the structural
regression model of interest, using both these variables themselves plus a set of auxiliary variables.
Ideal auxiliary variables are usually a subset of those with no missing values in the full data set.
Auxiliary variables must be correlated with the variables in the structural regression model that have missing
values, since the imputation procedure is designed to “borrow” information from them to help impute the
missing values.
DEf will employ auxiliary variables provided by the user. Alternatively, DEf will identify suitable auxiliary
variables as follows:
1. Identify all categorical, ordinal, and interval variables with no missing values in the complete data set.
2. Identify the variables to be imputed and, one at a time, treat each as a dependent variable:
   i) regress (using binary/ordinal logit, multinomial, OLS) the dependent variable on the
      covariate that provides the highest correlation, and save the residual
   ii) add to the regression model the covariate that correlates highest with the residual, and
      calculate the new residual
   iii) repeat the above steps 8 times (or more)
   iv) calculate the relative importance of predictors, drop variables that fall below a given
      threshold, and recalculate the residual
   v) repeat steps ii – iv.
Step 2: Create m complete data sets
• The mice procedure is repeated m times to create m copies of
the data set, each containing different sets of imputed values.
• Since each data set is now complete, each can be analyzed
using any of the usual statistical models that require
complete data.
• m = 10 - 100 is currently suggested, depending on sample size
and amounts of missing data.
Step 3: Analyzing the data and pooling the results: Rubin’s
Rules
Separate analyses of the m multiply imputed samples generate m estimates of any statistic of
interest. In the general case, for any statistic Q, an analysis of the m data sets yields estimates
$\hat{Q}^{(j)}$ and $U^{(j)}$ of the statistic and its variance for the jth data set (j = 1, 2, …, m). The multiple
imputation point estimate of each parameter Q is simply the mean of the m estimates:
$$\bar{Q} = \frac{1}{m} \sum_{j=1}^{m} \hat{Q}^{(j)}$$
Analyzing the data and pooling the results, cont….
To calculate the variance of this estimate, both the m estimated variances $U^{(j)}$ of each $\hat{Q}^{(j)}$ and the
variance of the $\hat{Q}^{(j)}$ across the m estimations must be combined. First, the mean of the m
estimated variances for each parameter is obtained as a simple average:
$$W = \frac{1}{m} \sum_{j=1}^{m} U^{(j)}$$
This quantity is known as the within-imputation variance.
Analyzing the data and pooling the results, cont….
Next, the variance in the m estimated values $\hat{Q}^{(j)}$ is calculated as:
$$B = \frac{1}{m-1} \sum_{j=1}^{m} \left( \hat{Q}^{(j)} - \bar{Q} \right)^2$$
This quantity is known as the between-imputation variance. These two variances are then
combined to get the total variance in the combined estimate of Q:
$$T = W + \frac{m+1}{m} B$$
Analyzing the data and pooling the results, cont….
Rubin (1987: 79) shows that the quantity $(Q - \bar{Q})/\sqrt{T}$ is approximately distributed as a t-distribution,
where the degrees of freedom, df, are given by
$$df = (m-1) \left[ 1 + \frac{mW}{(m+1)B} \right]^2$$
Analyzing the data and pooling the results, cont….
Rubin’s pooling procedures can be done with any statistic generated
by the statistical method employed to analyze the m imputed
data sets.
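The pooling formulas above can be sketched in a few lines of Python. This is a minimal illustration, not DEf code: the function name pool_rubin is hypothetical, and the three estimates and variances in the usage lines are made-up numbers chosen only to show the arithmetic.

```python
import math


def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules.

    Returns (q_bar, w, b, t, df): pooled estimate, within-imputation
    variance, between-imputation variance, total variance, and the
    degrees of freedom of the reference t-distribution.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = sum(estimates) / m                                # mean of the m estimates
    w = sum(variances) / m                                    # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)    # between-imputation variance
    t = w + (m + 1) / m * b                                   # total variance
    # df is unbounded when the estimates do not vary across imputations.
    df = (m - 1) * (1 + m * w / ((m + 1) * b)) ** 2 if b > 0 else math.inf
    return q_bar, w, b, t, df


# Made-up coefficient estimates and variances from m = 3 imputed data sets.
q_bar, w, b, t, df = pool_rubin([0.30, 0.35, 0.40], [0.010, 0.012, 0.011])
print(q_bar, w, b, t, df)
print("pooled standard error:", math.sqrt(t))
```

The pooled standard error is the square root of T, so it grows with disagreement among the imputed data sets as well as with the ordinary sampling variance.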
Galton’s Problem
Incorporating inter-societal networks
into network autocorrelation effects
regression models
Galton’s problem
Observations not independent.
• Common descent (language phylogeny)
• Cultural borrowing (geographic distance)
In regression context, Galton’s problem will
cause biased coefficients and biased standard
errors.
Galton’s problem example:
Hypothesis: Drinking alcohol dampens the libido of religious specialists.
Country   alcohol  wives
Ecuador   1        0
Iran      0        2
Ireland   1        0
Morocco   0        3
Spain     1        0
Yemen     0        4
Pearson correlation= -0.9332565, p-value=0.0065
Adapted from Victor de Munck and Andrey Korotayev. 2000. “Cultural Units in Cross-Cultural Research.” Ethnology 39(4): 335-348
An observed correlation between a pair of cultural traits across cultures could be
due to the borrowing of the traits, as a package, from a common source (“horizontal
transmission”), or could be due to their transmission, as a package, from a common
ancestor (“vertical transmission”), or could be due to a true functional relationship.
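The correlation reported for the six-country example can be checked directly. The sketch below is only an arithmetic check in Python; the helper name pearson is illustrative.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Ecuador, Iran, Ireland, Morocco, Spain, Yemen
alcohol = [1, 0, 1, 0, 1, 0]
wives = [0, 2, 0, 3, 0, 4]

r = pearson(alcohol, wives)
print(r)  # about -0.933, matching the value on the slide
```

The p-value treats the six countries as six independent observations, which is exactly the assumption Galton's problem calls into question.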
What processes might be inducing non-independence?
• Spatial diffusion: societies in close proximity have more
opportunity to emulate, conform to, adopt, borrow, etc. neighbors'
behaviors, beliefs, customs, rituals… (horizontal diffusion).
• Language similarity: similarity due to populations splitting off from
the same ancestral population (vertical diffusion).
• Religion: marriage practices spread world-wide by the colonization
of large swaths of the world by European Christian nations.
• Equivalence: units “similarly situated” in a network and not
necessarily proximate. E.g., economic similarity, core/periphery in
world system, colonial status, ecological setting, …
Assessing non-independence: Tobler’s First Law
of Geography
“Everything is related to everything else, but near things
are more closely related than distant things.”
This “law” suggests that the scores on variable y for the ith
society should be similar to the scores of those societies
with which it has the closest relationships. Call these
societies i’s “neighborhood set.”
If so, yi should be similar to the weighted average of the
set of y scores for i’s neighborhood set, where the
weights indicate relative closeness.
If the N scores on y are significantly correlated with the N
weighted average scores, conclude the y variable is auto(self)-correlated.
Weighting sample units.
First, need to construct an NxN connectivity matrix C of pair-wise
relatedness scores among sample units, and then row-normalize C to unity to get the required weights matrix W.
That is, w_ij = c_ij / Σ_j c_ij.
Raw connectivity matrix C, weights matrix W, y, and Wy (one row per sample unit):

Unit  C               W                                     y   Wy
1     0 1 1 1 0 0 0   0    1/3  1/3  1/3  0    0    0       6   7
2     1 0 0 1 0 0 0   1/2  0    0    1/2  0    0    0       5   7
3     1 0 0 1 0 0 0   1/2  0    0    1/2  0    0    0       8   7
4     0 1 1 0 1 0 0   0    1/3  1/3  0    1/3  0    0       8   5.3
5     0 0 0 1 0 1 1   0    0    0    1/3  0    1/3  1/3     3   3.3
6     0 0 0 0 1 0 1   0    0    0    0    1/2  0    1/2     1   2
7     0 0 0 0 1 1 0   0    0    0    0    1/2  1/2  0       1   2
(If a variable y is premultiplied by W, i.e. Wy, the product will be an Nx1 vector of
weighted averages that are on the same scale as y.)
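The seven-unit example can be reproduced with a short Python sketch: row-normalize C to obtain W, form Wy, and correlate y with Wy as suggested under Tobler's First Law. The function names are illustrative, and exact fractions are used only so that the weights print as 1/3 and 1/2; this is not DEf code.

```python
import math
from fractions import Fraction

# Raw connectivity matrix C and scores y from the example.
C = [
    [0, 1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 0],
]
y = [6, 5, 8, 8, 3, 1, 1]


def row_normalize(c):
    """Divide each row by its sum so that every row of W sums to one."""
    w = []
    for row in c:
        total = sum(row)
        if total == 0:
            raise ValueError("a unit with no connections cannot be row-normalized")
        w.append([Fraction(value, total) for value in row])
    return w


def network_lag(w, scores):
    """Return Wy: for each unit, the weighted average of its neighbors' scores."""
    return [sum(w_ij * s for w_ij, s in zip(row, scores)) for row in w]


def pearson(a, b):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)


W = row_normalize(C)
Wy = network_lag(W, y)
r = pearson([float(v) for v in y], [float(v) for v in Wy])

print([str(v) for v in Wy])   # weighted averages on the same scale as y
print(r)                      # correlation between y and Wy
```

A strongly positive correlation between y and Wy is the pattern that signals autocorrelation; judging its significance needs a proper test, which this sketch does not attempt.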
Incorporating autocorrelated variables into multiple
regression
• Most cross-cultural researchers are interested
in testing whether hypothesized predictor variables
are acting on a dependent variable, as well as what
processes are inducing autocorrelation in it.
• The Network Autocorrelation Regression Effects
Models in DEf do just that.
Most commonly used network autocorrelation
regression model is:
Network Autocorrelation Effects model:
y = α + ρWy + Xβ + ε
Where: W is a row-normalized NxN weighting matrix with wij > 0 if i and j are
related, 0 otherwise, and wii = 0 for all i;
ρ is the network autocorrelation coefficient;
y is an Nx1 vector;
Wy is an Nx1 vector where each element i is a weighted average of y values
for i’s neighborhood set;
X is an Nxk matrix of exogenous variables;
β is a kx1 vector of coefficients;
ε is an Nx1 vector of error terms.
Also called the Network “Lag” model, by analogy to time series, since W acts similarly to
the lag operator in time series models, except that W lags the y variable in other kinds
of social and physical “spaces.”
This is the model currently implemented in DEf
Estimating the network autocorrelation effects
regression model
y = α + ρWy + Xβ + ε
• MLE: Maximum Likelihood Estimation. This is usually the method of
choice. But the log-likelihood function contains the term ln|A|, where A =
(I – ρW). Since A is asymmetric and usually not sparse, finding the
eigenvalues is computationally burdensome for large N. And, for more
than two endogenous Wy variables, the likelihood function is intractable.
• OLS: Ordinary Least Squares. A basic assumption of OLS is that all right-hand-side
variables are uncorrelated with the error term ε. If not,
all coefficient estimates (ρ and β) are biased and inconsistent. Here, y is
by definition a function of ε, so Wy is also a function of ε. That is, Cov(Wy,
ε) ≠ 0. Wy is thus an endogenous regressor.
• IV: Instrumental Variables. Provides a way to obtain consistent
parameter estimates for models with endogenous variables. 2SLS is an IV
estimation procedure. Can deal with large samples and multiple
endogenous variables. DEf uses IV estimation procedures.
An “intuitive” view of the IV regression approach
OLS model:
y = α + ρWy + ε
[Path diagram relating ε, Z, Wy, and y]
Z is an instrument for Wy if
Cov(Z,ε) = 0 (Z is valid) and Cov(Z,Wy) ≠ 0 (Z is relevant).
So, need to find an additional variable(s) Z that is correlated with Wy
but uncorrelated with ε to serve as an instrument for Wy.
An “intuitive” view of the 2SLS IV estimation procedure
Consider again the network effects model
y = α + ρWy + Xβ + ε
Suppose we use WX, the lagged values of X, as an instrument for Wy.
Step 1. Using OLS, estimate
Wy = a + WXc + υ
Save the predicted scores
ŷw = â + WXĉ
Step 2. Again using OLS, estimate
y = α + ρŷw + Xβ + ε
(Note: the standard errors reported in Step 2 are incorrect. This is not an issue for the one-step
procedures used in all the usual software packages.)
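The two-step logic can be illustrated with a small simulation in Python. This is a sketch under simplifying assumptions, not DEf's estimator: a generic endogenous regressor x stands in for Wy, a single instrument z stands in for WX, there are no further X variables, and the data are simulated with a true slope of 2. All function and variable names are illustrative.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)


def ols_simple(x, y):
    """Intercept and slope from regressing y on a single regressor x."""
    slope = cov(x, y) / cov(x, x)
    return mean(y) - slope * mean(x), slope


def two_stage(z, x, y):
    """2SLS with one endogenous regressor x and one instrument z."""
    a_hat, c_hat = ols_simple(z, x)             # step 1: regress x on z
    x_hat = [a_hat + c_hat * zi for zi in z]    # predicted scores
    return ols_simple(x_hat, y)                 # step 2: regress y on predictions


random.seed(1)
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]      # instrument, unrelated to the error
u = [random.gauss(0, 1) for _ in range(n)]      # shock shared by x and the error
x = [zi + ui for zi, ui in zip(z, u)]           # endogenous regressor
e = [ui + random.gauss(0, 1) for ui in u]       # error term, correlated with x
y = [1.0 + 2.0 * xi + ei for xi, ei in zip(x, e)]

_, b_ols = ols_simple(x, y)
_, b_iv = two_stage(z, x, y)
print("OLS slope: ", b_ols)    # pulled away from 2 by the endogeneity
print("2SLS slope:", b_iv)     # close to the true value of 2
```

With one instrument and one endogenous regressor, the second-step slope equals Cov(z, y)/Cov(z, x). As the note above says, standard errors computed naively from the second step would be wrong, so the sketch reports point estimates only.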
2SLS Estimation of the network autocorrelation effects
regression model with IVs: general case
y = α + Xβ + ε
Where to get appropriate instruments?
• Usually, it’s hard to find additional variables that meet the
conditions required. Variables that affect the endogenous
variable(s) are often also likely to affect the dependent
variable.
• Kelejian and Prucha (1998) show that the set of {WX, W²X,
W³X, …} variables are optimal as instruments for Wy, where
W², W³, … are the 2-step and 3-step connections between
sample units. In practice, the WX variables or some subset of
them will usually be sufficient.
Evaluating the quality of the instrumental variables
The quality of 2SLS estimators depends on the quality of the IVs.
Require that
• IVs are valid: Cov(Z,ε) = 0. IV estimation is vulnerable on
this point. Tests are available only if there are more instruments
than endogenous variables (overidentification).
• IVs are relevant, i.e., they predict the
endogenous variables independently of the other exogenous
variables. Shea (1997) proposed a partial R² measure of
instrument relevance for models with multiple endogenous variables.
• IVs are not weak. Only marginal association between the endogenous variable(s) and Z is
known as the “weak” instruments problem. Some diagnostics
are available.
• There is no perfect collinearity among the exogenous variables.
Overidentification tests
• If more than one instrumental variable is available for Wy, can
test the null hypothesis that the instruments are uncorrelated with
the errors, against the alternative that at least one of them is correlated with the errors.
• Sargan (1958) is the best known test:
T_S = N R²_u ~ χ²
(with df = #IVs - #endogenous variables)
where R²_u is the R² of the OLS regression of the 2SLS residuals on the IVs.
• Basmann (1960) provides an alternate, though similar, test.
• Kirby and Bollen (2009) discuss additional variants of Sargan and
Basmann in the context of SEM.
“Weak” Instruments
• Bound et al. (1995) show that when the instruments are only
weakly correlated with the endogenous variables, IV estimates
are biased in the same direction as OLS estimates, and may be
more biased than OLS. In addition, weak IV regression
estimates may not be consistent.
• Staiger and Stock (1997) suggest that the partial F-statistic
from the increase in the regression R² after adding the
auxiliary instruments to the exogenous variables in the first
stage regression should be greater than 10.
• Stock and Yogo (2005) provide tables that give some guidance
as to how much greater than 10 the F-statistic may have to be.
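The partial F-statistic in the Staiger and Stock rule of thumb can be computed from two first-stage R² values. The Python sketch below uses the standard F-test for added regressors; the function name, argument names, and the numbers in the usage lines are illustrative assumptions, not output from DEf.

```python
def partial_f(r2_restricted, r2_full, n_added, n_obs, n_params_full):
    """F-statistic for the increase in R2 when n_added instruments join the first stage.

    r2_restricted: first-stage R2 with the exogenous variables only
    r2_full:       first-stage R2 after adding the instruments
    n_params_full: number of coefficients in the full first stage, intercept included
    """
    df_resid = n_obs - n_params_full
    if n_added <= 0 or df_resid <= 0:
        raise ValueError("need at least one added instrument and positive residual df")
    return ((r2_full - r2_restricted) / n_added) / ((1.0 - r2_full) / df_resid)


# Made-up example: 3 instruments raise the first-stage R2 from 0.20 to 0.50
# in a sample of 104 units, with 4 coefficients in the full first stage.
f_stat = partial_f(0.20, 0.50, n_added=3, n_obs=104, n_params_full=4)
print(f_stat)
print("passes the rule of thumb (F > 10):", f_stat > 10)
```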
Example: Monogamy in the Pre-industrial World
Multiple proposed determinants of the long-term
historical shift in marriage preference from
polygynous to monogamous unions are tested using
data from the Standard Cross-Cultural Sample.
Determinants of Monogamy (adapted from Dow and Eff 2013)
Each entry gives the theoretical perspective, its primary sources, and the determinants (expected sign).

Agent level perspectives

Males provide essential resources
  Primary sources: Orians 1969; Borgerhoff-Mulder et al 1990; Marlowe 2000; Low 2003; Alexander et al 1979
  Determinants: male resource inequality (-), female economic contribution (-), beneficial natural environment (-)
Female intra-sexual aggression
  Primary sources: Gowaty 1996; Reichard 2003
  Determinants: endemic violence (-)
Male intra-sexual aggression
  Primary sources: Emlen & Oring 1977; Hawkes et al 1995; Marlowe 2000; Borgerhoff-Mulder 1990;
  Quinlan and Quinlan 2007; van Schaik and Dunbar 1990; Wrangham et al 1999; Ember & Ember 1992
  Determinants: endemic violence (-), social control (-)
Extrinsic risk
  Primary sources: Quinlan and Quinlan 2007; Del Giudice 2009; Low 1988, 1990, 2003, 2007
  Determinants: pathogen stress (-)

Group-level processes

Collective action in small-scale societies
  Primary sources: Olson 1971; Alexander et al 1979; Price 1999; Betzig 1986
  Determinants: (inverse of) societal scale (-)
Socially Imposed Monogamy (SIM)
  Primary sources: Alexander et al 1979; Betzig 1986
  Determinants: societal scale (+)
Cultural trait transmission
  Primary sources: Divale and Seda 2001; Dow and Eff 2009; Herlihy 2005; Price 1999
  Determinants: distance (+), language (+), modernization (+)
W matrices employed
• Geographical distance:
the W_D matrix is described in Dow and Eff (2009), where c_ij = (1/d_ij)^2.
Use only the nearest 20 societies.
• Language similarity:
the W_L matrix is described in Eff (2008), where c_ij = e^(-score(ij)).
If the Ws are collinear, can combine them into a single matrix:
W_DL = π_D W_D + π_L W_L
where 0 ≤ π_D, π_L ≤ 1 and π_D + π_L = 1.
Then, run all combinations of W_DL and select as “best” the matrix that maximizes R²_iv.
Also obtain information on the weights that yield the “best” combined W.
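The construction of candidate composite matrices can be sketched as follows. This Python fragment only builds the candidates on a grid of π_D values; the two 3x3 matrices are made-up, already row-normalized stand-ins for the distance and language matrices, the grid step of 0.1 is an assumption, and the model-fitting step that would score each candidate by its R² is deliberately left out.

```python
def combine(w_d, w_l, pi_d):
    """Return pi_d * W_D + (1 - pi_d) * W_L for two equally sized weight matrices."""
    if not 0.0 <= pi_d <= 1.0:
        raise ValueError("pi_d must lie between 0 and 1")
    pi_l = 1.0 - pi_d
    return [
        [pi_d * a + pi_l * b for a, b in zip(row_d, row_l)]
        for row_d, row_l in zip(w_d, w_l)
    ]


# Hypothetical row-normalized matrices for three societies.
W_D = [[0, 0.5, 0.5], [0.5, 0, 0.5], [0.5, 0.5, 0]]   # distance-based weights
W_L = [[0, 1, 0], [1, 0, 0], [0, 1, 0]]               # language-based weights

# One candidate composite matrix per value of pi_D on a grid from 0 to 1.
grid = [k / 10 for k in range(11)]
candidates = {pi_d: combine(W_D, W_L, pi_d) for pi_d in grid}

for pi_d in (0.0, 0.5, 1.0):
    print(pi_d, candidates[pi_d])
```

Because the two weights sum to one, every candidate remains row-normalized, so each can be used in the model in place of a single W.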
2SLS estimation of network autocorrelation regression model using composite
distance/language W matrix. Dependent variable is a Box-Cox transform of the percentage of
married females in monogamous marriages [(monofem^λ – 1)/λ]

Unrestricted model
Variable   Description                           Std coef  p-value  VIF
Wy         network lag term                       0.354    0.000    1.463
modern     modernization                          0.166    0.026    1.113
pathstr    pathogen stress                       -0.253    0.006    1.721
violence   intra-societal violence               -0.165    0.034    1.155
environ    beneficent environment                 0.217    0.007    1.330
femecon    female economic contribution          -0.216    0.002    1.097
techlev    technological level                    0.141    0.063    1.606
socont     social control over sexual relations   0.093    0.286    1.232
resineq    resource inequality                    0.054    0.631    2.040
socscale   scale of society                      -0.008    0.929    2.074
R² = 0.466

Restricted model
Variable   Description                    Std coef  p-value  partitioned R²
Wy         network lag term                0.361    0.000    0.178
modern     modernization                   0.173    0.020    0.038
pathstr    pathogen stress                -0.257    0.003    0.099
violence   intra-societal violence        -0.155    0.044    0.043
femecon    female economic contribution   -0.208    0.003    0.052
environ    beneficent environment          0.188    0.011    0.018
techlev    technological level             0.179    0.008    0.026
R² = 0.453

Restricted model F-statistics and p-values on diagnostics tests
Test                    H0                                  F-stat  p-value
Hausman                 Wy exogenous                        4.242   0.040
Ramsey RESET            model correct functional form       0.269   0.604
Breusch-Pagan           residuals homoskedastic             2.678   0.102
Wald-restrictions       dropped variables have coef=0       0.494   0.483
Shapiro-Wilk            residuals normally distributed      1.826   0.177
LM error (geographic)   residuals not autocorrelated        2.353   0.125
LM error (language)     residuals not autocorrelated        0.043   0.836
LM error (ecological)   residuals not autocorrelated        0.690   0.406
Sargan test             residuals uncorrelated with IVs     0.459   0.498
Staiger-Stock weak IVs F = 15.00
Notes: Dependent variable is (monofem^λ – 1)/λ (Box-Cox transformation), where λ = 4.157. Coefficient p-values
are from bootstrap standard errors (1,000 replications). All estimations are from multiply imputed (m = 15) data; only
observations non-missing for the dependent variable (N = 143) are used in the m regressions. Composite
matrix weights: distance = 0.78, language = 0.22.
Summary:
• DEf is a new statistical package designed for cross-cultural
and cross-national data sets.
• Given the ubiquity of missing data in such data sets, DEf
includes a suite of programs for multiple imputation of
missing data.
• Given that sample units in comparative data sets are non-independent due to various processes of cultural trait
diffusion, DEf includes a suite of programs to implement
network autocorrelation effects models.
• Available as R workspace and on XSEDE CoSSci/DEf Science
Gateway.