Author: Anthon Eff. Version: CCDmanual0dw.docx

addesc
Add a variable description to the key file
Description
The function adds a variable description to the key file. This is useful when a newly created variable does not yet have a
description in the key file. The description is then available for use in doOLS output.
Usage
addesc(nvbs,nvbsdes,dsn=NULL)
Arguments
nvbs name of variable
nvbsdes description of nvbs
dsn name of the data set the variable is based upon (“EA”, “LRB”, “SCCS”, “WNAI”)
Value
The function appends the description to the key file.
Details
Note
Author(s)
Anthon Eff
Anthon.Eff@mtsu.edu
Examples
See under doOLS for complete workflow example.
CSVwrite
Write object to *.csv file
Description
The function writes an object, with elements capable of being coerced to a dataframe, to a csv file. It is used to write the
output from doOLS to a file that can be read by a spreadsheet.
Usage
CSVwrite(a1,a2,a3=FALSE)
Arguments
a1 Object to be written—typically output from function doOLS
a2 The base name of the *.csv file (do not include the “.csv” extension)
a3 Should the object be appended to the existing file (default=FALSE)
Value
No values are returned in the R environment; only changes occur to the specified *.csv file.
Details
Set the option a3=TRUE to append the output of object a1 to an existing file with base name a2. The default will simply
overwrite any existing csv file with base name a2.
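The append-versus-overwrite behaviour can be approximated in base R with write.table, which accepts append=TRUE (write.csv does not). This is a minimal sketch under stated assumptions, not the CSVwrite source: the helper name csv_write_sketch, the toy list, and the file name are all hypothetical.

```r
# Hypothetical helper (not part of the package): write every element of a
# list to one csv file, coercing each element to a data frame.
csv_write_sketch <- function(obj, base, append = FALSE) {
  fn <- paste0(base, ".csv")
  # wrap a single data frame or matrix so the loop below works
  if (!is.list(obj) || is.data.frame(obj)) obj <- list(obj)
  # default behaviour: start from an empty file
  if (!append && file.exists(fn)) file.remove(fn)
  for (i in seq_along(obj)) {
    d <- as.data.frame(obj[[i]])
    # write.table warns when column names are appended; that is intended here
    suppressWarnings(
      write.table(d, fn, sep = ",", append = TRUE,
                  col.names = NA, qmethod = "double")
    )
  }
  invisible(NULL)
}

out <- list(a = data.frame(x = 1:2), b = data.frame(y = c("u", "v")))
tmp <- file.path(tempdir(), "myOutputSketch")
csv_write_sketch(out, tmp)                 # overwrites any existing file
csv_write_sketch(out, tmp, append = TRUE)  # adds the same blocks again
n_lines <- length(readLines(paste0(tmp, ".csv")))
```

Each list element contributes a header line plus its rows, so the file grows by the same number of lines on every appended call.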
Note
Like the write.csv function, except that CSVwrite can append values to an existing csv file, and it can write elements of a list
to a csv file.
Author(s)
Anthon Eff
Anthon.Eff@mtsu.edu
Examples
See under doOLS for complete workflow example.
doMI
Produce multiple imputed data sets
Description
The function produces multiply imputed data sets from the four ethnographic datasets (EA, LRB, SCCS, and WNAI), using methods from the mice package.
Usage
smi<-doMI(EAvs=NULL,LRBvs=NULL,SCCSvs=NULL,WNAIvs=NULL,nimp=10,maxit=7)
Arguments
EAvs character string containing names of variables from EA dataset
LRBvs character string containing names of variables from LRB dataset
SCCSvs character string containing names of variables from SCCS dataset
WNAIvs character string containing names of variables from WNAI dataset
nimp the number of imputed data sets to create (default=10)
maxit the number of iterations used to estimate imputed data (default=7)
Value
The function doMI returns a dataframe containing the number of imputed datasets specified by the nimp option. The
datasets are stacked one atop the other, and indexed by the variable “.imp”.
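A stacked dataframe of this kind can be taken apart with ordinary subsetting on “.imp”. The sketch below uses a small made-up dataframe (smi_toy, with hypothetical columns society and x) in place of real doMI output.

```r
# Toy stand-in for a stacked, multiply imputed dataframe (values are made up)
smi_toy <- data.frame(
  .imp    = rep(1:3, each = 4),      # index of the imputed dataset
  society = rep(1:4, times = 3),     # hypothetical row identifier
  x       = c(1, 2, 3.0, 4,
              1, 2, 3.2, 4,
              1, 2, 2.9, 4)          # third value differs across imputations
)

# a single imputed dataset
first <- smi_toy[smi_toy$.imp == 1, ]

# all imputed datasets as a list, one element per value of .imp
by_imp <- split(smi_toy, smi_toy$.imp)
n_imp  <- length(by_imp)

# a statistic computed per imputation, then averaged across imputations
means  <- sapply(by_imp, function(d) mean(d$x))
pooled <- mean(means)
```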
Details
This function imputes several new datasets, using covariates for each variable to create a conditional distribution of
estimates for each missing value, and then replacing the missing value with a draw from the distribution; as a result, each of
the imputed datasets will typically have slightly different values for the estimated cells. The key to successful imputation is
to have good covariates for each variable. The function doMI begins the search for good covariates by grouping each
variable in a cluster of collinear variables. For each cluster, the best covariates are selected from a set of variables with no
missing values, including both network lag variables (based on geographic distance, language, and ecology) and climate and
ecology variables.
The first four arguments are lists of variable names, from the four ethnographic data sets (EA, LRB, SCCS, and WNAI).
These will be the data used in model building. One should include all data one thinks might be useful, but no additional
data, since additional variables will add to the time it takes for the procedure to run. The fifth argument is the number of
imputed datasets to create: between 5 and 10 imputed data sets are considered adequate, but there is no harm in choosing
more; the default is 10. The final argument is the number of iterations to perform in creating each imputed dataset; the
default is 7.
It is not usually necessary to examine the returned dataframe: it is used in estimating the model, but is not in itself that
interesting. Nevertheless, some output is automatically written to the console as the function executes, in order to provide some
information about the clusters to which the variables have been assigned, and the covariates selected for each cluster. For
each cluster, the names of the members are printed, along with the method used for imputation (in most cases “pmm”—
predictive mean matching; variables without missing values are indicated by empty quotes). Prefixes “l”, “e”, and “d”
indicate spatial lags for, respectively, linguistic, ecological, and geographic proximity. Additionally, those variables that
could not be imputed, due to perfect multicollinearity, are indicated as each cluster is processed.
Squared terms are then created for those variables with at least three unique values, and with maximum values below 1000.
The squared variables are indicated by the “sq” suffix on the original variable name (e.g., “SCCS.v72sq” is the square of
“SCCS.v72”).
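The rule for squared terms can be sketched in a few lines of base R. The helper add_squares and the toy dataframe are hypothetical; the actual doMI code may apply the rule differently (for example, in its handling of missing values).

```r
# Hypothetical helper: add a squared term for each numeric variable that has
# at least three unique values and a maximum below 1000.
add_squares <- function(d) {
  for (nm in names(d)) {            # names(d) is evaluated once, before columns are added
    x <- d[[nm]]
    if (is.numeric(x) &&
        length(unique(x[!is.na(x)])) >= 3 &&
        max(x, na.rm = TRUE) < 1000) {
      d[[paste0(nm, "sq")]] <- x^2  # "sq" suffix on the original name
    }
  }
  d
}

toy <- data.frame(SCCS.v72 = c(1, 2, 3, 4),        # qualifies
                  dummy    = c(0, 1, 0, 1),        # only two unique values
                  pop      = c(10, 500, 2000, 30)) # maximum is not below 1000
toy2 <- add_squares(toy)
names(toy2)
```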
The last step is to identify those variables that are perfectly collinear with a linear combination of other variables—users
should consider dropping some of these, so that the problem of perfect multicollinearity does not crop up during estimation.
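One common way to find such variables is to compare the numerical rank of the data matrix with its number of columns. The sketch below uses base R's qr on made-up data; it illustrates the idea and is not necessarily the check that doMI performs.

```r
set.seed(1)
X <- cbind(a = rnorm(20), b = rnorm(20))
X <- cbind(X, c = 2 * X[, "a"] - X[, "b"])  # exact linear combination of a and b
M <- cbind(Intercept = 1, X)

q <- qr(M)  # QR decomposition with limited column pivoting
# columns pivoted beyond the numerical rank are linear combinations of the others
collinear <- if (q$rank < ncol(M)) {
  colnames(M)[q$pivot[(q$rank + 1):ncol(M)]]
} else {
  character(0)
}
collinear
```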
Note
Based on the methods proposed by Malcolm M. Dow and E. Anthon Eff.
Author(s)
Anthon Eff
Anthon.Eff@mtsu.edu
Examples
See under doOLS for complete workflow example.
doOLS
Estimate OLS model on multiply imputed data
Description
The function estimates an unrestricted and restricted OLS model, with network lag term, providing common diagnostics.
Usage
h<-doOLS(smi, depvar, indpv, rindpv=NULL, othexog=NULL, dw=TRUE, lw=TRUE, stepW=FALSE, relimp=FALSE,
slmtests=FALSE)
Arguments
smi a multiply imputed dataset, created by the function doMI
depvar the name of the dependent variable (must be in smi)
indpv the names of the independent variables for the unrestricted model (must be in smi)
rindpv names of restricted model independent variables (must be in indpv; when default of NULL is executed, the restricted model independent variables will be the same as the unrestricted model, minus the last variable)
othexog names of additional exogenous variables (must be in smi; will be added to a list of 21 variables; default is NULL)
dw Should geographic proximity be used in constructing composite weight matrix (default=TRUE)
lw Should linguistic proximity be used in constructing composite weight matrix (default=TRUE)
stepW Should stepwise regression be done to show most-selected variables from unrestricted model (default=FALSE)
relimp Should relative importance be calculated for independent variables of restricted model (default=FALSE)
slmtests Should spatial lag tests be run for the four weight matrices (default=FALSE)
Value
Returns a list with 11 elements:
DependVarb Identification of dependent variable
URmodel Coefficient estimates from the unrestricted model (includes standardized coefficients and VIFs)
Rmodel Coefficient estimates from the restricted model
RmodelRobust Coefficient estimates from the restricted model with robust SEs
Diagnostics Regression diagnostics for the restricted model (RESET test; Wald test on model restrictions; Breusch-Pagan heteroskedasticity test; Shapiro-Wilk test for normality of residuals; Hausman tests for endogeneity of independent variables).
OtherStats Other statistics: Composite weight matrix weights (see details); R2 for all models (model creating instrument for network lag term; restricted model; unrestricted model); number of imputations; number of observations.
DescripStats Descriptive statistics for variables in unrestricted model.
dfbetas Influential observations for dfbetas (see details)
totry Character string of variables that were most significant in the unrestricted model as well as additional variables that proved significant using the add1 function on the restricted model.
didwell Character string of variables that were most significant in the unrestricted model.
interacts Character string of interaction variables that proved significant using the add1 function on the restricted model.
Details
Users can choose two kinds of proximity/similarity weight matrices for constructing a network lag term: geographic and
linguistic. In most cases, users should choose both (the defaults). The optimal composite weight matrix, constructed as the
weighted sum of the weight matrices, is that which maximizes unrestricted model R2. The network lag term is entered in
each model as the variable “Wy”.
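The idea of a composite weight matrix can be sketched with a simple grid search. Everything here is an assumption made for illustration: the matrices D and L are random stand-ins for the geographic and linguistic matrices, the weights are taken to sum to one, and Wy enters the regression directly, whereas doOLS builds an instrument for it.

```r
set.seed(42)
n <- 30
# zero the diagonal, then scale each row to sum to one
rownorm <- function(m) { diag(m) <- 0; m / rowSums(m) }

D <- rownorm(matrix(runif(n * n), n, n))  # stand-in for geographic proximity
L <- rownorm(matrix(runif(n * n), n, n))  # stand-in for linguistic proximity
y <- rnorm(n)
x <- rnorm(n)

# try weights d and l = 1 - d on a grid; keep the pair giving the highest R2
grid <- seq(0, 1, by = 0.1)
r2 <- sapply(grid, function(d) {
  W  <- d * D + (1 - d) * L   # composite weight matrix
  Wy <- as.vector(W %*% y)    # network lag term
  summary(lm(y ~ Wy + x))$r.squared
})
best_d <- grid[which.max(r2)]
W_best <- best_d * D + (1 - best_d) * L
```

Because both inputs are row-normalized and the weights sum to one, the rows of the composite matrix also sum to one.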
The dfbetas are scaled changes in coefficient estimates caused by adding an observation to the model. Only the most
influential dfbetas are output.
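Base R computes these quantities with the dfbetas function, which measures the scaled change in each coefficient when an observation is left out. The sketch below uses the built-in mtcars data; the cutoff of 2/sqrt(n) is a common rule of thumb assumed here, not necessarily the one doOLS applies.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# one row per observation, one column per coefficient
db <- dfbetas(fit)

# flag observation/coefficient pairs beyond an assumed rule-of-thumb cutoff
cutoff <- 2 / sqrt(nrow(mtcars))
influential <- which(abs(db) > cutoff, arr.ind = TRUE)
head(influential)
```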
The stepwise procedure can provide additional insight on which independent variables provide the best model fit. Since the
imputed datasets differ slightly from each other, the variables selected by a stepwise procedure typically differ slightly for
each imputed dataset. If the stepW=TRUE option is chosen, a column labeled “stepkept” will be added to the table reporting
unrestricted model results. The column reports the number of times the independent variable was retained in the model by a
stepwise procedure using both forward and backward selection.
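The counting behind a “stepkept”-style column can be sketched with the base R step function. The data and the five perturbed copies standing in for imputed datasets are made up, and step selects by AIC, which may differ from the criterion used in doOLS.

```r
set.seed(7)
n <- 60
base <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
base$y <- 1 + 2 * base$x1 + rnorm(n)

# hypothetical "imputations": copies of the data with slightly different x2
imps <- lapply(1:5, function(i) {
  d <- base
  d$x2 <- d$x2 + rnorm(n, sd = 0.1)
  d
})

vars <- c("x1", "x2", "x3")
kept <- sapply(imps, function(d) {
  full <- lm(y ~ x1 + x2 + x3, data = d)
  sel  <- step(full, direction = "both", trace = 0)  # forward and backward
  vars %in% names(coef(sel))
})

# number of imputations in which each variable was retained
stepkept <- setNames(rowSums(kept), vars)
stepkept
```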
The add1 function tests whether the members of a list of variables prove significant when added singly to a model. The list
of variables includes all numeric variables in the imputed dataset, as well as squared terms of variables currently in the
unrestricted regression. Variables proving significant in over 80 percent of the imputations are returned in the character
string “totry”.
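A single-dataset version of this test can be run with add1 from base R. The sketch uses mtcars and a 0.05 significance level, both chosen only for illustration; doOLS repeats the test on every imputed dataset and applies the 80 percent rule described above.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# test each candidate when added singly to the current model
cand <- add1(fit, scope = ~ wt + hp + qsec + drat, test = "F")

p   <- cand[["Pr(>F)"]]                       # NA for the "<none>" row
sig <- rownames(cand)[!is.na(p) & p < 0.05]   # candidates worth trying
sig
```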
Relative importance is a method of assigning R2 to each independent variable. The method repeatedly estimates a model,
first with one independent variable, then with two, etc. and calculates the change in R2 as each variable is introduced. The
order of entry is changed, and the process repeated, to consider all possible orders of entry. The relative importance measure
is the average change in R2 across all these different models. With large numbers of independent variables, the calculations
are prohibitively slow. Setting relimp=TRUE will calculate the relative importance of independent variables in the restricted
model, and report these in the column labeled “relimp”.
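The averaging over orders of entry can be written out directly for a small model. This base R sketch uses mtcars with three predictors (six orders of entry); the helper r2 is hypothetical. The workflow example loads the relaimpo package, which provides this kind of calculation for larger models.

```r
# R2 of a model with the given predictors (0 for the empty model)
r2 <- function(vars, d) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, response = "mpg"), data = d))$r.squared
}

vars  <- c("wt", "hp", "qsec")
# all 3! = 6 orders of entry
perms <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
               c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

gain <- matrix(0, nrow(perms), length(vars), dimnames = list(NULL, vars))
for (i in seq_len(nrow(perms))) {
  ord <- vars[perms[i, ]]
  for (k in seq_along(ord)) {
    # change in R2 when ord[k] enters after the k-1 variables before it
    gain[i, ord[k]] <- r2(ord[seq_len(k)], mtcars) -
                       r2(ord[seq_len(k - 1)], mtcars)
  }
}

# relative importance: average change in R2 across all orders of entry
relimp <- colMeans(gain)
relimp
```

The shares sum to the R2 of the full model, since the gains telescope within each order of entry. The number of orders grows factorially with the number of predictors, which is why the calculation becomes slow for large models.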
Note
Based on the methods proposed by Malcolm M. Dow and E. Anthon Eff.
Author(s)
Anthon Eff
Anthon.Eff@mtsu.edu
Examples
library(mice)
library(foreign)
library(stringr)
library(psych)
library(AER)
library(relaimpo)
library(geosphere)
library(spdep)
#--bring in functions and data--
load(url("http://dl.dropbox.com/u/9256203/DEz2.Rdata"),.GlobalEnv)
ls() #--can see the objects contained in DEz2.Rdata
#--list and modify variables for use in model--
#--make new variables--
xcd$SCCS.valchild<-(xcd$SCCS.v473+xcd$SCCS.v474+xcd$SCCS.v475+xcd$SCCS.v476)
#--create descriptions for new variables--
addesc("SCCS.valchild","Degree to which society values children")
addesc("Wy","Network lag term")
#--create new dummy variables--
xcd<-cbind(xcd,mkdummy("SCCS","v899",1))
#--identify variables to keep for model building--
ev<-c("v30","v78")
lv<-c("group2","hunting","gatherin","fishing","huntfil2",
"war1","reven","nomov","dismov","store","subdiv2")
sv<-c("v1685","v72","v234","v236","v238","v1648","v899d1",
"valchild","v1260","v79","v80","v81","v872","v871")
wv<-c("v284","v285","v286","v288","v289","v135")
#--make imputed data--
smi<-doMI(EAvs=ev,LRBvs=lv,SCCSvs=sv,WNAIvs=wv,nimp=5,maxit=5)
names(smi) #--can see which variables are available
smi$LRB.lngroup2<-log(smi$LRB.group2)
xcd$LRB.lngroup2<-log(xcd$LRB.group2)
addesc("LRB.lngroup2","Natural log of LRB.group2")
#--identify role of variables in model--
dV<-"LRB.lngroup2"
riv<-uiv<-c("SCCS.v21","WNAI.v135","LRB.revensq","LRB.subdiv2","LRB.war1sq")
h<-doOLS(smi,depvar=dV,indpv=c("SCCS.v1260",uiv),
rindpv=riv,othexog=NULL,dw=TRUE,lw=TRUE,
stepW=TRUE,relimp=TRUE,slmtests=FALSE)
print(h)
#--print output to csv file--
CSVwrite(h,"myOutput",FALSE)
keyf
keyfile dataset
Description
The data.frame keyf contains information about variables from four ethnographic datasets: EA, LRB, SCCS, and WNAI.
Format
rownames Variable names from the data.frame xcd
variable Variable names as given within the ethnographic dataset (EA, LRB, SCCS, or WNAI)
type Variable type (“ordinal” or “categorical”)
description Variable description
NOTmissing Number of non-missing values for variable
class Variable class (“character” or “numeric”)
nUniqVals Number of unique data values for variable
FNOTmissing For the factor version of the ethnographic dataset: Number of non-missing values for variable
Fclass For the factor version of the ethnographic dataset: Variable class (“character”, “factor”, “integer”, or “numeric”)
FnUniqVals For the factor version of the ethnographic dataset: Number of unique data values for variable
db Source ethnographic dataset (EA, LRB, SCCS, or WNAI). GIS data is indicated as “gisX”.
levels Factor levels for variables defined as factors in the factor version (and with fewer than 20 factor levels).
Examples
head(keyf)
mkdummy
Make dummy variable and store a description in key file
Description
The function makes a dummy variable from a variable in the data.frame xcd, and creates a description stored in the
data.frame keyf.
Usage
mkdummy(dsn,vv,val)
Arguments
dsn name of an ethnographic dataset (EA, LRB, SCCS, or WNAI)
vv name of a variable from the specified ethnographic dataset
val the value of variable vv for which the dummy equals one.
Value
The function returns a variable named dsn.vvdval, which equals one when xcd$dsn.vv==val, and equals zero otherwise.
Details
The main reason to use this function is that it will automatically append a description for the dummy variable to the key file,
which is then available for use in doOLS output. The description is created using the variable name from the key file and the
description of the value from the “levels” variable in the data.frame keyf.
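The dummy itself amounts to a single comparison. The sketch below shows only that step, on a made-up column; it does not reproduce the key file update, and how mkdummy treats missing values is not stated here, so the NA carried through below is an assumption of plain R comparison.

```r
# Toy stand-in for one column of xcd (values are made up)
xcd_toy <- data.frame(SCCS.v899 = c(1, 2, NA, 1, 3))

val <- 1
# naming convention dsn.vvdval: dataset "SCCS", variable "v899", value 1
xcd_toy$SCCS.v899d1 <- as.numeric(xcd_toy$SCCS.v899 == val)
xcd_toy
```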
Note
Author(s)
Anthon Eff
Anthon.Eff@mtsu.edu
Examples
See under doOLS for complete workflow example.
mkwtmat
Make and format three weight matrices for the societies in data.frame xcd
Description
The function makes and formats three weight matrices (geographic, linguistic, and ecological) for the societies in data.frame
xcd.
Usage
mkwtmat()
Value
The function returns three matrices:
ddm Geographic proximity, based on the latitude and longitude fields in data.frame xcd. Each cell is the inverse of the squared distance between the row society and column society. The diagonal is set to zero, and then the rows are normalized so that their sum equals one.
eem Ecological proximity, based on the Euclidean distance between societies in the 22-dimensional space defined by 19 climate variables, two altitude variables, and one measure of net primary productivity (all variables scaled to standard normal before distances are calculated). Each cell is exp(-d), where d is the distance between the row society and column society. The diagonal is set to zero, and then the rows are normalized so that their sum equals one.
llm Linguistic proximity between each row and column society. This matrix is not created by the function, but only row-normalized.
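The two constructions described above can be sketched on made-up data. Several simplifications are assumed: plain Euclidean distance on toy coordinates stands in for geographic distance (the workflow example loads the geosphere package, which suggests a geodesic distance is used in practice), and three random variables stand in for the 22 ecological variables. The helper rownorm is hypothetical.

```r
set.seed(3)
# zero the diagonal, then scale each row to sum to one
rownorm <- function(m) { diag(m) <- 0; m / rowSums(m) }

# geographic sketch: inverse squared distance between four toy locations
xy <- cbind(lon = c(0, 1, 3, 6), lat = c(0, 2, 1, 5))
d_geo   <- as.matrix(dist(xy))
ddm_toy <- rownorm(1 / d_geo^2)  # the infinite diagonal is zeroed in rownorm

# ecological sketch: exp(-d) on standardized variables
eco     <- scale(matrix(rnorm(4 * 3), 4, 3))
d_eco   <- as.matrix(dist(eco))
eem_toy <- rownorm(exp(-d_eco))

round(ddm_toy, 3)
```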
Details
Since the geographic and ecological matrices are relatively fast to compute, but very large, it is more efficient to create them
than to load an already constructed matrix. The linguistic matrix, on the other hand, takes a very long time to compute, but
is small (many fewer unique values) and is therefore loaded with the other data and only row-normalized in this function.
The function is run one time in the doMI function, making the matrices available both in the function and in the general
environment.
Note
Author(s)
Anthon Eff
Anthon.Eff@mtsu.edu
Examples
See under doOLS for complete workflow example.
xcd
Cross cultural dataset
Description
The data.frame xcd contains the variables from four ethnographic datasets: EA, LRB, SCCS, and WNAI. The number of
societies represented in each of the datasets is 1267 (EA), 339 (LRB), 186 (SCCS), and 172 (WNAI), for a total of 1964
records in the four datasets. However, some societies appear in more than one dataset (1090 appear only in one; 257 appear
in two; 108 appear in three; and nine appear in all four), so there are 1464 unique societies. The data.frame xcd therefore
contains 1464 observations and 2916 variables: 111 from EA; 262 from LRB; 2055 from SCCS; 440 from WNAI; and 48
that are drawn from GIS data.
Format
For each variable drawn from an ethnographic dataset, the variable name is “XX.vv” where “XX” is the name of the
ethnographic dataset, and “vv” is the name of the variable in that dataset. For example, variable “v207” from SCCS is
named “SCCS.v207”.
Examples
dim(xcd)
Selected output
[1] "--make eem (ecological similarity weight matrix)----"
[1] "--make ddm (geographic proximity weight matrix)----"
[1] "--format llm (linguistic proximity weight matrix)----"
[1] "--assembling data to be used in imputation--"
[1] "Please check the codebook to see if 'EA.v11' is ordinal"
[2] "Please check the codebook to see if 'LRB.systate3' is ordinal"
[3] "Please check the codebook to see if 'LRB.fres1' is ordinal"
[4] "Please check the codebook to see if 'WNAI.v289' is ordinal"
[5] "Please check the codebook to see if 'WNAI.v308' is ordinal"
[6] "Please check the codebook to see if 'SCCS.v69' is ordinal"
[1] "--making spatially lagged variables for use as covariates--"
[1] "--using cluster analysis to organize variables in similar groups--"
[1] "---Multiple imputation begins---"
[1] "--Cluster--1--"
       EA.v30  LRB.systate3   LRB.density     WNAI.v284     WNAI.v286     WNAI.v288
        "pmm"         "pmm"         "pmm"         "pmm"         "pmm"         "pmm"
    WNAI.v289     WNAI.v216     SCCS.v234     SCCS.v236      SCCS.v80    SCCS.v1130
        "pmm"         "pmm"         "pmm"         "pmm"         "pmm"         "pmm"
   dSCCS.v236       dEA.v30         bio.8  lLRB.density         bio.7    eSCCS.v236
           ""            ""            ""            ""            ""            ""
      lEA.v30 dLRB.systate3
           ""            ""
  it im co dep      meth       out
1  0  0  0     collinear SCCS.v234
[1] "--Cluster--2--"
       EA.v78        EA.v11         EA.v1         EA.v2   LRB.hunting  LRB.gatherin
        "pmm"         "pmm"         "pmm"         "pmm"         "pmm"         "pmm"
 LRB.huntfil2     LRB.reven     LRB.nomov    LRB.dismov   LRB.subdiv2     WNAI.v285
        "pmm"         "pmm"         "pmm"         "pmm"         "pmm"         "pmm"
    WNAI.v204     WNAI.v211    SCCS.v1685      SCCS.v72   SCCS.v899d1 SCCS.valchild
        "pmm"         "pmm"         "pmm"         "pmm"         "pmm"         "pmm"
   SCCS.v1260      SCCS.v79     SCCS.v872     SCCS.v871      SCCS.v69         bio.3
        "pmm"         "pmm"         "pmm"         "pmm"         "pmm"            ""
       bio.14        bio.16        bio.10        bio.15         bio.2        bio.19
           ""            ""            ""            ""            ""            ""
       bio.11
           ""
[1] "--Cluster--3--"
     EA.v31       EA.v4       EA.v5   WNAI.v193   SCCS.v238  SCCS.v1122      dEA.v4
      "pmm"       "pmm"       "pmm"       "pmm"       "pmm"       "pmm"          ""
    lEA.v31      lEA.v4      eEA.v4 eSCCS.v1122  dSCCS.v238     eEA.v31
         ""          ""          ""          ""          ""          ""
[1] "--Cluster--4--"
      EA.v3  LRB.group2 LRB.fishing    LRB.war1   LRB.store   LRB.fres1   LRB.tlpop
      "pmm"       "pmm"       "pmm"       "pmm"       "pmm"       "pmm"       "pmm"
  WNAI.v197   WNAI.v308  SCCS.v1648    SCCS.v81    SCCS.v21       bio.2      bio.17
      "pmm"       "pmm"       "pmm"       "pmm"       "pmm"          ""          ""
     bio.12      bio.19      bio.10       mnnpp       bio.1
         ""          ""          ""          ""          ""
[1] "--checking for perfect multicollinearity among variables--"
[1] "--creating squared terms--"
Time difference of 8.670014 mins
dEA.v31
""
WNAI.v135
"pmm"
bio.14
""
$DependVarb
[1] "Dependent variable='LRB.lngroup2': Natural log of LRB.group2"
$URmodel
            desc
(Intercept) <NA>
Wy          Network lag term
SCCS.v1260  Total Pathogen Stress
SCCS.v21    Food Surplus via Storage
WNAI.v135   Quantity of fish available in tribal territory: Average annual production in pounds per square mile of territory
LRB.revensq Unevenness in rainfall across seasons; (Equation: 4.04); (Binford 2001:70)--squared
LRB.subdiv2 Subsistence diversity; (Equation: 100-stddev("hunting","gatherin","fishing") ); (Binford 2001:403,fn2)
LRB.war1sq  Scale of intensity of warfare. How frequent and how widespread it may be regionally.--squared
                    coef stdcoef pvalue star   VIF stepkept
(Intercept)  0.787672173      NA  0.193         NA        5
Wy           0.882333522   0.462  0.000  *** 2.508        5
SCCS.v1260  -0.005695942  -0.029  0.625      1.372        0
SCCS.v21     0.035928589   0.035  0.506      1.096        2
WNAI.v135   -0.011789286  -0.039  0.532      1.521        0
LRB.revensq  0.016588887   0.084  0.057    * 1.143        5
LRB.subdiv2 -0.007208441  -0.109  0.012   ** 1.122        5
LRB.war1sq   0.038152870   0.261  0.000  *** 1.658        5
$Rmodel
            desc
(Intercept) <NA>
Wy          Network lag term
SCCS.v21    Food Surplus via Storage
WNAI.v135   Quantity of fish available in tribal territory: Average annual production in pounds per square mile of territory
LRB.revensq Unevenness in rainfall across seasons; (Equation: 4.04); (Binford 2001:70)--squared
LRB.subdiv2 Subsistence diversity; (Equation: 100-stddev("hunting","gatherin","fishing") ); (Binford 2001:403,fn2)
LRB.war1sq  Scale of intensity of warfare. How frequent and how widespread it may be regionally.--squared
                    coef stdcoef pvalue star   VIF relimp
(Intercept)  0.632860743      NA  0.221         NA     NA
Wy           0.907502623   0.475  0.000  *** 2.178  0.232
SCCS.v21     0.036305461   0.036  0.503      1.086  0.015
WNAI.v135   -0.010477738  -0.034  0.562      1.491  0.032
LRB.revensq  0.016393294   0.083  0.056    * 1.132  0.004
LRB.subdiv2 -0.007278972  -0.110  0.010   ** 1.099  0.030
LRB.war1sq   0.037348487   0.255  0.000  *** 1.612  0.144
$RmodelRobust
            desc
(Intercept) <NA>
Wy          Network lag term
SCCS.v21    Food Surplus via Storage
WNAI.v135   Quantity of fish available in tribal territory: Average annual production in pounds per square mile of territory
LRB.revensq Unevenness in rainfall across seasons; (Equation: 4.04); (Binford 2001:70)--squared
LRB.subdiv2 Subsistence diversity; (Equation: 100-stddev("hunting","gatherin","fishing") ); (Binford 2001:403,fn2)
LRB.war1sq  Scale of intensity of warfare. How frequent and how widespread it may be regionally.--squared
                    coef stdcoef pvalue star   VIF relimp
(Intercept)  0.632860743      NA  0.229         NA     NA
Wy           0.907502623   0.475  0.000  *** 2.178  0.232
SCCS.v21     0.036305461   0.036  0.499      1.086  0.015
WNAI.v135   -0.010477738  -0.034  0.546      1.491  0.032
LRB.revensq  0.016393294   0.083  0.013   ** 1.132  0.004
LRB.subdiv2 -0.007278972  -0.110  0.013   ** 1.099  0.030
LRB.war1sq   0.037348487   0.255  0.000  *** 1.612  0.144
$Diagnostics
                                                   Fstat        df pvalue star
RESET test. H0: model has correct functional form -0.079    28.074  1.000
Wald test. H0: appropriate variables dropped       0.510   119.340  0.477
Breusch-Pagan test. H0: residuals homoskedastic    9.014    62.835  0.004  ***
Shapiro-Wilkes test. H0: residuals normal         17.492  9933.851  0.000  ***
Hausman test. H0: Wy exogenous                     0.014    37.000  0.906
Hausman test. H0: SCCS.v21 exogenous               0.088    10.000  0.773
Hausman test. H0: WNAI.v135 exogenous              0.010    25.000  0.923
Hausman test. H0: LRB.revensq exogenous            0.017 15398.000  0.898
Hausman test. H0: LRB.subdiv2 exogenous            0.896  2295.000  0.344
Hausman test. H0: LRB.war1sq exogenous             0.000   253.000  0.996
$OtherStats
    d   l R2.IV.composite. R2.final.model R2.UR.model R2.final.ln.adj. nimp nobs
1 0.9 0.1        0.6984903      0.5213928   0.5220529        0.4580212    5  297