Author: Anthon Eff. Version: CCDmanual0dw.docx

addesc    Add a variable description to the key file

Description

The function adds a variable description to the key file. This is useful when a new variable has been created whose description is not yet in the key file. The description is then available for use in doOLS output.

Usage

addesc(nvbs,nvbsdes,dsn=NULL)

Arguments

nvbs       name of variable
nvbsdes    description of nvbs
dsn        name of the data set the variable is based upon (“EA”, “LRB”, “SCCS”, “WNAI”)

Value

The function appends the description to the key file.

Author(s)

Anthon Eff Anthon.Eff@mtsu.edu

Examples

See under doOLS for complete workflow example.

CSVwrite    Write object to *.csv file

Description

The function writes an object, with elements capable of being coerced to a dataframe, to a csv file. It is used to write the output from doOLS to a file that can be read by a spreadsheet.

Usage

CSVwrite(a1,a2,a3=FALSE)

Arguments

a1    Object to be written, typically output from function doOLS
a2    The base name of the *.csv file (do not include the “.csv” extension)
a3    Should the object be appended to the existing file (default=FALSE)

Value

Nothing is returned to the R environment; the only effect is on the specified *.csv file.

Details

Set the option a3=TRUE to append the output of object a1 to an existing file with base name a2. The default (a3=FALSE) overwrites any existing csv file with base name a2.

Note

Like the write.csv function, except that CSVwrite can append values to an existing csv file and can write the elements of a list to a csv file.

Author(s)

Anthon Eff Anthon.Eff@mtsu.edu

Examples

See under doOLS for complete workflow example.

doMI    Produce multiple imputed data sets

Description

The function produces multiply imputed data sets from the ethnographic data (EA, LRB, SCCS, and WNAI), using methods from the mice package.
Usage

smi<-doMI(EAvs=NULL,LRBvs=NULL,SCCSvs=NULL,WNAIvs=NULL,nimp=10,maxit=7)

Arguments

EAvs      character vector containing names of variables from EA dataset
LRBvs     character vector containing names of variables from LRB dataset
SCCSvs    character vector containing names of variables from SCCS dataset
WNAIvs    character vector containing names of variables from WNAI dataset
nimp      the number of imputed data sets to create (default=10)
maxit     the number of iterations used to estimate imputed data (default=7)

Value

The function doMI returns a dataframe containing the number of imputed datasets specified by the nimp option. The datasets are stacked one atop the other and indexed by the variable “.imp”.

Details

This function imputes several new datasets. For each variable, covariates are used to create a conditional distribution of estimates for each missing value, and the missing value is then replaced with a draw from that distribution. As a result, each of the imputed datasets will typically have slightly different values for the estimated cells.

The key to successful imputation is to have good covariates for each variable. The function doMI begins the search for good covariates by grouping each variable in a cluster of collinear variables. For each cluster, the best covariates are selected from a set of variables with no missing values, including both network lag variables (based on geographic distance, language, and ecology) and climate and ecology variables.

The first four arguments are character vectors of variable names from the four ethnographic data sets (EA, LRB, SCCS, and WNAI). These are the data used in model building. One should include all variables that might be useful, but no more, since each additional variable adds to the time the procedure takes to run. The fifth argument is the number of imputed datasets to create: between 5 and 10 imputed data sets are considered adequate, but there is no harm in choosing more; the default is 10. The final argument is the number of iterations to perform in creating each imputed dataset; the default is 7.

It is not usually necessary to examine the returned dataframe: it is used in estimating the model but is of little interest in itself. Nevertheless, some output is written to the console as the function executes, to provide information about the clusters to which the variables have been assigned and the covariates selected for each cluster. For each cluster, the names of the members are printed, along with the method used for imputation (in most cases “pmm”, predictive mean matching; variables without missing values are indicated by empty quotes). Prefixes “l”, “e”, and “d” indicate spatial lags for, respectively, linguistic, ecological, and geographic proximity. Variables that could not be imputed, due to perfect multicollinearity, are indicated as each cluster is processed.

Squared terms are then created for those variables with at least three unique values and with maximum values below 1000. The squared variables are indicated by the “sq” suffix on the original variable name (e.g., “SCCS.v72sq” is the square of “SCCS.v72”).

The last step is to identify those variables that are perfectly collinear with a linear combination of other variables. Users should consider dropping some of these, so that perfect multicollinearity does not crop up during estimation.

Note

Based on the methods proposed by Malcolm M. Dow and E. Anthon Eff.

Author(s)

Anthon Eff Anthon.Eff@mtsu.edu

Examples

See under doOLS for complete workflow example.

doOLS    Estimate OLS model on multiply imputed data

Description

The function estimates an unrestricted and a restricted OLS model, each with a network lag term, and provides common diagnostics.
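Because doMI returns the imputed data sets stacked and indexed by “.imp”, a model estimated on multiply imputed data is fitted once per imputation. The sketch below illustrates the idea in base R on a small made-up data frame. The data, the variable names y and x, and the simple averaging of coefficients are assumptions for illustration only; this manual does not state how doOLS combines results across imputations.

```r
# Toy stand-in for the stacked data frame returned by doMI (made-up values).
set.seed(1)
nimp <- 3
n <- 20
stacked <- do.call(rbind, lapply(seq_len(nimp), function(i) {
  x <- rnorm(n)
  data.frame(.imp = i, x = x, y = 1 + 2 * x + rnorm(n, sd = 0.1))
}))

# Split on .imp and fit the same OLS model to each imputed data set.
fits <- lapply(split(stacked, stacked$.imp), function(d) lm(y ~ x, data = d))

# One column of coefficient estimates per imputation.
coefs <- sapply(fits, coef)

# Average the point estimates across imputations (illustrative assumption only).
pooled <- rowMeans(coefs)
print(pooled)
```

Keeping the imputations stacked in one data frame means the same formula can be applied to every imputed data set with a single split on “.imp”.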
Usage

h<-doOLS(smi, depvar, indpv, rindpv=NULL, othexog=NULL, dw=TRUE, lw=TRUE,
stepW=FALSE, relimp=FALSE, slmtests=FALSE)

Arguments

smi         a multiply imputed dataset, created by the function doMI
depvar      the name of the dependent variable (must be in smi)
indpv       the names of the independent variables for the unrestricted model (must be in smi)
rindpv      names of restricted model independent variables (must be in indpv; with the default of NULL, the restricted model uses the same independent variables as the unrestricted model, minus the last variable)
othexog     names of additional exogenous variables (must be in smi; will be added to a list of 21 variables; default is NULL)
dw          Should geographic proximity be used in constructing composite weight matrix (default=TRUE)
lw          Should linguistic proximity be used in constructing composite weight matrix (default=TRUE)
stepW       Should stepwise regression be done to show most-selected variables from unrestricted model (default=FALSE)
relimp      Should relative importance be calculated for independent variables of restricted model (default=FALSE)
slmtests    Should spatial lag tests be run for the four weight matrices (default=FALSE)

Value

Returns a list with 11 elements:

DependVarb      Identification of dependent variable
URmodel         Coefficient estimates from the unrestricted model (includes standardized coefficients and VIFs)
Rmodel          Coefficient estimates from the restricted model
RmodelRobust    Coefficient estimates from the restricted model with robust SEs
Diagnostics     Regression diagnostics for the restricted model (RESET test; Wald test on model restrictions; Breusch-Pagan heteroskedasticity test; Shapiro-Wilk test for normality of residuals; Hausman tests for endogeneity of independent variables)
OtherStats      Other statistics: composite weight matrix weights (see Details); R2 for all models (model creating instrument for network lag term; restricted model; unrestricted model); number of imputations; number of observations
DescripStats    Descriptive statistics for variables in unrestricted model
dfbetas         Influential observations for dfbetas (see Details)
totry           Character string of variables that were most significant in the unrestricted model, as well as additional variables that proved significant using the add1 function on the restricted model
didwell         Character string of variables that were most significant in the unrestricted model
interacts       Character string of interaction variables that proved significant using the add1 function on the restricted model

Details

Users can choose two kinds of proximity/similarity weight matrices for constructing a network lag term: geographic and linguistic. In most cases, users should choose both (the defaults). The optimal composite weight matrix, constructed as the weighted sum of the weight matrices, is the one that maximizes the R2 of the unrestricted model. The network lag term is entered in each model as the variable “Wy”.

The dfbetas are scaled changes in coefficient estimates caused by adding an observation to the model. Only the most influential dfbetas are output.

The stepwise procedure can provide additional insight into which independent variables provide the best model fit. Since the imputed datasets differ slightly from each other, the variables selected by a stepwise procedure typically differ slightly for each imputed dataset. If the stepW=TRUE option is chosen, a column labeled “stepkept” is added to the table reporting unrestricted model results. The column reports the number of times the independent variable was retained in the model by a stepwise procedure using both forward and backward selection.

The add1 function tests whether the members of a list of variables prove significant when added singly to a model. The list of variables includes all numeric variables in the imputed dataset, as well as squared terms of variables currently in the unrestricted regression.
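The add1 test described above can be sketched on a single data set with base R. The built-in mtcars data stands in for one imputed data set, and the candidate variables and the 0.05 threshold are assumptions chosen for illustration; doOLS applies its own candidate list across every imputation.

```r
# Base model on a built-in data set (mtcars stands in for one imputed data set).
fit <- lm(mpg ~ wt, data = mtcars)

# Test each candidate when added singly; '.' keeps the terms already in the model.
res <- add1(fit, scope = ~ . + hp + qsec + drat, test = "F")
print(res)

# Candidates whose F-test p-value falls below an assumed 0.05 threshold.
pvals <- res[["Pr(>F)"]]
names(pvals) <- rownames(res)
signif_vars <- names(pvals)[!is.na(pvals) & pvals < 0.05]
print(signif_vars)
```

Repeating this on each imputed data set and counting how often a candidate clears the threshold gives a tally that can be compared across imputations.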
Variables proving significant in over 80 percent of the imputations are returned in the character string “totry”.

Relative importance is a method of assigning R2 to each independent variable. The method repeatedly estimates a model, first with one independent variable, then with two, and so on, and calculates the change in R2 as each variable is introduced. The order of entry is then changed and the process repeated, so that all possible orders of entry are considered. The relative importance measure is the average change in R2 across all these different models. With large numbers of independent variables, the calculations are prohibitively slow. Setting relimp=TRUE will calculate the relative importance of independent variables in the restricted model and report these in the column labeled “relimp”.

Note

Based on the methods proposed by Malcolm M. Dow and E. Anthon Eff.

Author(s)

Anthon Eff Anthon.Eff@mtsu.edu

Examples

library(mice)
library(foreign)
library(stringr)
library(psych)
library(AER)
library(relaimpo)
library(geosphere)
library(spdep)
# --bring in functions and data--
load(url("http://dl.dropbox.com/u/9256203/DEz2.Rdata"),.GlobalEnv)
ls() #--can see the objects contained in DEz2.Rdata
# --list and modify variables for use in model--
# --make new variables--
xcd$SCCS.valchild<-(xcd$SCCS.v473+xcd$SCCS.v474+xcd$SCCS.v475+xcd$SCCS.v476)
# --create descriptions for new variables--
addesc("SCCS.valchild","Degree to which society values children")
addesc("Wy","Network lag term")
# --create new dummy variables--
xcd<-cbind(xcd,mkdummy("SCCS","v899",1))
# --identify variables to keep for model building--
ev<-c("v30","v78")
lv<-c("group2","hunting","gatherin","fishing","huntfil2",
"war1","reven","nomov","dismov","store","subdiv2")
sv<-c("v1685","v72","v234","v236","v238","v1648","v899d1",
"valchild","v1260","v79","v80","v81","v872","v871")
wv<-c("v284","v285","v286","v288","v289","v135")
# --make imputed data--
smi<-doMI(EAvs=ev,LRBvs=lv,SCCSvs=sv,WNAIvs=wv,nimp=5,maxit=5)
names(smi) #--can see which variables are available
smi$LRB.lngroup2<-log(smi$LRB.group2)
xcd$LRB.lngroup2<-log(xcd$LRB.group2)
addesc("LRB.lngroup2","Natural log of LRB.group2")
# --identify role of variables in model--
dV<-"LRB.lngroup2"
riv<-uiv<-c("SCCS.v21","WNAI.v135","LRB.revensq","LRB.subdiv2","LRB.war1sq")
h<-doOLS(smi,depvar=dV,indpv=c("SCCS.v1260",uiv),
rindpv=riv,othexog=NULL,dw=TRUE,lw=TRUE,
stepW=TRUE,relimp=TRUE,slmtests=FALSE)
print(h)
# --print output to csv file--
CSVwrite(h,"myOutput",FALSE)

keyf    keyfile dataset

Description

The data.frame keyf contains information about variables from four ethnographic datasets: EA, LRB, SCCS, and WNAI.

Format

rownames       Variable names from the data.frame xcd
variable       Variable names as given within the ethnographic dataset (EA, LRB, SCCS, or WNAI)
type           Variable type (“ordinal” or “categorical”)
description    Variable description
NOTmissing     Number of non-missing values for variable
class          Variable class (“character” or “numeric”)
nUniqVals      Number of unique data values for variable
FNOTmissing    For the factor version of the ethnographic dataset: Number of non-missing values for variable
Fclass         For the factor version of the ethnographic dataset: Variable class (“character”, “factor”, “integer”, or “numeric”)
FnUniqVals     For the factor version of the ethnographic dataset: Number of unique data values for variable
db             Source ethnographic dataset (EA, LRB, SCCS, or WNAI). GIS data is indicated as “gisX”.
levels         Factor levels for variables defined as factors in the factor version (and with fewer than 20 factor levels)

Examples

head(keyf)

mkdummy    Make dummy variable and store a description in key file

Description

The function makes a dummy variable from a variable in the data.frame xcd, and creates a description stored in the data.frame keyf.
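A dummy of the kind mkdummy produces can be sketched in base R, following the name SCCS.v899d1 that appears in the doOLS example. The helper make_dummy and the toy data frame are hypothetical. Unlike mkdummy, the sketch does not write a description to keyf, and it leaves missing values as NA, which is an assumption about behaviour that this manual does not state.

```r
# Hypothetical helper: builds a 0/1 dummy named "<dsn>.<vv>d<val>".
make_dummy <- function(df, dsn, vv, val) {
  src <- paste0(dsn, ".", vv)                       # e.g. "SCCS.v899"
  out <- data.frame(as.numeric(df[[src]] == val))   # 1 where equal, 0 otherwise, NA if missing
  names(out) <- paste0(src, "d", val)               # e.g. "SCCS.v899d1"
  out
}

# Made-up values standing in for a column of xcd.
toy <- data.frame(SCCS.v899 = c(1, 2, 1, NA, 3))
toy <- cbind(toy, make_dummy(toy, "SCCS", "v899", 1))
print(toy)
```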
Usage

mkdummy(dsn,vv,val)

Arguments

dsn    name of an ethnographic dataset (EA, LRB, SCCS, or WNAI)
vv     name of a variable from the specified ethnographic dataset
val    the value of variable vv for which the dummy equals one

Value

The function returns a variable named dsn.vvdval, which equals one when xcd$dsn.vv==val and zero otherwise. For example, mkdummy("SCCS","v899",1) returns the variable SCCS.v899d1.

Details

The main reason to use this function is that it automatically appends a description for the dummy variable to the key file, which is then available for use in doOLS output. The description is created using the variable name from the key file and the description of the value from the “levels” variable in the data.frame keyf.

Author(s)

Anthon Eff Anthon.Eff@mtsu.edu

Examples

See under doOLS for complete workflow example.

mkwtmat    Make and format three weight matrices for the societies in data.frame xcd

Description

The function makes and formats three weight matrices (geographic, linguistic, and ecological) for the societies in data.frame xcd.

Usage

mkwtmat()

Value

The function returns three matrices:

ddm    Geographic proximity, based on the latitude and longitude fields in data.frame xcd. Each cell is the inverse of the squared distance between the row society and the column society. The diagonal is set to zero, and the rows are then normalized so that each sums to one.
eem    Ecological proximity, based on the Euclidean distance between societies in the 22-dimensional space defined by 19 climate variables, two altitude variables, and one measure of net primary productivity (all variables scaled to standard normal before distances are calculated). Each cell is exp(-d), where d is the distance between the row society and the column society. The diagonal is set to zero, and the rows are then normalized so that each sums to one.
llm    Linguistic proximity between each row and column society. This matrix is not created by the function; it is only row-normalized.
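The construction of the geographic and ecological matrices described above can be sketched in base R. The coordinates and ecological variables below are made up, the helpers haversine_km and row_normalize are hypothetical, and the use of the haversine formula for distance is an assumption: the manual says only that distances are based on latitude and longitude.

```r
# Made-up coordinates for four societies (degrees), for illustration only.
lat <- c(10, 12, -5, 40)
lon <- c(20, 25, 30, -100)

# Hypothetical helper: great-circle distance in km via the haversine formula.
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical helper: zero the diagonal, then scale each row to sum to one.
row_normalize <- function(m) {
  diag(m) <- 0
  m / rowSums(m)
}

n <- length(lat)
dist_km <- outer(seq_len(n), seq_len(n),
                 function(i, j) haversine_km(lat[i], lon[i], lat[j], lon[j]))

# Geographic proximity: inverse squared distance (the infinite diagonal is reset to 0).
ddm_sketch <- row_normalize(1 / dist_km^2)

# Ecological proximity: exp(-d), d = Euclidean distance between standardized variables.
set.seed(1)
eco_vars <- scale(matrix(rnorm(n * 3), nrow = n))  # 3 made-up variables, not the 22 used
eem_sketch <- row_normalize(exp(-as.matrix(dist(eco_vars))))

print(round(ddm_sketch, 3))
print(round(eem_sketch, 3))
```

Row normalization makes each row a set of weights summing to one, so multiplying the matrix by a variable yields, for each society, a weighted average of the other societies' values.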
Details

Since the geographic and ecological matrices are relatively fast to compute but very large, it is more efficient to create them than to load an already constructed matrix. The linguistic matrix, on the other hand, takes a very long time to compute but is small (many fewer unique values), and is therefore loaded with the other data and only row-normalized in this function. The function is run once within the doMI function, making the matrices available both inside that function and in the global environment.

Author(s)

Anthon Eff Anthon.Eff@mtsu.edu

Examples

See under doOLS for complete workflow example.

xcd    Cross cultural dataset

Description

The data.frame xcd contains the variables from four ethnographic datasets: EA, LRB, SCCS, and WNAI. The number of societies represented in each of the datasets is 1267 (EA), 339 (LRB), 186 (SCCS), and 172 (WNAI), for a total of 1964 records in the four datasets. However, some societies appear in more than one dataset (1090 appear in only one; 257 appear in two; 108 appear in three; and nine appear in all four), so there are 1464 unique societies. The data.frame xcd therefore contains 1464 observations and 2916 variables: 111 from EA; 262 from LRB; 2055 from SCCS; 440 from WNAI; and 48 that are drawn from GIS data.

Format

For each variable drawn from an ethnographic dataset, the variable name is “XX.vv”, where “XX” is the name of the ethnographic dataset and “vv” is the name of the variable in that dataset. For example, variable “v207” from SCCS is named “SCCS.v207”.
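The “XX.vv” naming convention makes it straightforward to pick out the variables that come from one source dataset. A minimal sketch, using a made-up character vector in place of names(xcd):

```r
# Made-up names standing in for names(xcd); they follow the "XX.vv" pattern.
vnames <- c("EA.v30", "LRB.group2", "SCCS.v207", "SCCS.v72", "WNAI.v135")

# Build a full name from dataset name and variable name.
full <- paste0("SCCS", ".", "v207")

# Split each name at the first dot into dataset prefix and variable name.
prefix <- sub("\\..*$", "", vnames)
vv <- sub("^[^.]*\\.", "", vnames)

# Keep only the SCCS variables.
sccs_names <- vnames[prefix == "SCCS"]
print(sccs_names)
```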
Examples

dim(xcd)

Selected output

[1] "--make eem (ecological similarity weight matrix)----"
[1] "--make ddm (geographic proximity weight matrix)----"
[1] "--format llm (linguistic proximity weight matrix)----"
[1] "--assembling data to be used in imputation--"
[1] "Please check the codebook to see if 'EA.v11' is ordinal"
[2] "Please check the codebook to see if 'LRB.systate3' is ordinal"
[3] "Please check the codebook to see if 'LRB.fres1' is ordinal"
[4] "Please check the codebook to see if 'WNAI.v289' is ordinal"
[5] "Please check the codebook to see if 'WNAI.v308' is ordinal"
[6] "Please check the codebook to see if 'SCCS.v69' is ordinal"
[1] "--making spatially lagged variables for use as covariates--"
[1] "--using cluster analysis to organize variables in similar groups--"
[1] "---Multiple imputation begins---"
[1] "--Cluster--1--"
       EA.v30  LRB.systate3   LRB.density     WNAI.v284     WNAI.v286     WNAI.v288
        "pmm"         "pmm"         "pmm"         "pmm"         "pmm"         "pmm"
    WNAI.v289     WNAI.v216     SCCS.v234     SCCS.v236      SCCS.v80    SCCS.v1130
        "pmm"         "pmm"         "pmm"         "pmm"         "pmm"         "pmm"
   dSCCS.v236       dEA.v30         bio.8  lLRB.density         bio.7    eSCCS.v236
           ""            ""            ""            ""            ""            ""
      lEA.v30 dLRB.systate3
           ""            ""
  it im co dep      meth       out
1  0  0  0     collinear SCCS.v234
[1] "--Cluster--2--"
       EA.v78        EA.v11         EA.v1         EA.v2   LRB.hunting  LRB.gatherin
        "pmm"         "pmm"         "pmm"         "pmm"         "pmm"         "pmm"
 LRB.huntfil2     LRB.reven     LRB.nomov    LRB.dismov   LRB.subdiv2     WNAI.v285
        "pmm"         "pmm"         "pmm"         "pmm"         "pmm"         "pmm"
    WNAI.v204     WNAI.v211    SCCS.v1685      SCCS.v72   SCCS.v899d1 SCCS.valchild
        "pmm"         "pmm"         "pmm"         "pmm"         "pmm"         "pmm"
   SCCS.v1260      SCCS.v79     SCCS.v872     SCCS.v871      SCCS.v69         bio.3
        "pmm"         "pmm"         "pmm"         "pmm"         "pmm"            ""
       bio.14        bio.16        bio.10        bio.15         bio.2        bio.19
           ""            ""            ""            ""            ""            ""
       bio.11
           ""
[1] "--Cluster--3--"
     EA.v31       EA.v4       EA.v5   WNAI.v193   SCCS.v238  SCCS.v1122      dEA.v4     dEA.v31
      "pmm"       "pmm"       "pmm"       "pmm"       "pmm"       "pmm"          ""          ""
    lEA.v31      lEA.v4      eEA.v4 eSCCS.v1122  dSCCS.v238     eEA.v31
         ""          ""          ""          ""          ""          ""
[1] "--Cluster--4--"
      EA.v3  LRB.group2 LRB.fishing    LRB.war1   LRB.store   LRB.fres1   LRB.tlpop   WNAI.v135
      "pmm"       "pmm"       "pmm"       "pmm"       "pmm"       "pmm"       "pmm"       "pmm"
  WNAI.v197   WNAI.v308  SCCS.v1648    SCCS.v81    SCCS.v21       bio.2      bio.17      bio.14
      "pmm"       "pmm"       "pmm"       "pmm"       "pmm"          ""          ""          ""
     bio.12      bio.19      bio.10       mnnpp       bio.1
         ""          ""          ""          ""          ""
[1] "--checking for perfect multicollinearity among variables--"
[1] "--creating squared terms--"
Time difference of 8.670014 mins

$DependVarb
[1] "Dependent variable='LRB.lngroup2': Natural log of LRB.group2"

$URmodel
            desc
(Intercept) <NA>
Wy          Network lag term
SCCS.v1260  Total Pathogen Stress
SCCS.v21    Food Surplus via Storage
WNAI.v135   Quantity of fish available in tribal territory: Average annual production in pounds per square mile of territory
LRB.revensq Unevenness in rainfall across seasons; (Equation: 4.04); (Binford 2001:70)--squared
LRB.subdiv2 Subsistence diversity; (Equation: 100-stddev("hunting","gatherin","fishing") ); (Binford 2001:403,fn2)
LRB.war1sq  Scale of intensity of warfare. How frequent and how widespread it may be regionally.--squared
                    coef stdcoef pvalue star   VIF stepkept
(Intercept)  0.787672173      NA  0.193         NA        5
Wy           0.882333522   0.462  0.000  *** 2.508        5
SCCS.v1260  -0.005695942  -0.029  0.625      1.372        0
SCCS.v21     0.035928589   0.035  0.506      1.096        2
WNAI.v135   -0.011789286  -0.039  0.532      1.521        0
LRB.revensq  0.016588887   0.084  0.057    * 1.143        5
LRB.subdiv2 -0.007208441  -0.109  0.012   ** 1.122        5
LRB.war1sq   0.038152870   0.261  0.000  *** 1.658        5

$Rmodel
            desc
(Intercept) <NA>
Wy          Network lag term
SCCS.v21    Food Surplus via Storage
WNAI.v135   Quantity of fish available in tribal territory: Average annual production in pounds per square mile of territory
LRB.revensq Unevenness in rainfall across seasons; (Equation: 4.04); (Binford 2001:70)--squared
LRB.subdiv2 Subsistence diversity; (Equation: 100-stddev("hunting","gatherin","fishing") ); (Binford 2001:403,fn2)
LRB.war1sq  Scale of intensity of warfare. How frequent and how widespread it may be regionally.--squared
                    coef stdcoef pvalue star   VIF relimp
(Intercept)  0.632860743      NA  0.221         NA     NA
Wy           0.907502623   0.475  0.000  *** 2.178  0.232
SCCS.v21     0.036305461   0.036  0.503      1.086  0.015
WNAI.v135   -0.010477738  -0.034  0.562      1.491  0.032
LRB.revensq  0.016393294   0.083  0.056    * 1.132  0.004
LRB.subdiv2 -0.007278972  -0.110  0.010   ** 1.099  0.030
LRB.war1sq   0.037348487   0.255  0.000  *** 1.612  0.144

$RmodelRobust
            desc
(Intercept) <NA>
Wy          Network lag term
SCCS.v21    Food Surplus via Storage
WNAI.v135   Quantity of fish available in tribal territory: Average annual production in pounds per square mile of territory
LRB.revensq Unevenness in rainfall across seasons; (Equation: 4.04); (Binford 2001:70)--squared
LRB.subdiv2 Subsistence diversity; (Equation: 100-stddev("hunting","gatherin","fishing") ); (Binford 2001:403,fn2)
LRB.war1sq  Scale of intensity of warfare. How frequent and how widespread it may be regionally.--squared
                    coef stdcoef pvalue star   VIF relimp
(Intercept)  0.632860743      NA  0.229         NA     NA
Wy           0.907502623   0.475  0.000  *** 2.178  0.232
SCCS.v21     0.036305461   0.036  0.499      1.086  0.015
WNAI.v135   -0.010477738  -0.034  0.546      1.491  0.032
LRB.revensq  0.016393294   0.083  0.013   ** 1.132  0.004
LRB.subdiv2 -0.007278972  -0.110  0.013   ** 1.099  0.030
LRB.war1sq   0.037348487   0.255  0.000  *** 1.612  0.144

$Diagnostics
                                                   Fstat        df pvalue star
RESET test. H0: model has correct functional form -0.079    28.074  1.000
Wald test. H0: appropriate variables dropped       0.510   119.340  0.477
Breusch-Pagan test. H0: residuals homoskedastic    9.014    62.835  0.004  ***
Shapiro-Wilkes test. H0: residuals normal         17.492  9933.851  0.000  ***
Hausman test. H0: Wy exogenous                     0.014    37.000  0.906
Hausman test. H0: SCCS.v21 exogenous               0.088    10.000  0.773
Hausman test. H0: WNAI.v135 exogenous              0.010    25.000  0.923
Hausman test. H0: LRB.revensq exogenous            0.017 15398.000  0.898
Hausman test. H0: LRB.subdiv2 exogenous            0.896  2295.000  0.344
Hausman test. H0: LRB.war1sq exogenous             0.000   253.000  0.996

$OtherStats
    d   l R2.IV.composite. R2.final.model R2.UR.model R2.final.ln.adj. nimp nobs
1 0.9 0.1        0.6984903      0.5213928   0.5220529        0.4580212    5  297
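The d and l values reported in $OtherStats appear to be the weights placed on the geographic and linguistic matrices in the composite weight matrix. The sketch below shows how such a weighted sum and a network lag term “Wy” could be formed. The matrices ddm_toy and llm_toy, the helper row_normalize, and the variable y are made up for illustration; the sketch omits the search over weights and the instrumenting of the lag term, neither of which is spelled out in this manual.

```r
# Hypothetical helper: zero the diagonal and scale rows to sum to one.
row_normalize <- function(m) {
  diag(m) <- 0
  m / rowSums(m)
}

set.seed(42)
n <- 5
ddm_toy <- row_normalize(matrix(runif(n * n), n, n))  # stand-in for the geographic matrix
llm_toy <- row_normalize(matrix(runif(n * n), n, n))  # stand-in for the linguistic matrix

# Composite weight matrix as a weighted sum, using the weights from the output above.
d <- 0.9
l <- 0.1
W <- d * ddm_toy + l * llm_toy

# Network lag term: for each society, a weighted average of the other societies' y.
y <- rnorm(n)
Wy <- as.vector(W %*% y)
print(Wy)
```

Because both component matrices are row-normalized and the weights sum to one, every row of the composite matrix also sums to one.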