Supplementary Material 1. Species selection, environmental variables and modeling procedures

Article title: Land use and land cover effects on regional biodiversity distribution in a subtropical dry forest: a hierarchical integrative multi-taxa study

Authors: Ricardo Torres1, N. Ignacio Gasparri2, Pedro G. Blendinger2 and H. Ricardo Grau2

1 Cátedra de Diversidad Animal II, y Museo de Zoología, Facultad de Ciencias Exactas, Físicas y Naturales, Universidad Nacional de Córdoba (UNC), Av. Vélez Sarsfield 299, 5000 Córdoba, Argentina. E-mail: rtorres44@gmail.com
2 Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) - Instituto de Ecología Regional - Laboratorio de Investigaciones Ecológicas de Las Yungas (IER-LIEY). Facultad de Ciencias Naturales e Instituto Miguel Lillo, Universidad Nacional de Tucumán (UNT), CC 34, 4107 Yerba Buena, Tucumán, Argentina

Corresponding author: Ricardo Torres, Cátedra de Diversidad Animal II, and Museo de Zoología, Facultad de Ciencias Exactas, Físicas y Naturales, Universidad Nacional de Córdoba (UNC), Av. Vélez Sarsfield 299 (5000) Córdoba, Argentina. E-mail: rtorres44@gmail.com

Species selected

Given that the performance of distribution models can differ among species according to their autoecological characteristics (Hernandez et al. 2006), species selection aimed to include the widest possible range of ecological traits, on the basis of four criteria. First, range of distribution: we selected a balanced number of widely distributed species (generalists with continent-wide distributions) and regionally restricted species (those whose distributional area is limited mainly to the Chaco, including several endemics). Second, based on the literature (e.g. Short 1975; Cei 1980; Redford and Eisenberg 1992; Prado 1993; and references therein), specialists’ opinion and our own knowledge, we selected species representative of the major habitats in the study area (forests, grasslands and wetlands).
As forests currently dominate the study area, the species list was biased toward forest dwellers. Third, for birds and mammals we took diet into account (including herbivores, granivores, insectivores, carnivores and scavengers) to ensure good representation of different functional groups. Finally, we considered both terrestrial and arboreal species of the forest-dwelling fauna.

Occurrence records for all species were obtained from museum collections and localities cited in the bibliography (see Supplementary Material 2 for the sources). Additionally, field surveys of trees, birds and mammals were carried out during the winter of 2009 at 40 sites in the NADC, with locations selected at random within areas with scarce or absent previous records of the species considered. Given that models in the ACE were fitted only with biophysical variables, we used all occurrence records from 1950 to the present, matching the period covered by the bioclimatic layers of the WorldClim database that we used (Hijmans et al. 2005). In the NADC, however, we modeled only with records from 1970 to the present, as the main LULC changes in the NADC occurred after that date. Therefore, an initial list of 241 species was filtered down by eliminating species with fewer than five records after 1970 in the NADC [the minimum number of occurrence localities that can provide reliable models with MAXENT (Pearson et al. 2007)], leaving 138 species that were modeled for the ACE (Supplementary Material 3). For each species, we excluded occurrence records from museum databases that lacked an exact geographic location. We therefore assumed that the error associated with the varying accuracy of the occurrence localities is smaller than the resolution used for modeling in the NADC (0.5 arc-minutes; see below).
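The record-filtering rule above (records from 1970 onward, at least five localities per species, no records without exact coordinates) can be sketched as follows. This is only an illustration under stated assumptions: the record layout (dictionaries with the keys "species", "lon", "lat" and "year"), the function name and the choice to count unique coordinate pairs as localities are all hypothetical, and the records are assumed to be already clipped to the NADC.

```python
from collections import defaultdict

MIN_RECORDS = 5    # fewest localities accepted per species (Pearson et al. 2007)
START_YEAR = 1970  # NADC models use records from 1970 onward


def select_species(records, start_year=START_YEAR, min_records=MIN_RECORDS):
    """Return {species: sorted unique (lon, lat) localities} for species that
    keep at least `min_records` georeferenced localities from `start_year` on.

    `records` is an iterable of dicts with the hypothetical keys
    "species", "lon", "lat" and "year", assumed to be already clipped
    to the study area.
    """
    localities = defaultdict(set)
    for rec in records:
        # Drop records without exact coordinates or without a year.
        if rec.get("lon") is None or rec.get("lat") is None or rec.get("year") is None:
            continue
        if rec["year"] >= start_year:
            # Assumption: identical coordinates count as one locality.
            localities[rec["species"]].add((rec["lon"], rec["lat"]))
    return {sp: sorted(pts) for sp, pts in localities.items() if len(pts) >= min_records}


# Toy data: sp_a has five post-1970 localities, sp_b only one pre-1970 record.
toy = [{"species": "sp_a", "lon": -62.0 - 0.1 * i, "lat": -25.0, "year": 1980 + i}
       for i in range(5)]
toy.append({"species": "sp_b", "lon": -61.0, "lat": -24.0, "year": 1960})
print(sorted(select_species(toy)))  # ['sp_a']
```

Calling the same function with `start_year=1950` would mimic the wider time window used for the ACE models.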
The number of occurrence points used per species for modeling in the ACE ranged from 11 to 107 for trees, 12 to 84 for amphibians, 13 to 97 for birds and 14 to 69 for mammals; in the NADC it ranged from 5 to 82 for trees, 5 to 39 for amphibians, 5 to 38 for birds and 6 to 36 for mammals (Supplementary Material 3).

Species distribution modeling

Species distributions were modeled using MAXENT v3.3, a software package that implements a maximum entropy algorithm to generate a probability distribution over pixels in a grid of the modeling area (Phillips et al. 2006; Elith et al. 2011). The maximum entropy algorithm has been shown to be robust for modeling presence-only occurrence data, even with very low numbers of occurrence records, outperforming many other techniques (Elith et al. 2006). MAXENT can also handle both continuous and categorical variables (Phillips and Dudík 2008). As recommended by these authors, we kept the MAXENT parameters at their default values. Following Pearson et al. (2004) and Anadón et al. (2007), a hierarchical framework was adopted for model building, first fitting models for the Argentine Chaco Ecoregion (ACE) and then for the Northern Argentine Dry Chaco (NADC).

Models based on biophysical variables

Models for the ACE were fitted using the 19 bioclimatic variables from the WorldClim database (Hijmans et al. 2005), at a resolution of 30 arc-seconds. A layer of distances to water bodies was developed from the Shuttle Radar Topography Mission (SRTM) Water Body Data (http://dds.cr.usgs.gov/srtm/version2_1/SWBD/) for inclusion in the modeling. An elevation layer from the 30 arc-second SRTM elevation model (http://srtm.usgs.gov/_) was also used, and a slope layer was derived from it. Soil variables extracted from the Atlas de Suelos de la República Argentina (INTA 1995) were added to the models of tree species, burrowing amphibians, burrowing mammals, and mammals that feed on ground-dwelling social ants and termites.
All variables were interpolated to a resolution of two arc-minutes (ca. 3.5 km). Layers were prepared and formatted using IDRISI v15 Andes (Eastman 2006).

Occurrence points were unevenly clustered in space. Because this pattern may influence model predictions, a ‘bias grid’ was created as the inverse of the Euclidean distance to all points; MAXENT allows this grid to be included in model fitting to control the effect of spatial bias, giving less weight to the pixels closest to each presence point (Elith et al. 2011).

For each of the 138 species, we obtained an initial set of ten models in the ACE, randomly selecting 75% of the occurrence localities for training at each run and leaving the remaining 25% for testing. This initial set of models was used to identify variables with minimal or no contribution, using the MAXENT jackknife test of variable importance to evaluate the relative strength of each predictor variable (Yost et al. 2008). The test assesses the behavior of the ‘gain’, an internal measure of goodness of fit defined as the average log probability of the presence samples, minus a constant that makes the uniform distribution have zero gain (Phillips 2010). MAXENT first calculates the average gain for the training localities, and then the drop in that average training gain when each variable is omitted from the full model. Variables whose omission did not decrease the average training gain by more than 0.01 were removed. Covariation between the remaining variables was tested with the Spearman coefficient (rs), considering only the cells with presence data. Only pairs of variables with rs > 0.70 were considered significantly correlated.
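The two screening rules (a drop in average training gain above 0.01, and a Spearman rs above 0.70 between retained variables) can be illustrated with a short sketch. It is not the procedure as actually run: the gain drops are assumed to have been read from the MAXENT jackknife output beforehand, the function names and input layout are hypothetical, and the variable values are toy numbers. For each correlated pair the sketch also reports the member with the smaller gain drop, which is the one the text removes.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rs, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:  # a constant variable has no defined correlation
        return 0.0
    return cov / (sx * sy)


def screen_variables(gain_drop, presence_values, min_drop=0.01, max_rs=0.70):
    """gain_drop: {variable: drop in average training gain when omitted}.
    presence_values: {variable: values at the presence cells, same cell order}.
    Returns (kept, correlated); each correlated entry is
    (var_a, var_b, rs, weaker), where `weaker` has the smaller gain drop.
    """
    kept = [v for v, drop in gain_drop.items() if drop > min_drop]
    correlated = []
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            rs = spearman(presence_values[a], presence_values[b])
            if rs > max_rs:  # as stated in the text; abs(rs) would also flag negative pairs
                weaker = a if gain_drop[a] < gain_drop[b] else b
                correlated.append((a, b, rs, weaker))
    return kept, correlated


# Toy example with made-up numbers.
drops = {"bio1": 0.20, "bio12": 0.05, "slope": 0.004, "elev": 0.03}
values = {"bio1": [1, 2, 3, 4, 5], "bio12": [2, 4, 5, 9, 10],
          "slope": [0, 1, 0, 1, 0], "elev": [5, 1, 4, 2, 3]}
kept, pairs = screen_variables(drops, values)
print(kept)                   # ['bio1', 'bio12', 'elev']
print([p[3] for p in pairs])  # ['bio12']
```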
The average training gain values of correlated variables were examined once again; of the correlated variables, those showing the lowest decrease in gain when omitted from the full model were also removed.

We performed 100 replicates of the maximum entropy model with the reduced set of variables, again randomly selecting 75% of occurrences for training and 25% for testing at each run. One hundred replicates were sufficient for this procedure since, with 10,000 pseudo-absence points (as we used in MAXENT), distribution models can reach stability with ten or even fewer replicates (Barbet-Massin et al. 2012). The values of the area under the curve (AUC) of the receiver operating characteristic (ROC) plot (Fielding and Bell 1997; Manel et al. 2001; McPherson et al. 2004) for test points were examined, and the ten models with the greatest AUC values were averaged to obtain the final model. Any variable identified as not relevant at this point was removed, and a new run of 100 replicates was performed following the procedure described above. This process was repeated until no non-relevant variable remained in the model (i.e. the decrease in training gain when any of the remaining variables was omitted from the full model was greater than 0.01).

Generating a binary prediction of occurrence requires choosing a threshold. We selected the ‘maximum test sensitivity plus specificity logistic threshold’ provided by MAXENT. That threshold, although somewhat restrictive, generally offered a spatial representation more consistent with the distribution of presence points in the study area (Torres and Jayat 2010).

We excluded from further analysis those species whose final ACE models had AUC values below 0.65, or fewer than two variables causing a decrease in training gain greater than 0.01 when omitted from the full model. The thirty species (out of 138) whose models did not meet these requirements were not modeled in the NADC.
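The evaluation and exclusion rules above can be illustrated with a minimal sketch. MAXENT reports both the test AUC and the threshold itself, so nothing here reproduces its internals: the sketch assumes that logistic scores at test presence points and at background points are available, computes specificity from the background points in place of true absences, and uses hypothetical function names and toy numbers.

```python
def auc(presence_scores, background_scores):
    """Probability that a random presence point outscores a random
    background point (ties count one half)."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


def max_sens_plus_spec_threshold(presence_scores, background_scores):
    """Score cut-off that maximizes sensitivity + specificity.
    Cells with score >= threshold are predicted as presence; if several
    cut-offs tie, the lowest one is returned."""
    best_t, best_sum = None, -1.0
    for t in sorted(set(presence_scores) | set(background_scores)):
        sens = sum(p >= t for p in presence_scores) / len(presence_scores)
        spec = sum(b < t for b in background_scores) / len(background_scores)
        if sens + spec > best_sum:
            best_t, best_sum = t, sens + spec
    return best_t


def keep_for_nadc(test_auc, gain_drop, min_auc=0.65, min_drop=0.01, min_vars=2):
    """Exclusion rule from the text: keep a species only if its final ACE
    model has AUC >= 0.65 and at least two variables whose omission lowers
    the training gain by more than 0.01."""
    n_relevant = sum(drop > min_drop for drop in gain_drop.values())
    return test_auc >= min_auc and n_relevant >= min_vars


# Toy scores (made-up values).
presence = [0.9, 0.8, 0.6, 0.4]
background = [0.7, 0.5, 0.3, 0.2, 0.1]
print(auc(presence, background))                           # 0.85
print(max_sens_plus_spec_threshold(presence, background))  # 0.4
print(keep_for_nadc(0.85, {"bio1": 0.20, "bio12": 0.05}))  # True
```

In the toy case the cut-off of 0.4 keeps every presence point (sensitivity 1.0) while excluding three of the five background points (specificity 0.6), which is the largest sum attainable with these scores.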
Although no published compilation exists of the actual number of species in the NADC, the final number of bird and mammal species modeled here is a small proportion of the total species pool in the study area. However, the species modeled were balanced in terms of abundance and functional characteristics (see Supplementary Material 3), except for the bias toward forest-dwelling species mentioned above. Therefore, we considered the sample of modeled species to be generally representative of the total pool of species in the NADC.

Models based on biophysical and LULC variables

We modeled within the NADC each of the 108 species (18 trees, 25 amphibians, 48 birds and 17 mammals) that met the criteria for the analysis of changes in richness patterns, at a resolution of 30 arc-seconds (approximately 880 m in the NADC). The climatic and topographic variables used here were those relevant at the scale of the ACE, plus four layers related to land use and land cover: woody biomass, density of puestos, vegetation classes and distance to crops.

a) Woody biomass was used as a proxy of human intervention on forest structure. It was estimated from field data and the Normalized Difference Vegetation Index (NDVI) from the Moderate Resolution Imaging Spectroradiometer (MODIS) on the Terra satellite [see Gasparri and Baldi (2013) for a detailed explanation of the methodology]. In brief, to generate a regional map of woody biomass, diameter at breast height data from 50 field forest samples were related to dry-season MODIS-Terra spectral data (NDVI) using the Random Forest algorithm (Breiman 2001). The resulting model showed an acceptable deviation between mean predicted and mean observed values (below 3%) and an average deviation for a single prediction of 15% (Gasparri and Baldi 2013).

b) A previous map of puestos made by interpretation of Landsat images (Grau et al. 2008) was edited and extended to cover the complete NADC, following the same methods with the additional use of the GOOGLE EARTH v5.0 software for interpretation of satellite images. A layer of puestos density was then derived, in which the value of each cell represents the number of puestos within a five-cell (4.4 km) radius of the centre of that cell.

c) Vegetation classes comprise four vegetation categories (crops, woodlands, flooded and riparian vegetation, and grasslands/bare soil) plus one of permanent water bodies (Fig. 1). Crops were mapped by visual interpretation of Landsat images following the standard methods of the Argentine forest monitoring system (UMSEF 2012) previously used in the region (Grau et al. 2005; Gasparri and Grau 2009; Gasparri et al. 2013). To identify the other four classes, a digital classification of the green, red, infrared and near infrared bands (i.e. bands 1 to 4) was applied to a multitemporal set of MODIS images with a spatial resolution of 250 m. Three different dates (03/06/2007; 06/12/2007 and 10/16/2007) were used to capture different stages of vegetation phenology (Gasparri and Baldi 2013). The classification was performed with the Random Forest algorithm; an independent evaluation of the resulting maps showed a precision >85%. Random Forest analyses were performed in R (R Development Core Team 2008) using the randomForest package (Liaw and Wiener 2002). Land cover and above-ground biomass mapping were performed in R using the yaImpute (Crookston and Finley 2008) and sp (R Development Core Team 2008) packages. The vegetation layer was smoothed by applying a moving window so that each cell expresses the dominant vegetation class within a 1 km radius of the centre of that cell.
Except for deforestation in favor of agricultural areas, we assumed that the main vegetation disturbances (such as overgrazing and charcoal and firewood harvesting) had reached their current levels before the 1970s in the NADC, with little or no change in the vegetation classes (or in the spatial patterns of woody biomass or density of puestos) since then.

d) Distance to crops was defined as the average distance to cultivated areas identified in Landsat images of the years 1972, 1990, 1998, 2002 and 2007. It was included because crops have expanded since 1970 and, since the vegetation layer was built from images of a single recent year, older species occurrence records may be erroneously assigned to crops when in fact they were recorded in a different vegetation class. In this way, any presence locality recorded in forests (or any other natural vegetation) but falling today in a cell representing crops was related to an average value of distance to crops rather than assigned to a categorical classification of ‘natural vegetation’ vs ‘crops’.

Models in the NADC followed the same procedures as in the ACE, except for species with very few (five to ten) presence records. Such species were modeled with MAXENT and the pVALUECOMPUTE software, following the jackknife (n-1) validation procedure detailed in Pearson et al. (2007).

References

Anadón JD, Giménez A, Martínez M, Palazón JA, Esteve MA (2007) Assessing changes in habitat quality due to land use changes in the spur-thighed tortoise Testudo graeca using hierarchical predictive habitat models. Divers Distrib 13:324–331
Barbet-Massin M, Jiguet F, Albert CH, Thuiller W (2012) Selecting pseudo-absences for species distribution models: how, where and how many? Methods Ecol Evol 3:327–338
Breiman L (2001) Random forests. Mach Learn 45:5–32
Cei JM (1980) Amphibians of Argentina. Monit Zool Ital Monograph 2
Crookston NL, Finley AO (2008) yaImpute: an R package for kNN imputation. J Stat Softw 23:1–16
Eastman JR (2006) IDRISI Andes Tutorial. Clark Labs, Clark University, Worcester, MA
Elith J, Graham CH, Anderson RP, Dudík M, Ferrier S, Guisan A et al (2006) Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29:129–151
Elith J, Phillips SJ, Hastie T, Dudík M, Chee YE, Yates CJ (2011) A statistical explanation of MaxEnt for ecologists. Divers Distrib 17:43–57
Fielding AH, Bell JF (1997) A review of methods for the assessment of prediction errors in conservation presence/absence models. Environ Conserv 24:38–49
Gasparri NI, Baldi G (2013) Regional patterns and controls of biomass in semiarid woodlands: lessons from the Northern Argentina Dry Chaco. Reg Environ Change DOI: 10.1007/s10113-013-0422-x
Gasparri NI, Grau HR (2009) Deforestation and fragmentation of Chaco dry forest in NW Argentina. Forest Ecol Manag 258:913–921
Gasparri NI, Grau HR, Gutiérrez Angonese J (2013) Linkages between soybean and neotropical deforestation: Coupling and transient decoupling dynamics in a multi-decadal analysis. Global Environ Chang 23:1605–1614
Grau HR, Gasparri NI, Aide TM (2005) Agriculture expansion and deforestation in seasonally dry forests of north-west Argentina. Environ Conserv 32:140–148
Grau HR, Gasparri NI, Aide TM (2008) Balancing food production and nature conservation in the Neotropical dry forests of northern Argentina. Glob Change Biol 14:985–997
Hernandez PA, Graham CH, Master LL, Albert DL (2006) The effect of sample size and species characteristics on performance of different species distribution modeling methods. Ecography 29:773–785
Hijmans RJ, Cameron SE, Parra JL, Jones PG, Jarvis A (2005) Very high resolution interpolated climate surfaces for global land areas. Int J Climatol 25:1965–1978
INTA (1995) Atlas de Suelos de la República Argentina. CDROM. Instituto Nacional de Tecnología Agropecuaria, Buenos Aires, Argentina
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2:18–22. http://CRAN.R-project.org/
Manel S, Williams HC, Ormerod SJ (2001) Evaluating presence-absence models in ecology: the need to account for prevalence. J Appl Ecol 38:921–931
McPherson JM, Jetz W, Rogers DJ (2004) The effects of species’ range sizes on the accuracy of distribution models: ecological phenomenon or statistical artefact? J Appl Ecol 41:811–823
Pearson RG, Dawson TP, Liu C (2004) Modeling species distribution in Britain: a hierarchical integration of climate and land-cover data. Ecography 27:285–289
Pearson RG, Raxworthy CJ, Nakamura M, Peterson AT (2007) Predicting species distributions from small numbers of occurrence records: a test case using cryptic geckos in Madagascar. J Biogeogr 34:102–117
Phillips SJ (2010) A brief tutorial on Maxent. Lessons in Conservation 3:107–135
Phillips SJ, Dudík M (2008) Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation. Ecography 31:161–175
Phillips SJ, Anderson RP, Schapire RE (2006) Maximum entropy modeling of species geographic distributions. Ecol Model 190:231–259
Prado D (1993) What is the Gran Chaco vegetation in South America? I. A review. Contribution to the study of the flora and vegetation of the Chaco. V. Candollea 48:145–172
R Development Core Team (2008) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://CRAN.R-project.org/
Redford KH, Eisenberg JF (1992) Mammals of the Neotropics. Vol. 2. The Southern Cone. Chicago University Press, Chicago
Short L (1975) A zoogeographic analysis of the South American Chaco avifauna. B Am Mus Nat Hist 154:163–352
Torres R, Jayat JP (2010) Modelos predictivos de distribución para cuatro especies de mamíferos (Cingulata, Artiodactyla y Rodentia) típicas del Chaco en Argentina. Mastozool Neotrop 17:335–352
Yost AC, Petersen SL, Gregg M, Miller R (2008) Predictive modeling and mapping Sage Grouse (Centrocercus urophasianus) nesting habitat using Maximum Entropy and a long-term dataset from Southern Oregon. Ecol Inform 3:375–386