Appendix S1 Model performance

1) Additional details of data collection and analysis

Exclusion of species pairs with weak genetic divergence: Sister pairs were included if cytochrome b sequences (at least 500 base pairs in length) were available for both members of the pair, and if GTR-Γ distances (calculated in PAUP 4.0b10; Swofford 2002) exceeded 0.75 percent divergence. Very young sisters, whose ages are estimated with error (due to variability in coalescent times and the stochastic nature of sequence evolution), can have large impacts on parameter values estimated from the character evolution models that we use. We therefore excluded sisters differing by less than 0.75 percent divergence in order to eliminate the most problematically dated pairs. This cut-off corresponds to the 350,000 year average difference between cytochrome b coalescence and actual population splitting (Moore 1995). Sisters with less than 0.75 percent divergence are unlikely to represent phylogenetically distinct (e.g. reciprocally monophyletic) taxa. This cut-off also eliminates most sisters whose young estimated ages of divergence reflect recent mtDNA introgression across species boundaries.

Phylogeny generation: To determine the approximate age of each sister pair, a phylogeny was created in BEAST version 1.7.5 (Drummond et al. 2012) from cytochrome b sequence data using a lognormal relaxed clock (Yule speciation prior) with a GTR-Γ model. The tree topology was fixed (using Barker et al. 2004 for interfamily relationships and a large number of published molecular phylogenies for relationships between species and genera within families), and BEAST was used to estimate branch lengths using a 2% clock (Weir & Schluter 2008). Bayesian analyses were run for 20 million generations and were sampled every 1000 generations following a burn-in of 10 million generations.
Median node heights were used to estimate species ages (see Appendix S2).

Details of climatic data sorting: Because seasonality may not be synchronous across geographic regions, the monthly values for each climatic measure were sorted from highest to lowest for each locality within a species range, and analyses were performed on the sorted values. For example, if June had the highest mean monthly precipitation at a given locality, then June would be placed in first rank for this measure. Sorting reduced overestimation of climatic divergence between sister pairs with asynchronous seasonality.

Climatic divergences: L2 in Equation 1 refers to the degree of climatic divergence for each sister pair (|X1i – X2i|) divided by its expected standard deviation under the best fit model (which was an OU model for PC1 to PC3):

Equation S1

$$L_{2i} = \frac{|X_{1i} - X_{2i}|}{\sqrt{\frac{\beta_i}{\alpha}\left(1 - \exp(-2\alpha T_i)\right)}},$$

where βi and Ti are the evolutionary rate and age of divergence, respectively, for sister pair i, and α is the evolutionary constraint parameter. β and α take the maximum likelihood estimates obtained from fitting the climatic data (Table 1). For PC2, the best fit model was OUnull, and all βi take the same value estimated from the model. For PC1 and PC3, the best fit model was OUβ-linear and βi is equal to:

Equation S2

$$\beta_i = b_\beta L_{1i} + c_\beta,$$

where bβ and cβ are the slope and intercept parameters describing how β changes as a linear function of the latitude of each sister pair (L1i).

2) Model bias

OU models applied to whole phylogenies (i.e. with a single trait optimum applied to all species) are reported to suffer bias in the estimation of the parameter α (Thomas et al. in press). Bias in α is detected by simulating data under a BM model (where α = 0), and then fitting an OU model to the simulated data. Bias occurs when the OU fit results in a non-zero estimate of α across a series of simulated datasets.
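
The bias check described above can be sketched in a few lines of Python. This is a minimal illustration and not the EvoRAG code used for the analyses. It assumes that the difference between sisters is normally distributed with mean zero and variance (β/α)(1 − exp(−2αT)), i.e. the squared denominator of Equation S1, which tends to 2βT as α approaches zero (the BM case). The ages, rate, number of simulations and the grid of α values are hypothetical, and the grid search stands in for a proper numerical optimizer.

```python
import math
import random
import statistics


def ou_variance_factor(alpha, age):
    """Variance of a sister-pair difference per unit beta.

    Equals (1 - exp(-2*alpha*T)) / alpha, the squared denominator of
    Equation S1 divided by beta; it tends to 2*T as alpha -> 0 (BM).
    """
    if alpha < 1e-12:
        return 2.0 * age
    return -math.expm1(-2.0 * alpha * age) / alpha


def fit_ou(diffs, ages, alpha_grid):
    """Grid-search MLE of (alpha, beta) for diffs ~ N(0, beta * factor)."""
    n = len(diffs)
    best = None
    for alpha in alpha_grid:
        g = [ou_variance_factor(alpha, t) for t in ages]
        # For a fixed alpha the MLE of beta has a closed form.
        beta = sum(d * d / gi for d, gi in zip(diffs, g)) / n
        # Profile log-likelihood with beta replaced by its MLE.
        loglik = -0.5 * (n * math.log(2.0 * math.pi * beta)
                         + sum(math.log(gi) for gi in g) + n)
        if best is None or loglik > best[0]:
            best = (loglik, alpha, beta)
    return best[1], best[2]


def median_alpha_under_bm(beta, ages, n_sims, alpha_grid, seed=1):
    """Simulate BM differences (variance 2*beta*T); return the median OU alpha."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sims):
        diffs = [rng.gauss(0.0, math.sqrt(2.0 * beta * t)) for t in ages]
        alpha_hat, _ = fit_ou(diffs, ages, alpha_grid)
        estimates.append(alpha_hat)
    return statistics.median(estimates)


# Hypothetical sister-pair ages (Myr), rate and search grid, for illustration only.
ages = [0.5 + 0.25 * k for k in range(40)]
alpha_grid = [0.01 * k for k in range(301)]
bias = median_alpha_under_bm(beta=1.0, ages=ages, n_sims=50, alpha_grid=alpha_grid)
print(f"median alpha estimated from BM data: {bias:.2f}")
```

A median close to zero would indicate little bias in α for the chosen ages and rate. The number of simulations here is kept small only so that the sketch runs quickly.
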
Here we estimate potential bias for our three PCs. We simulated 2000 climatic datasets under the BMnull model for each of our three climatic PCs. Data were simulated using the maximum likelihood estimates of evolutionary rate under the BMnull model for each PC, and each simulated dataset was then fit to the more complex models (OUnull and OUβ-linear). Median estimates of α under the OUnull and OUβ-linear models were very close to zero for all three PCs, suggesting almost no bias in α (Table S1).

Table S1 Estimates of bias in the alpha parameter of OU models.

Trait   Simulated model   Test model    Bias in alpha
PC1     BMnull            OUnull        0.00233
PC1     BMnull            OUβ-linear    0.00006
PC2     BMnull            OUnull        0.00019
PC2     BMnull            OUβ-linear    0.00000
PC3     BMnull            OUnull        0.00136
PC3     BMnull            OUβ-linear    0.00008

2) Model selection thresholds and type I error rates

The model with the lowest AICc (or AIC) is chosen as the best fit of the candidate models. However, less optimal models may have only slightly higher AICc values than the best fit model, and may need to be considered. A general “rule of thumb” is to reject all candidate models whose AICc exceeds that of the best fit model by more than 2 units (∆AICc > 2). However, this threshold has been reported to result in elevated Type I error rates in some contexts (e.g. Rabosky 2006), and an appropriate rejection threshold may depend on the models being compared, the sample size, the distribution of species ages in the dataset and other factors (see Rabosky 2006; Gavin et al. 2014).

We used simulation to calculate Type I error for BM and OU models applied to our climatic dataset, and used these simulations to calculate the ∆AICc threshold necessary to reject less parameter rich models (“null” models) in favor of more parameter rich models (“test” models). Our approach follows that of Rabosky (2006).
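
As a point of reference for the ∆AICc comparisons discussed in this section, the following Python sketch computes AICc, ∆AICc and Akaike weights using the standard formulas. It is an illustration only and not the code used for the analyses. The function names are hypothetical, and the example input is the set of PC1 ∆AICc values reported in Table S4.

```python
import math


def aicc(log_likelihood, n_params, n_obs):
    """Small-sample corrected AIC for a fitted model."""
    aic = 2.0 * n_params - 2.0 * log_likelihood
    return aic + (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)


def delta_and_weights(scores):
    """Return delta scores (score minus best score) and Akaike weights."""
    best = min(scores)
    deltas = [s - best for s in scores]
    rel = [math.exp(-0.5 * d) for d in deltas]  # relative likelihoods
    total = sum(rel)
    return deltas, [r / total for r in rel]


models = ["BMnull", "BMlinear", "OUnull", "OUbeta-linear"]
pc1_delta = [37.48, 19.58, 14.69, 0.00]  # PC1 delta AICc values from Table S4
deltas, weights = delta_and_weights(pc1_delta)
for name, d, w in zip(models, deltas, weights):
    # The cut-off of 2 is the generic rule of thumb, not the simulated threshold.
    flag = "retained" if d <= 2.0 else "rejected"
    print(f"{name:<14} dAICc={d:6.2f} w={w:.3f} {flag}")
```

Rounded to three decimals, the weights computed from these ∆AICc values agree with the wAICc column reported for PC1 in Table S4.
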
Type I error is the probability of rejecting a true “null” hypothesis in favor of a more parameter rich model. To calculate Type I error we used the MLE of parameters fit to the data under the null models (BMnull and OUnull), in which climatic rates do not vary with latitude, and simulated climatic Euclidean distances using the same number of sister pairs as in our actual dataset, with the same ages and latitudes as our actual data. Four thousand datasets were simulated in EvoRAG for each of the three climatic PCs. For each simulated dataset, we then calculated the likelihood of the null models and of the models in which climatic rates vary with latitude (BMlinear, OUβ-linear). The Type I error is the proportion of simulations for which AIC was lower (and thus favored) for the rate-variable models.

Type I errors are shown in Table S2 and never exceed 0.17 for the three climatic PCs. A Type I error rate less than 0.05 indicates that the “null” model can be rejected in favor of the “test” model whenever the ∆AIC between “null” and “test” is greater than 0. When Type I error rates exceed 0.05, a positive threshold for ∆AICc scores between the “null” and “test” models needs to be established in order to maintain a Type I error rate ≤ 0.05. The 95th percentile of the distribution of ∆AIC values between the null and alternative models, calculated across the datasets simulated under the null model, defines this threshold. In order to reject rate constancy across latitude in favor of a model in which rates vary with latitude, a ∆AICc value of 1.5 to 1.9 is required, depending on the PC (Table S2). Here we use the more conservative value of 1.9 for all PCs.

Table S2 Estimates of Type I error, and the threshold ∆AICc required to reject a null model of no effect of latitude on climatic rates with a Type I error rate ≤ 0.05.
Trait   Null model   Type I error   Threshold ∆AICc
PC1     BMnull       0.16           1.9
PC1     OUnull       0.12           1.5
PC2     BMnull       0.15           1.9
PC2     OUnull       0.13           1.5
PC3     BMnull       0.17           1.9
PC3     OUnull       0.14           1.7

3) Power analysis

At the request of reviewers we performed retrospective power analyses for our BM and OU models, but note that such analyses are controversial, and many statisticians deem them logically flawed when used to interpret non-significant results (e.g. Nakagawa & Foster 2004). To determine the retrospective statistical power under our hypothesis that rates of climatic evolution vary with latitude, we found the maximum likelihood parameter estimates of the best fit gradient model (BMlinear, OUβ-linear) for our data (separately for PC1, PC2 and PC3), simulated 1000 climatic datasets under those parameter estimates, and then fit the gradient models and null models (BMnull and OUnull) to each simulated dataset. Statistical power is the proportion of simulations that correctly rejected the null models in favor of the gradient models. We used a ∆AICc threshold of 1.9 (see above) as our rejection criterion in order to maintain a Type I error rate ≤ 0.05.

Results are shown in Table S3 and indicate very high retrospective statistical power for PC1 and PC3. For PC2, retrospective statistical power was very low. This is not surprising given that OUnull was the best fit for this PC. The retrospective statistical power for PC2 is not informative about whether OUnull was the best fit because it really is the true model, or because OUβ-linear simply lacked statistical power. However, visual inspection of the raw data (shown in Fig. 2) does not show any clear differences in how climatic divergence accumulates through time for tropical and temperate sister pairs, and we conclude that if there is a latitudinal effect that went undetected due to low power, then the effect was likely very weak.
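
Once the simulated datasets have been fit, the Type I error, the ∆AICc threshold and the retrospective power described above all reduce to simple summaries of the simulated ∆AIC values (AIC of the null model minus AIC of the test model). The Python sketch below illustrates these summaries under that assumption. The function names and the input values are hypothetical placeholders, not results from our simulations, and the nearest-rank percentile is only one of several common percentile definitions.

```python
def exceed_fraction(deltas, threshold=0.0):
    """Proportion of simulations whose delta AIC exceeds the threshold."""
    return sum(d > threshold for d in deltas) / len(deltas)


def percentile_nearest_rank(values, percent=95):
    """Nearest-rank percentile (integer percent) of a list of numbers."""
    ordered = sorted(values)
    rank = -(-percent * len(ordered) // 100)  # ceil(percent * n / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical delta AIC values (null minus test) from data simulated
# under the null model: 100 values from -7.9 to 2.0.
null_deltas = [i / 10 for i in range(-79, 21)]

# Type I error with no threshold: the test model is favored whenever delta > 0.
type1_error = exceed_fraction(null_deltas)

# Threshold that holds the Type I error at or below 0.05.
threshold = percentile_nearest_rank(null_deltas, 95)
type1_at_threshold = exceed_fraction(null_deltas, threshold)

# Hypothetical delta AIC values from data simulated under the test model.
test_deltas = [0.5, 1.0, 2.0, 3.0, 5.0]
power = exceed_fraction(test_deltas, threshold)

print(type1_error, threshold, type1_at_threshold, power)
```

The same threshold is used for both summaries, so that power is reported at the rejection criterion that holds the Type I error rate at or below 0.05.
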
Table S3 Retrospective statistical power for climatic PCs 1 to 3. The test model is the alternative model under which data are simulated. The “null” model is the less parameter rich model fit to the simulated data. The statistical power is calculated using a ∆AICc threshold value of 2.3 in order to reject the “null” model in favor of the “test” model.

Trait   Test models              Null models        Statistical power (critical ∆AICc = 2.3)
PC1     BMlinear & OUβ-linear    BMnull & OUnull    0.893
PC2     BMlinear & OUβ-linear    BMnull & OUnull    0.058
PC3     BMlinear & OUβ-linear    BMnull & OUnull    0.985

4) Analysis without Neotropical migrants included

Excluding Neotropical migrants (so that only resident species and species that migrate locally within the tropics or within the Nearctic are included) had little effect on model fits. Each PC continued to favor the same model as when Neotropical migrants were included (Table S4).

Table S4 Support for BM and OU models of climatic niche evolution when excluding Neotropical migrants. ΔAICc scores (AICc for each model – smallest AICc score) and Akaike weights (wAICc) are used as metrics of model support. The best-fit model has a ΔAICc value of 0. Akaike weights indicate the relative probability of each model within the candidate set. N indicates the number of parameters in each model. β slope describes how the evolutionary rate changes with latitude.

                  PC1                        PC2                        PC3
Model        N    ΔAICc   wAICc   β slope    ΔAICc   wAICc   β slope    ΔAICc   wAICc   β slope
BMnull       1    37.48   0.000   NA         19.25   0.000   NA         36.52   0.000   NA
BMlinear     2    19.58   0.000   0.173      18.04   0.000   0.054      10.18   0.006   0.025
OUnull       2    14.69   0.001   NA          0.00   0.706   NA         20.57   0.000   NA
OUβ-linear   4     0.00   0.999   3.798       1.75   0.294   0.084       0.00   0.994   0.082

Literature Cited

Barker, F.K., Cibois, A., Schikler, P.A., Feinstein, J. & Cracraft, J. (2004). Phylogeny and diversification of the largest avian radiation. Proc. Natl. Acad. Sci., 101, 11040–11045.
Drummond, A.J., Suchard, M.A., Xie, D. & Rambaut, A. (2012). Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol. Biol. Evol., 29, 1969–1973.
Moore, W.S. (1995). Inferring phylogenies from mtDNA variation: mitochondrial gene trees vs. nuclear gene trees. Evolution, 49, 718–726.
Nakagawa, S. & Foster, T.M. (2004). The case against retrospective statistical power analyses with an introduction to power.
Rabosky, D. (2006). Likelihood methods for detecting temporal shifts in diversification rates. Evolution, 60, 1152–1165.
Swofford, D.L. (2002). PAUP* 4.0b10: phylogenetic analysis using parsimony (*and other methods). Sunderland, MA: Sinauer Associates.
Thomas, G.H., Cooper, N., Venditti, C., Meade, A. & Freckleton, R.P. (In press). Bias and measurement error in comparative analyses: a case study with the Ornstein-Uhlenbeck model. bioRxiv http://dx.doi.org/10.1101/004036
Weir, J.T. & Schluter, D. (2008). Calibrating the avian molecular clock. Mol. Ecol., 17, 2321–2328.