Supplementary Figure S1. The hierarchical theory. (a) Metacommunity level; (b) Community level. Illustrations of the data generation showing three hypothetical species across three hypothetical communities within the study area (the US). At the metacommunity level (a), the data points are species (highlighted in red), and the relationships among their average body size, niche breadth, maximum population abundance across the study area, and distribution within the US are explored as outlined in Fig. 1a, c. Note that in (a) only communities 1 and 3 contain the largest population of a species. Although not all communities are represented in the metacommunity dataset, its geographic range is equivalent to that of the community dataset, described in (b). Furthermore, species’ ln-transformed maximum and average population abundances across all sites of detection are highly correlated (between 90 and 93% in the three studied groups), indicating that the metacommunity analyses are not influenced by the choice of a summary abundance metric. At the community level (b), the data points are properties of the community (highlighted in red), e.g. species richness as well as slopes b and d, which are derived from analyses similar to those at the metacommunity level but using species body size (measured at the site or averaged across all collections, depending on the dataset), local abundance, and distribution within the US. The two slopes are then treated as responses in the pathways depicted in Fig. 1b, d. Note that in (b) communities 1-3 are included.

Supplementary Figure S2. Diatoms. (a) Relationship between two ln-transformed measures of distribution, namely number of occurrences in 720 localities (x) and geographic range (y), calculated as the sum of the maximum latitudinal and maximum longitudinal span in km. Species with a single occurrence have a geographic range of zero. The relationship is fit with a second order formation function (TableCurve 2D 5.01, SYSTAT Software, Inc.
2002), indicating that half of the maximum geographic range is reached when a species has only 1.5 occurrences (x50 = 1/ab = 0.41; e^0.41 = 1.5, where a and b = regression parameters). In other words, geographic range increases at a much faster rate than occurrence, whereby species with only a few occurrences exhibit broad geographic spans (see also (b)). Therefore, species occurrence outperforms geographic range as a measure of distribution because species with broad geographic ranges can be present in only a handful of localities, contributing only marginally to the regional colonist pool. The regression model and parameters are given in the figure; p < 0.00001 for both parameters. The same pattern is observed in invertebrates (2078 species across 1866 localities, R² = 0.94, x50 = 1.6) and fish (561 species across 1105 localities, R² = 0.90, x50 = 1.8). (b) Distribution maps of the diatoms Achnanthidium minutissimum (Kützing) Czarnecki (top panel) and Stauroneis phoenicenteron (Nitzsch) Ehrenberg (bottom panel), showing the localities where these species are found, i.e. 593 localities for A. minutissimum and 12 localities for S. phoenicenteron. Maximum latitudinal and maximum longitudinal span (shown as arrows) are calculated as the range between the minimum and maximum latitude or longitude of detection, respectively, converted to Great Circle distances in km. The insets show micrographs of the two species (courtesy of Chad Larson). Note that although the two diatoms have continental ranges, their total occurrences are drastically different.

Supplementary Figure S3. Testing the metacommunity model. (a) Relationship of diatom niche breadth (NB) vs. proportion (P) of occurrences in streams with common conditions, fit with a Gaussian model, which is given in the figure (p < 0.00001 for all parameters).
In diatoms as well as invertebrates and fish (discussed below), NB is measured as the species’ root mean square standard deviation across the first four axes of canonical correspondence analysis of species and environmental data (the RMSTOL metric in CANOCO). The value of P (Pu) yielding maximum niche breadth (NBmax) equals parameter b, and the standard deviation (SD) about NBmax equals parameter c. In invertebrates (N = 1739 taxa), the Gaussian model generates the following regression statistics: R² = 0.46, a = 92.25, b = 0.54, c = 0.32 (p < 0.00001 for all parameters), while in fish (N = 488 species), the respective statistics are: R² = 0.36, a = 78.47, b = 0.56, c = 0.34 (p < 0.00001 for all parameters). (b)–(d) ANOVA least squares means (± standard error) compared with Tukey post-hoc tests, testing the hypothesis of no difference in occurrence among species of varying environmental preference, including preference for common conditions (c), rare conditions (r), and no preference (n). Species with no environmental preference are defined as having P within Pu ± 1 SD, species with a preference for rare conditions as having P < Pu − 1 SD, and species with a preference for common conditions as having P > Pu + 1 SD. In diatoms (b), invertebrates (c), and fish (d), n species exhibit the following ranges of P: 0.28-0.88, 0.22-0.87, and 0.23-0.90, respectively. These ranges include the proportion of common sites in each dataset, i.e. 0.70 in the 703 streams sampled for diatoms, 0.69 in the 636 streams sampled for invertebrates, and 0.71 in the 417 streams sampled for fish. Therefore, species without environmental preference are found in common conditions statistically as frequently as these conditions occur continentally. In all analyses, n species have significantly greater occurrence than c and r species (p = 0.000002).
Only in diatoms does the occurrence of c species significantly exceed that of r species (p = 0.00002), while in the remaining two groups they are statistically equivalent (p > 0.8). These results do not change statistically when distribution is measured as geographic range. There are 474 c, 568 n, and 185 r diatoms in (b); 540 c, 778 n, and 421 r invertebrates in (c); and 194 c, 218 n, and 76 r fish in (d). Different letters in each panel indicate significant differences in means (p < 0.05). Thus, in diatoms there is a tendency for species with a preference for common conditions to exceed in distribution those with a preference for rare conditions, while in invertebrates and fish this trend disappears.

Supplementary Figure S4. Invertebrates, testing the metacommunity model. Regressions of distribution (D), measured as ln number of occurrences in all 1866 streams, against (a) ln maximum population density (Nmax): ln D = –0.21 + 0.54 ln Nmax (R² = 0.61, p < 0.000001) and (b) niche breadth (NB) (the RMSTOL metric): ln D = 0.34 + 0.04 NB (R² = 0.60, p < 0.000001). Species maximum population density is represented by the maximum number of individuals per m² in 3719 samples from 1866 stream localities, while NB is derived from a subset of 636 streams with environmental data, shown in Fig. 2. When distribution is measured as geographic range, Nmax and NB produce the following models: ln D = 2.88 + 0.75 ln Nmax (R² = 0.36, p < 0.000001) and ln D = 2.45 + 0.64 NB^0.5 (R² = 0.63, p < 0.000001 for both parameters). Therefore, the positive patterns persist with both measures of distribution, but since occurrence is more sensitive, as shown in Suppl. Fig. S2, it is employed in structural equation modeling. (c) A structural equation model showing the paths (p < 0.05 for all) with corresponding standardized regression coefficients and, in parentheses, coefficients of non-determination (1 – R²) for each response variable.
A sample discrepancy function of –2.46e−16 indicates an excellent model fit. E1, E2 = error terms. Number of taxa = 1739.

Supplementary Figure S5. Fish, testing the metacommunity model. Regressions of ln distribution (D), calculated as the number of occurrences in all 1105 streams, against (a) ln maximum relative abundance (Nmax): ln D = 3.88 + 0.68 ln Nmax (R² = 0.52, p < 0.000001); (b) ln body weight (M): ln D = 1.35 + 0.86 ln M – 0.11 (ln M)² (R² = 0.15, p < 0.00001 for all parameters); and (c) niche breadth (NB) (the RMSTOL metric in CANOCO): ln D = 0.55 + 0.04 NB (R² = 0.63, p < 0.000001). When distribution is measured as geographic range, maximum relative abundance, body weight, and NB produce the following responses: ln D = 8.36 + 1.01 ln Nmax (R² = 0.39, p < 0.000001), ln D = 4.51 + 1.32 ln M – 0.16 (ln M)² (R² = 0.12, p < 0.00001 for all parameters), and ln D = 2.50 + 0.61 NB^0.5 (R² = 0.63, p < 0.00001). Therefore, as in diatoms and invertebrates, the two measures of distribution display very similar behaviours, but further analyses are performed with occurrence due to its greater discriminating capacity (see Suppl. Fig. S2) and its linear response to NB. Maximum relative abundance and body weight for each species are measured as the maximum proportional abundance and the average body weight, respectively, observed in 2383 samples from 1105 stream localities. Although arcsine square root transformation is recommended for proportional data, ln-transformation of Nmax improves normality to a greater extent and is adopted. NB is derived from a subset of 417 streams with environmental data, shown in Fig. 2. (d) A structural equation model showing only the significant paths (p < 0.05) with corresponding standardized regression coefficients and, in parentheses, coefficients of non-determination (1 – R²) for each response variable. A root mean square error of approximation of less than 0.00001 indicates an excellent model fit. E1-E3 = error terms. Number of species = 488.
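The hump-shaped body weight models in Suppl. Fig. S5b imply a body weight at which predicted distribution peaks. The following is a minimal sketch of that arithmetic, assuming only the rounded coefficients quoted in the caption; the helper name `quadratic_vertex` is hypothetical, and because the units of M are not stated in the caption, the result is left on the ln scale.

```python
def quadratic_vertex(c0, c1, c2):
    """Return (x*, y*) at the turning point of y = c0 + c1*x + c2*x**2."""
    if c2 == 0:
        raise ValueError("not a quadratic: c2 is zero")
    x_star = -c1 / (2.0 * c2)
    y_star = c0 + c1 * x_star + c2 * x_star ** 2
    return x_star, y_star

# Occurrence model from the caption (rounded coefficients):
# ln D = 1.35 + 0.86 ln M - 0.11 (ln M)^2
ln_m_peak, ln_d_peak = quadratic_vertex(1.35, 0.86, -0.11)

# Geographic-range model from the caption (rounded coefficients):
# ln D = 4.51 + 1.32 ln M - 0.16 (ln M)^2
ln_m_peak_range, ln_d_peak_range = quadratic_vertex(4.51, 1.32, -0.16)

print(f"occurrence peaks at ln M = {ln_m_peak:.2f} (ln D = {ln_d_peak:.2f})")
print(f"range peaks at ln M = {ln_m_peak_range:.2f} (ln D = {ln_d_peak_range:.2f})")
```

With these rounded coefficients, the two curves peak at similar body weights (ln M of about 3.9 and 4.1), which is consistent with the statement that the two measures of distribution behave similarly.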

Supplementary Figure S6. Fish, testing the metacommunity model. Quadratic regressions of ln body weight (M) and (a) ln maximum relative abundance (Nmax): ln Nmax = –3.27 + 0.95 ln M – 0.13 (ln M)², R² = 0.16 (p < 0.00001 for all parameters); and (b) niche breadth (NB): NB = 26.34 + 16.58 ln M – 2.01 (ln M)², R² = 0.14 (p < 0.00001 for all parameters). Both Nmax and NB are measured as defined in Suppl. Fig. S5. Number of species = 488.

Supplementary Figure S7. Invertebrates, testing the community model. (a) Relationships of species richness (S) with significant slope d values (p < 0.1) from linear regressions of local population density (N) against regional distribution (occurrences) (D): ln N = d0 + d ln D, where d0 = intercept. A higher p-level than the conventional 5% is adopted to ensure that relationships slightly affected by single outliers, which tend to inflate the p-values, are included. The best fit and regression parameters are given in the figure (p < 0.000001). (b) Partitioning of the variance of slope d in the subset with environmental data, as outlined in Legendre & Legendre (1998). The variance of slope d and species richness explained by the environment is shown next to the solid red arrows, while the variance of slope d explained by richness is given next to the solid black arrow. The pure environmental effect (next to the dotted red arrow), the pure richness effect (next to the dotted black arrow), and the covariance effect of richness and environment (in the red triangle) are also shown. The negative predictors of slope d include fluoride concentration and wetland area, while the positive predictors encompass human population density as well as sodium and nitrate concentrations, both concentrations associated with more extensive agriculture. The relationships of invertebrate richness with all but one of these variables (F−) are opposite. Basic statistics of the environmental variables are given in Suppl. Table S1.
+/– = positive/negative correlation with slope d. № = number of communities.

Reference: Legendre, P. & Legendre, L. (1998). Numerical Ecology. Second English Edition. Elsevier Science B.V., Amsterdam, The Netherlands.

Supplementary Figure S8. Fish, testing the community model. Relationships of species richness (S) with regression parameters of proportion relative abundance (N): (a) significant parameter b2 values (p < 0.1) from the quadratic regressions of N against body size (M): ln N = b0 + b1 ln M + b2 (ln M)² (10 out of the 287 significant b2 values are identified as outliers and removed prior to regression) and (b) significant slope d values (p < 0.1) from the linear regressions of N against species distribution (occurrences) (D): ln N = d0 + d ln D, where b0 and d0 = intercepts, and b1 = parameter. The fit and regression parameters are given in the figures. p = 0.00003 in the first regression and p < 0.000001 in the second regression. № = number of communities.

Supplementary Figure S9. Fish, testing the community model. Partitioning of the variance of (a) parameter b2 from the quadratic regressions of proportion relative abundance (N) against body size (M): ln N = b0 + b1 ln M + b2 (ln M)², and (b) slope d from the regressions of N against occurrence (D): ln N = d0 + d ln D, in the fish subset with environmental data. The variance of parameter b2, slope d, and species richness explained by the environment is shown next to the solid red arrows, while the variance of parameter b2 and slope d explained by richness is given next to the solid black arrows. The pure environmental effect (next to the dotted red arrows), the pure richness effect (next to the dotted black arrows), and the covariance effect of richness and environment (in the red triangles) are also shown. To linearize the relationships of richness (S) with parameter b2 and slope d, richness is transformed as S⁻¹.
This transformation also increases the R² of the multiple regressions of richness against the environmental predictors in (a) and (b). Low richness communities with variable but generally positive values of parameter b2 and negative values of slope d are found in streams of lower temperature and comparatively pristine watersheds, covered by forests and shrublands. High richness communities with negative b2 and positive slope d values are detected in streams of higher temperature and more extensive agriculture (percent clay in the soil, which is a negative predictor of parameter b2 and a positive predictor of richness, is highly positively correlated with agriculture). As expected from the relationships in Suppl. Fig. S8, richness and parameter b2 show opposing responses to the environment, while richness and slope d behave in a similar fashion. Basic statistics of the environmental variables are provided in Suppl. Table S1. +/– = positive/negative correlation with parameter b2 or slope d. № = number of communities.
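The variance partitioning described in Suppl. Figs S7b and S9 can be illustrated with the standard R²-based decomposition for two sets of predictors. This is only a sketch under stated assumptions: the data are synthetic, the function names (`r_squared`, `partition_variance`) are hypothetical, and plain (unadjusted) R² from ordinary least squares is assumed, which may differ from the exact procedure the captions refer to.

```python
import numpy as np

def r_squared(y, predictors):
    """R^2 of an ordinary least-squares fit of y on predictors plus an intercept."""
    X = np.column_stack([np.ones(len(y)), predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return 1.0 - residuals.var() / y.var()

def partition_variance(y, env, richness):
    """Split the variance of y into pure, shared, and unexplained fractions."""
    r2_env = r_squared(y, env)            # environment alone
    r2_rich = r_squared(y, richness)      # richness alone
    r2_full = r_squared(y, np.column_stack([env, richness]))
    return {
        "pure_environment": r2_full - r2_rich,
        "pure_richness": r2_full - r2_env,
        "shared": r2_env + r2_rich - r2_full,  # covariance effect; can be negative
        "unexplained": 1.0 - r2_full,
    }

# Synthetic stand-ins (NOT the study data): two environmental predictors,
# a richness term (e.g. S^-1) that partly depends on the environment,
# and a response playing the role of slope d.
rng = np.random.default_rng(0)
n = 200
env = rng.normal(size=(n, 2))
richness_inv = 0.6 * env[:, 0] + rng.normal(scale=0.8, size=n)
slope_d = (0.5 * env[:, 0] + 0.3 * env[:, 1]
           + 0.4 * richness_inv + rng.normal(scale=0.7, size=n))

parts = partition_variance(slope_d, env, richness_inv)
for name, fraction in parts.items():
    print(f"{name}: {fraction:.3f}")
```

The four fractions sum to one by construction. The shared fraction corresponds to the covariance effect of richness and environment shown in the red triangles of the figures.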