Review of “Selenium accumulation and red-winged blackbird productivity 2003-2005”

Carl James Schwarz, P.Stat.
Department of Statistics and Actuarial Science
Simon Fraser University
8888 University Drive
Burnaby, BC V5A 1S6
cschwarz@stat.sfu.ca

2010-05-05

1. Introduction

This is a review of the statistical methodology used in the report “Selenium accumulation and red-winged blackbird productivity 2003-2005” that was prepared by SciWrite Environmental Services, dated 2007-04-13. A pdf file of the report was supplied to me by the B.C. Ministry of Environment.

2. Sampling Protocol

The sampling protocol was held fairly constant over the three years of the study as follows:

Sediment and Water Sampling: Triplicate water samples were taken from all sites in 2005 and selected sites in 2004. Each sample was measured for the concentration of Se. The raw data are presented in Appendix 2 (Table 9) in the report. Sediment samples were taken from selected sites. The raw data are presented in Appendix 2 (Table 10) in the report.

Egg Collection and Productivity: Blackbird nests were identified at various sites in the study area. Either 0, 1 or 2 eggs were removed from selected nests. The removed eggs were weighed, sized, and the concentration of Se was determined (excluding the shell). The raw data are presented in Appendix 5 (Table 20 and Table 21) of the report. The remaining eggs in the nest were monitored over time, and the number of failed eggs, hatched eggs, and fledged chicks was determined. The raw data are presented in Appendix 5 (Table 24 and Table 25) of the report.

Prey Item Sampling: In 2005, prey items were sampled from live chicks at selected sites. These were analyzed for total Se. The raw data are available in Appendix 4 (Table 17).

Liver sampling: Dead nestlings had their liver excised and Se measured. The raw data are presented in Appendix 9 (Table 31).

Blood sampling: Blood glutathione peroxidase activity was measured using blood samples from juveniles at selected sites. The raw data are presented in Appendix 8 (Table 30).

The sites in the study area were classified by Se concentration in the water/sediment as High Se Exposure, Low Se Exposure, or Reference; the latter presumably were not affected by mining operations in the study area. Not every site was monitored in all years of the study.

3. Review of Statistical Analysis

The analyses done in this report are described on page 24, but the description is too brief to know exactly how the data were analyzed. I extracted the raw data from the appendices in the report to assess which models were fit and to verify the findings.

Based on what was reported and a comparison with a more appropriate reanalysis of the data, I have concluded that, unfortunately, most of the analyses in the report were done incorrectly and that many of the conclusions are not supported by the data. In the following sections, I briefly review my findings and present more appropriate models. The more appropriate models are all available in standard statistical packages and do not require specialized software to fit.

I have not reviewed all of the analyses in the report but have concentrated on the analyses with large amounts of data.

3.1 Analysis of Prey Items (page 32)

On page 32, the report states that “Of the prey items that could be identified by source … 36 prey items (43%) were terrestrial and 48 (37%) were aquatic. This difference was statistically significant (X2, p=.01)”. The hypothesis being tested was never clearly specified (e.g. is the hypothesis that the proportions of terrestrial and aquatic prey items should be 50:50?) and so the p-value cannot be interpreted.
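
To illustrate why the hypothesis must be stated, the following minimal sketch computes a chi-square goodness-of-fit test of the quoted counts (36 terrestrial, 48 aquatic) against a 50:50 split. The 50:50 null is only an assumption about what might have been intended, and the function name is hypothetical. The calculation also pools over sites and so ignores the sampling structure; it is shown to make the dependence on the null explicit, not as a recommended analysis.

```python
import math


def chisq_gof_two_categories(count_a, count_b, prop_a=0.5):
    """Chi-square goodness-of-fit test for two categories (1 df).

    prop_a is the hypothesised proportion for the first category;
    0.5 is an assumed null, not one stated in the report.
    """
    total = count_a + count_b
    expected_a = total * prop_a
    expected_b = total * (1.0 - prop_a)
    statistic = ((count_a - expected_a) ** 2 / expected_a
                 + (count_b - expected_b) ** 2 / expected_b)
    # For 1 df, the upper-tail chi-square probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Counts as quoted from page 32 of the report, pooled over all sites.
statistic, p_value = chisq_gof_two_categories(36, 48)
print(f"X2 = {statistic:.3f}, p = {p_value:.3f}")
```

Changing `prop_a`, or testing a different hypothesis altogether (e.g. equal prey distribution among exposure areas), gives a different statistic and p-value, so a p-value reported without its hypothesis cannot be checked.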

The authors appear to have used some sort of chi-square test, but appear to have engaged in sacrificial pseudoreplication (Hurlbert, 1984) by ignoring the structure of the sampling protocol, with samples from different sites being pooled prior to analysis. Perhaps the intent was to examine whether the prey distribution (terrestrial vs aquatic) varied among exposure areas, or among sites?

3.2 Analysis of Mean Se in egg vs. Se in water (Figure 13)

The authors fit a curve relating the concentration of Se in the egg contents to the Se concentration in the water at the various sites. There are a number of difficulties in their analysis.

In order to avoid pseudo-replication (Hurlbert, 1984), the mean concentration of Se for all eggs measured at a site was used. This is appropriate. However, the authors failed to account for potential multiple measurements on each site over the 3 years (i.e. if a site was measured for 3 years, the 3 averages were used as if they were independent). A simple regression model will not be appropriate.

Also, Figure 7 shows that the variability in the aquatic concentration of Se is large over the 3 samples (high standard error of the mean), but only the mean value was used along the X axis. No allowance was made for the uncertainty in the aquatic Se concentration. This is a difficult problem (called the errors-in-variables problem): when the X measurements are subject to large uncertainties, ordinary regression methods are no longer appropriate.

The authors never present the fitted equation. They state in the legend of Figure 13 that a polynomial equation was fit, but in the text (page 34) claim that “The asymptote of the curve (the level that the curve approaches but does not quite reach) was about 24 mg/kg dry weight. Above about 80 ug/L aqueous selenium, further increases in MES would be neglible.” However, a polynomial model (e.g. a quadratic) does not have an asymptote and does reach a maximum which is much greater than the apparent limit in this plot.
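
The distinction between a polynomial and a curve with an asymptote can be seen in a short sketch. All coefficients below are hypothetical and chosen only for illustration (the fitted equation is not given in the report), and the saturating form `upper*x/(half + x)` is an assumed example of a curve with an asymptote, not the authors' model.

```python
def quadratic(x, b0, b1, b2):
    """Second-degree polynomial b0 + b1*x + b2*x**2."""
    return b0 + b1 * x + b2 * x ** 2


def quadratic_turning_point(b1, b2):
    """x value at which the quadratic reaches its maximum (requires b2 < 0)."""
    return -b1 / (2.0 * b2)


def saturating(x, upper, half):
    """Saturating curve upper*x/(half + x); approaches `upper` as x grows."""
    return upper * x / (half + x)


# Hypothetical coefficients, not estimates from the report's data.
B0, B1, B2 = 5.0, 0.4, -0.002
UPPER, HALF = 24.0, 10.0

x_peak = quadratic_turning_point(B1, B2)
for x in (50, x_peak, 150, 250):
    print(f"x={x:6.1f}  quadratic={quadratic(x, B0, B1, B2):7.2f}  "
          f"saturating={saturating(x, UPPER, HALF):6.2f}")
```

Past its turning point the quadratic declines and eventually goes negative, whereas the saturating curve keeps rising toward its upper limit without reaching it. A claim about an asymptote therefore needs a model of the second kind, with the fitted equation reported.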

3.3 Analysis of Mean log(Se) in egg vs exposure (Figure 15)

The authors claim to have done a multivariate comparison of the effect of year and site on Se concentration using MANOVA, but MANOVA is not the appropriate tool in this case. Not all sites are measured in all years and the default analysis in MANOVA is casewise deletion, i.e. sites that are not measured in all years will be deleted. More modern ANOVA approaches can (and should) be used to allow for these missing data.

3.4 Analysis of Mean log(Se) in egg across years (Bottom of page 35)

The authors do a comparison over all sites of the mean log(Se) concentration across years. I was able to reproduce their results using an incorrect model that did not account for having the same site repeatedly measured over time, did not account for the different mix of sites and exposure levels in the three years, and did not account for the multiple eggs measured from the same nest. The authors' conclusion reflects a shift in monitoring effort across the years among sites (and exposure levels).

The proper model needs to account for the hierarchical structure and takes the form (in standard model notation)

    log(Se) = year site(r) nest(site*year)(r)

where the (r) indicates the random effect of site or nest. Under this correct model, the conclusions are different from those reported by the authors, with statistically significant differences in the mean log(Se) concentration detected between 2004 and 2005. No difference in the mean log(Se) concentration between 2005 and 2003 was detected because of the much smaller sample size in 2003 compared to 2004 and 2005.

3.5 Analysis of Mean log(Se) in egg across sites (Bottom of page 36)

The authors performed a comparison of mean log(Se) across sites (colonies) as reported in Figure 16. Their analysis fails to account for the multiple eggs measured in each nest.
The standard error bars in Figure 16 are too small and the comparisons of the mean log(Se) level among the sites have an increased Type I error (false positive) rate.

The proper model accounts for the multiple eggs measured in a single nest (i.e. there may be some ecological factor so that the eggs within a single nest are not independent in their Se levels). In standard model notation the proper model is:

    log(Se) = site nest(site)(r)

where the (r) indicates the random effect of nest upon the multiple eggs within the nest. Fewer statistically significant effects are found: many of the reported differences among sites in Appendix Table 15 are false positives and the reported standard errors are too small.

The authors should also replace their text at the bottom of page 36 with a suitable graphic such as a joined-line plot (refer to Section 6.6.7 of http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/PDF/Chapter06.pdf).

3.6 Analysis of mean log(Se) in egg across exposure level (Bottom of page 37)

The authors performed a comparison of mean log(Se) concentration across exposure levels. I was able to reproduce their results using an incorrect model that did not account for having multiple sites within each exposure level and did not account for multiple eggs from the same nest in each site. The authors' conclusions reflect both the effects of exposure level and differences in sampling effort among the sites within exposure level. For example, in Figure 16, it is quite apparent that the LCM site has much lower Se concentrations than the CP and GM sites, but all are within the same high exposure category.

The proper model needs to account for the hierarchical structure and takes the form (in standard model notation):

    log(Se) = exposure site(exposure)(r) nest(site*exposure)(r)

where the random effects of multiple sites within each exposure level and nests within each site are accounted for.
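
Fitting the nested model requires mixed-model software, but the effect it corrects can be illustrated with a rough sketch. The log(Se) values, site names, and nest names below are all hypothetical. The sketch compares the standard error obtained by pooling every egg with the one obtained from successive averages (eggs within nests, then nests within sites). Successive averaging is only an approximation to the mixed model, not a replacement for it.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean, treating `values` as independent."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical log(Se) values: site -> nest -> eggs, one exposure level.
high_exposure = {
    "site_A": {"nest_1": [0.48, 0.52], "nest_2": [0.50]},
    "site_B": {"nest_1": [0.98, 1.02], "nest_2": [1.00, 1.00]},
    "site_C": {"nest_1": [1.49, 1.51], "nest_2": [1.50]},
}

# Naive analysis: pool every egg and treat each as an experimental unit.
all_eggs = [egg
            for nests in high_exposure.values()
            for eggs in nests.values()
            for egg in eggs]
naive_se = standard_error(all_eggs)

# Successive averages: eggs -> nest means -> site means.
site_means = [
    statistics.mean([statistics.mean(eggs) for eggs in nests.values()])
    for nests in high_exposure.values()
]
site_level_se = standard_error(site_means)

print(f"naive SE (n={len(all_eggs)} eggs): {naive_se:.3f}")
print(f"site-level SE (n={len(site_means)} sites): {site_level_se:.3f}")
```

In this made-up example the sites differ a great deal while eggs within a site are nearly identical, so the pooled analysis reports a standard error well below the one based on sites, which are the experimental units for an exposure comparison.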
Under this correct model, there is in fact NO evidence of a difference in mean log(Se) levels across exposure groups, mainly because of the very high variability in mean log(Se) levels within the High exposure group.

The same incorrect model was used in previous years (Appendix Table 13) and many of the conclusions are actually not supported by the data. For example, there is no evidence of a difference in the mean log(Se) concentration across exposure level in any year. The revised model above needs to be fit for each year of the data.

3.7 Comparison of clutch sizes across year (Page 39)

As in the analysis of the mean Se levels, this analysis fails to account for the hierarchical structure of the data, with multiple nests in multiple colonies in each year, and not every colony measured in every year or with the same intensity. The reported standard errors are too small.

The authors used a simple ANOVA, but a Poisson regression model that recognizes that the data are discrete small counts may be more appropriate.

3.8 Comparison of Hatchability (Page 40)

As in the analysis of the mean Se levels above, this analysis fails to account for the hierarchical structure of the data, with multiple colonies in each exposure category, multiple nests in multiple colonies in each year, and not every colony measured in every year or with the same intensity. The reported standard errors are too small.

Additionally, the authors used ANOVA on the hatchability rate (ratio of eggs that hatch to eggs monitored). This will be approximately correct, but is old fashioned. A more appropriate approach is to use a generalized linear mixed model (GLIMM; logistic regression with random effects) that mimics the mixed models for the mean Se analyses. Logistic regression is the (now) standard way to analyze data that present themselves as a proportion (i.e. proportion of eggs that hatch).
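
One practical consequence of working on the logit scale is that interval estimates stay between 0 and 100%. The sketch below uses hypothetical counts and a single proportion with no random effects, so it is not the GLIMM itself. It contrasts an interval computed directly on the proportion scale with one computed on the logit scale and back-transformed; the function names are illustrative only.

```python
import math


def wald_interval(successes, trials, z=1.96):
    """Normal-approximation interval computed directly on the proportion scale."""
    p = successes / trials
    half_width = z * math.sqrt(p * (1.0 - p) / trials)
    return p - half_width, p + half_width


def logit_interval(successes, trials, z=1.96):
    """Interval computed on the logit scale and back-transformed.

    Requires 0 < successes < trials so that the log-odds are finite.
    """
    failures = trials - successes
    log_odds = math.log(successes / failures)
    se = math.sqrt(1.0 / successes + 1.0 / failures)

    def expit(value):
        return 1.0 / (1.0 + math.exp(-value))

    return expit(log_odds - z * se), expit(log_odds + z * se)


# Hypothetical data: 19 of 20 monitored eggs hatched.
print("proportion scale:", wald_interval(19, 20))
print("logit scale:     ", logit_interval(19, 20))
```

With a hatch rate this close to 100%, the upper limit on the proportion scale exceeds 1, while the back-transformed logit interval cannot leave the (0, 1) range.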
A logistic regression approach would also avoid the obvious problems in Figure 21 where the confidence intervals for the hatchability exceed 100%!

3.9 Comparison of nestling mortality and survival (Page 41)

As in the comparison of hatchability, the authors used incorrect models for this analysis and should use a more modern GLIMM approach.

The authors also reported the results of a chi-square and Fisher Exact test comparing the survival rates across the exposure classes. These analyses are incorrect and are an example of sacrificial pseudo-replication (Hurlbert, 1984). The problem is that the traditional chi-square and Fisher’s Exact tests cannot be conducted on data that are collected in a hierarchical fashion: it is not valid to simply pool across the colonies within the exposure levels. A GLIMM analysis must be done here.

The comparison of nestling mortality across sites (Figure 23) using ANOVA is also inappropriate. Here the very small counts require the use of Poisson ANOVA. The use of Poisson ANOVA would also avoid confidence intervals that extend below zero.

3.10 Comparison of Egg Health

The analysis of egg weights (Figure 26) is incorrect because of the failure to use a proper linear mixed model. See comments earlier in this report. Similarly, the comparison of Se levels among the hatched and failed eggs is also incorrect.

3.11 Comparison of Nesting survival by half of season

The authors should use a GLIMM logistic regression method here rather than ANOVA and need to account for the paired nature of the data. These methods will again avoid confidence intervals that exceed 100%.

3.12 Comparison of Productivity, Hatchability, and Nestling Survival vs Selenium Levels

The authors need to use a Poisson regression or logistic regression model for the number of failures and hatchability, respectively, rather than ordinary regression.
For example, ordinary regression methods would allow the number of failed eggs to fall below 0 (which is impossible) and the hatchability to exceed 100% (which is also impossible). Similarly, as noted above, the hierarchical structure of the data collection needs to be incorporated into the analysis.

3.13 Use of log() transform and interpretation of analysis

The authors used a logarithm base-10 transformation of the wet Se concentration in the analysis. The choice of base-10 logarithms vs natural (base e) logarithms does not affect the conclusions; the natural logarithm values are simply about 2.3x larger than the log-10 values.

However, the ANOVA on the log() values tests the hypothesis that the mean log(Se) concentration is the same among groups. This corresponds to the MEDIAN Se value on the anti-log scale, but the authors state their conclusions about the mean Se value on the anti-log scale.

3.14 Use of Bonferroni Multiple Comparison Procedure

The authors used a Bonferroni multiple comparison procedure (MCP) to control the experimentwise error rate. This MCP takes the simplistic approach that if every comparison has a .05 chance of a type I error (false positive), then k comparisons could have up to a k(.05) chance of at least one type I error in the set. So if there are 3 comparisons (e.g. among exposure classes), the overall error rate could be as high as 3(.05) or .15. This is unacceptably high. The Bonferroni approach declares each individual comparison statistically significant only if the p-value is less than (.05)/k, i.e. it makes it harder to detect a significant difference on each individual comparison so that the overall error rate is controlled at the .05 level.

The Bonferroni procedure is too conservative because its bound takes no account of the relationships among the comparisons, and the pairwise comparisons are not independent. For example, when comparing effects across exposure levels, the High vs Low, High vs Reference, and Low vs.
Reference comparisons use each level twice in the set of three comparisons. In this case, a preferred multiple comparison procedure is the Tukey-Kramer procedure, which is available in all standard statistical packages.

4. Conclusions

The major problems in the statistical analyses in this report are:

(a) Failure to explicitly state the statistical model that was fit to the data.

For example, in Figure 13, what was the equation that was fit? How are the effects of multiple measurements on the same site over 3 years taken into account? In the analysis of the Se concentration in eggs vs exposure levels, the statistical model was never explicitly stated. The computer code used to fit the model for each analysis needs to be presented in an appendix so that readers can verify that the correct models have been fit.

(b) Failure to account for the hierarchical nature of the sampling protocol.

The statistical model needs to match the way the data are collected. For example, there are multiple sites within each exposure level. There are multiple nests within each site. There are multiple eggs measured from each nest. The models in this report that were fit to the individual egg values do not account for this hierarchical sampling scheme. This implies that the measurements at the lowest level in the hierarchy are pseudoreplicates and not the “experimental units”. For example, in the comparison of Se concentration among exposure levels, the site is the experimental unit, and not the egg. The (incorrect) models used by the authors often confound differences in sampling effort across years with exposure or year effects.

These (incorrect) models that treat the egg as the experimental unit lead to standard errors that appear to be too precise (i.e. smaller than can be justified by the data) and inflated Type I error (false positive) rates (i.e. more conclusions about the existence of effects than can be justified by the data).

Some of the problems can be resolved by taking successive averages, e.g.
average the Se from multiple eggs in the same nest; average the averages for multiple nests within the same site. However, this analysis will only be approximate, and a linear mixed model ANOVA should be used as outlined above.

(c) Use of ANOVA/regression instead of logistic or Poisson ANOVA/regression.

In cases where the response variable is a count of successes/failures, a better approach is to use logistic ANOVA/regression rather than simple ANOVA. The logistic approach must, of course, properly account for the hierarchical structure of the data collection, and this can be done with Generalized Linear Mixed Models (GLIMM), the extension of mixed models to generalized linear models. The logistic approach will properly account for the discrete nature of the data and for the fact that survival rates must be between 0 and 100%.

Similarly, the analysis of data with very small counts should be done with a Poisson ANOVA/regression approach. Again, this needs to account for the hierarchical structure of the data. The Poisson approach properly accounts for the discrete nature of the data and for the fact that counts must be non-negative.

References:

Hurlbert, S. H. (1984). “Pseudoreplication and the design of ecological field experiments”. Ecological Monographs 54 (2): 187–211. doi:10.2307/1942661.