DRAFT

Nonparametric Statistical Testing for Baseline Risk Assessment Screening

1.0 Introduction

The purpose of baseline risk assessment (BRA) screening at a possibly contaminated site is to focus investigatory resources. An essential aspect of this is to determine which analytes merit further study in the risk assessment process; that is, to identify the chemicals of potential concern (COPC) for the site. The COPC list begins as a very large list and is whittled down based on the results of BRA screening. Analytes that are not detected in any samples are dropped from further consideration. Analytes that are known to be common laboratory contaminants, and for which there is evidence that all apparent detections may be due to laboratory contamination, are considered for removal from the COPC list. Analytes that are known to be naturally occurring and appear not to have elevated site concentrations relative to background (BG) are removed from the COPC list. This last step is the main point of application of statistical testing to BRA screening, although one could certainly argue that statistical methodology should also be applied to the screening of common laboratory contaminants.

Two important features of statistical testing are the false positive error rate and the false negative error rate. The false positive rate is the probability of making a false positive (Type I) error. In the setting of BRA screening, the statistical distributions of site concentrations are assumed to come from the same population as BG unless there is sufficient evidence to the contrary. A false positive error is the false conclusion that site concentrations of the analyte tend to be higher than BG concentrations. The false positive rate is the client's risk of having to perform unnecessary laboratory testing (and perhaps cleanup). The false negative rate is the probability of making a false negative (Type II) error.
In the setting of BRA screening, a false negative error is the false conclusion that site concentrations of the analyte do not tend to be higher than BG concentrations and that, consequently, the analyte does not merit further consideration in risk assessment or remediation.

1.1 Some Notation

This section briefly discusses some concepts and notation needed later. The shift or difference model is:

    X_i ~ F(x), iid for i = 1, ..., m,
    Y_j ~ F(y − Δ), iid for j = 1, ..., n,        [1.1]
    and X_i is independent of Y_j for all i and j.

F(·) is a probability distribution function¹. The symbol "~" in the first line above denotes that F(·) is the common probability distribution function of the X_i. Also, "iid" means "independent and identically distributed".

A null hypothesis is a statement regarded as true unless there is sufficient evidence to the contrary. It is typically denoted H0. The alternative hypothesis, H1, is the statement accepted when H0 is rejected. The size or false positive rate (α) of a test is the probability of rejecting H0 when H0 is actually true. The false negative rate (β) of a test is the probability of failing to reject H0 when H0 is actually false. The Mann-Whitney-Wilcoxon test is often used under the shift model to test

    H0: Δ = 0  versus  H1: Δ > 0.        [1.2]

The power of a test is the probability that H0 is rejected given that a specific simple alternative H1* is true, which of course means that H0 is false. The power of a test is a function of the parameter value in a simple alternative hypothesis; it equals 1 − β, where β is understood to depend on the alternative parameter value. The alternative hypothesis in [1.2] is a composite alternative: more than one value of Δ is consistent with H1. By contrast, a simple alternative hypothesis specifies a single value of the parameter. An example of a simple alternative hypothesis is H1*: Δ = 1, where the distribution F(·) is known.
If F(·) is unknown under the shift model, the alternative hypothesis H1*: Δ = 1 is no longer simple; since the space of distribution functions is infinite dimensional, the parameter space represented by Δ = 1 with F(·) unknown is actually infinite dimensional, with F(·) as an infinite dimensional nuisance parameter.

2.0 What Kinds of Tests Are Appropriate?

Observations of site or BG data are often positively spatially dependent: results from nearby sample locations (at the same depth) tend to be more similar than results from distant sample locations. Near a "hot spot", observations of site data may tend to be negatively spatially dependent over a short distance: results from nearby sample locations (at the same depth) tend to vary widely. Generally, environmental data (site or BG data) are neither independent nor identically distributed. However, under certain sampling designs and for certain purposes we may justify the use of the iid model.

Let F(·) represent the probability distribution of measurements obtained at random locations in a designated area A. Then F(·) is known as the global distribution function on the area A (Isaaks & Srivastava, chapter 18) or the spatial distribution (Journel). It is the mixture of the probability distribution functions at each point in area A. The results of sampling at random locations in area A, when location is ignored, behave as iid observations from the global distribution F(·). The global distribution is a mixture distribution; this is apparent if the local distributions at each point in area A are not identical. Equivalently, the distribution of measurements obtained at random locations in a designated area A is itself a mixture distribution.

¹ A probability distribution function is a nondecreasing, right-continuous function F such that F(−∞) = 0 and F(∞) = 1.
Therefore, for testing global hypotheses concerning the area A, or for estimation of global parameters of the area A, the ordinary statistical procedures used in the iid case can be used correctly, although they may not be very efficient compared to other procedures. If the quantity of interest has spatial dependence over area A, then estimation based on sampling with a randomized regular grid is more efficient (gives estimates with smaller variance) than estimation based on simple random selection of sampling locations (Cochran, pp. 227-228), and an improved estimate of the global distribution function may be obtained from data collected at random locations by using spatial declustering weights (Isaaks & Srivastava, chapter 18). For sampling on a randomized regular grid, the spatial declustering weights are all equal², like the weights in the ordinary empirical distribution function. This argues that the empirical distribution function based on observations from a randomized regular grid converges more quickly to the global distribution function than does the empirical distribution function based on sampling at random locations.

Here we are not considering averaging over an infinite number of realizations of a spatial random process; we are only considering the realization at hand. The observations are indeed random variables in this case, because the sampling locations are randomized and the observations are subject to random collection and measurement errors. They are also spatially dependent, because their means are a function (albeit an unknown function) of location. It is well known that in the iid case the empirical distribution function converges to the true distribution function with probability 1 (wp1) as the number of observations goes to infinity.
It is also the case that the empirical distribution function based on sampling at random locations in area A converges to the global distribution function wp1 as the number of observations goes to infinity. Since the area A is fixed and bounded, this means convergence under infill asymptotics. Because sampling on a randomized regular grid is more efficient than sampling at random locations, it must also be true that the empirical distribution function based on sampling from a randomized regular grid over area A converges to the global distribution function at an even faster rate; that is, it behaves like the empirical distribution function based on sampling at a larger number of random locations in area A. This is certainly worth further investigation. Therefore, sampling on randomized regular grids is virtually always to be preferred over sampling at random locations.

Note that this behavior (of approximating the global distribution) may be quite different under increasing domain asymptotics, or when considering multiple realizations of the spatial random process, where ergodicity of the process comes into play. However, in these cases also, sampling on randomized regular grids is the preferred method of sampling.

Statistics based on sample moments or ranks are functions of the empirical distribution function of the sample. Hence, tests based on such statistics should behave as if they were computed from a larger number of observations; their actual significance level (α) should be slightly lower than nominal, and their actual power slightly higher than nominal. This by no means excludes the construction of more powerful tests and more efficient estimators that explicitly use the spatial nature of the data.

² This is true for global kriging weights and for polygon declustering weights, except perhaps for points on the convex hull of the sample grid. Cell declustering weights for a regular grid should be very close to equal.
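The efficiency argument above can be illustrated with a small simulation. The smooth "concentration surface", the 4 x 4 grid, and the sample sizes below are hypothetical choices for illustration only; the point is that, when a spatial trend is present, a randomized regular grid estimates the global mean with smaller sampling variance than simple random sampling of locations.

```python
import random

# Monte Carlo sketch (all choices hypothetical): estimate the global mean
# of a smooth, spatially varying "concentration surface" over the unit
# square by (a) simple random sampling of 16 locations and (b) a randomized
# regular 4 x 4 grid (one uniform point inside each cell).

def surface(x, y):
    # Hypothetical smooth trend standing in for a spatially dependent field.
    return 10 + 5 * x + 3 * y + 2 * x * y

def srs_mean(n, rng):
    return sum(surface(rng.random(), rng.random()) for _ in range(n)) / n

def grid_mean(k, rng):
    vals = []
    for i in range(k):
        for j in range(k):
            # One random location inside cell (i, j) of a k x k grid.
            vals.append(surface((i + rng.random()) / k, (j + rng.random()) / k))
    return sum(vals) / len(vals)

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((v - m) ** 2 for v in xs) / (len(xs) - 1)

rng = random.Random(1)
reps = 2000
v_srs = sample_variance([srs_mean(16, rng) for _ in range(reps)])
v_grid = sample_variance([grid_mean(4, rng) for _ in range(reps)])
# With a trend present, the randomized regular grid gives the smaller
# sampling variance for the estimated global mean.
```

Repeating the comparison with a flat surface would show the two designs performing about equally, which is why the preference for gridded designs rests on the presence of spatial dependence.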
On the other hand, the entire population of possible samples on a fixed site is finite, although it is quite large. Most common statistical procedures, whether classical, Bayesian, or geostatistical, assume an infinite population model, which is a good approximation to the real situation. Proponents of the nonparametric design-based approach to sampling (Cochran; de Gruijter & ter Braak) argue that the classical sampling theory for finite populations, as developed by Neyman, does not require independence. Viewing the realization of area A as fixed and finite, tests and estimates appropriate under finite sampling theory should be serviceable for comparison to BG. The improvement due to sampling on a randomized regular grid is presented by Cochran as equivalent to the improvement due to cluster sampling.

3.0 Power of Statistical Tests

3.1 Sample Size Planning

Before any sampling is conducted, sample size (that is, number of sample locations) determinations must be made, both for BG and for the sites being studied. Desired power against specified alternatives is an essential element of sample size determination for comparison to BG. We will use sample size determination for the two-sample t-test as an example. Let

    X_1, ..., X_m ~ iid N(μ0, σ²) and Y_1, ..., Y_n ~ iid N(μ1, σ²).        [3.1]

Then the sample means are distributed as:

    X̄_m = (1/m) Σ_{i=1}^m X_i ~ N(μ0, σ²/m) and Ȳ_n = (1/n) Σ_{j=1}^n Y_j ~ N(μ1, σ²/n).        [3.2]

Also, X̄_m is independent of Ȳ_n. Consequently,

    Ȳ_n − X̄_m ~ N(Δ, (1/m + 1/n)σ²), where Δ = μ1 − μ0.        [3.3]

Then the shift hypotheses are:

    H0: Δ = μ1 − μ0 = 0  versus  H1: Δ = μ1 − μ0 > 0.

Define

    T = (Ȳ_n − X̄_m) / (σ √(1/m + 1/n)) = √(mn/(m+n)) (Ȳ_n − X̄_m)/σ.

Then T ~ N(√(mn/(m+n)) Δ/σ, 1), and under H0, T ~ N(0, 1). To test H0 versus H1, we reject H0 for large values of T. The significance level (Type I error rate) of the test is determined by the equation P(T > z_{1−α} | H0) = α. For a specified Δ* > 0, let H1* denote the hypothesis that Δ = Δ*.
The power of the test of H0 versus H1* is determined by the equation:

    1 − β = P(T > z_{1−α} | H1*) = P(Z > z_{1−α} − √(mn/(m+n)) Δ*/σ) = 1 − Φ(z_{1−α} − √(mn/(m+n)) δ),

where Z is a standard normal random variable, Φ(·) is its distribution function, and δ = Δ*/σ. Then

    Φ(z_{1−α} − √(mn/(m+n)) δ) = β, so that z_{1−α} − √(mn/(m+n)) δ = z_β = −z_{1−β},

which gives

    mn/(m+n) = (z_{1−α} + z_{1−β})² / δ².

Actually, we do not know σ and must estimate it with S. An approximate but accurate correction for this is given in EPA QA/G4, page 63. Using this correction, we get

    mn/(m+n) = (z_{1−α} + z_{1−β})² / δ² + 0.5 z_{1−α}².

Note that Δ* = f·μ0 and δ = Δ*/σ = f·μ0/σ = f/CV, where f is the fractional increase over BG that we wish to detect with probability 1 − β, and CV = σ/μ0 is the coefficient of variation (relative standard deviation) for BG. Then we have

    mn/(m+n) = CV² (z_{1−α} + z_{1−β})² / f² + 0.5 z_{1−α}².

For instance, we may want to plan to be able to detect a 50% increase above the BG mean with a probability (level of confidence) of at least 0.8. Then we have β = 0.2 and δ = 0.5/CV. Now we may wish to use equal sample sizes for site and BG; that is, m = n. Then mn/(m+n) = n/2, and

    m = n = 2 CV² (z_{1−α} + z_{1−β})² / f² + z_{1−α}².

On the other hand, if we fix the number of BG samples at m = M, we must also fix a lower bound for n, say n0 = 7. For convenience, set

    K(α, β) = CV² (z_{1−α} + z_{1−β})² / f² + 0.5 z_{1−α}².

Solving Mn/(M+n) = K(α, β) for n gives

    n(M, α, β) = M·K(α, β) / (M − K(α, β)),

which requires M > K(α, β). Finally, we estimate the number of site samples by n* = max(n0, ⌈n(M, α, β)⌉), where ⌈x⌉ denotes the smallest integer greater than or equal to x. By the Central Limit Theorem, the distribution of the mean of our data will be asymptotically normal, so even if the data are non-normal, these sample size calculations should be at least approximately correct.

For purposes of planning when the Mann-Whitney-Wilcoxon test is to be used for comparison to BG, we make use of the asymptotic relative efficiency (ARE) of the Mann-Whitney-Wilcoxon test relative to the two-sample t-test. When the underlying distribution is normal, this ARE is about 0.955 (3/π exactly). So we need about 20 samples for every 19 that we calculate, or n_w = ⌈1.05 n*⌉.
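The equal-sample-size formula above can be sketched in a few lines of Python; the function names are ours, and the standard-library `NormalDist` supplies the normal quantiles z_{1−α} and z_{1−β}:

```python
from math import ceil
from statistics import NormalDist

def t_test_sample_size(cv, f, alpha=0.05, beta=0.20):
    """Equal-group sample size m = n for the one-sided two-sample t-test,
    from m = n = 2*CV^2*(z_{1-a} + z_{1-b})^2 / f^2 + z_{1-a}^2."""
    z = NormalDist().inv_cdf
    za, zb = z(1 - alpha), z(1 - beta)
    return ceil(2 * cv ** 2 * (za + zb) ** 2 / f ** 2 + za ** 2)

def wilcoxon_sample_size(cv, f, alpha=0.05, beta=0.20):
    # Inflate by about 1/ARE = 1.05 when the Mann-Whitney-Wilcoxon
    # test will be used instead of the t-test.
    return ceil(1.05 * t_test_sample_size(cv, f, alpha, beta))

# Detect a 50% increase over the BG mean (f = 0.5) with power 0.8 when CV = 1:
n_t = t_test_sample_size(cv=1.0, f=0.5)    # 53 per group
n_w = wilcoxon_sample_size(cv=1.0, f=0.5)  # 56 per group
```

Note how strongly the result depends on CV: the required n scales with CV², so a rough BG coefficient of variation must be in hand before sample sizes can be planned.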
In summary, the power of a statistical test is something that absolutely must be planned up front as part of the sampling design and incorporated into the sampling plan (Field Sampling Plan, Data Acquisition Plan, or QAPP). This requires decisions about the level of confidence we wish to have against false positive errors (1 − α), the specific alternatives against which we wish to protect (H1*), and the power or level of confidence we wish to have against false negative errors (1 − β). Finally, these decisions must be balanced against the costs of sampling and analysis and the costs of decision errors.

3.2 Retrospective

After the data have been collected and analyzed, statistical tests have been performed, and decisions have been made, it is possible to go back to the data to estimate the power achieved by the statistical tests. Although this sort of retrospective look at power is controversial among statisticians, it has been presented by USEPA as part of verifying the achievement of the Data Quality Objectives (DQO) goals for the project. This is demonstrated for the two-sample t-test in EPA QA/G9, pages 3 through 6 of section 3.3. Verifying that the correct statistical tests have been used is also part of this process. The DQO process is a planning process developed by USEPA and mandated for use in all projects in which environmental measurements are collected. For further information, refer to the relevant USEPA documents (EPA Order 5360.1, EPA QA/G4, EPA QA/G9).

4.0 Statistical Procedures

4.1 Properties of the Mann-Whitney-Wilcoxon Test

The Mann-Whitney-Wilcoxon test has two equivalent forms of the test statistic. The Wilcoxon form is the widely used W statistic: W is the sum of the ranks of the Y observations when the ranking is done over all the data. The Mann-Whitney form is

    U = (1/(mn)) Σ_{i=1}^m Σ_{j=1}^n I(X_i < Y_j) = the fraction of (X_i, Y_j) pairs in which X_i is less than Y_j,

where I(A) is the indicator function: I(A) = 1 if the event A occurs, and I(A) = 0 if it does not. The expected value of U is E(U) = P(X < Y), the probability that a randomly selected value from the X population is less than a randomly selected value from the Y population. W and U are related by

    W = mnU + n(n+1)/2.

We can therefore use the Mann-Whitney-Wilcoxon statistic to test the stochastic ordering hypotheses

    H0so: P(X < Y) ≤ 1/2  versus  H1so: P(X < Y) > 1/2,        [4.1]

or versus, for instance,

    H1so*: P(X < Y) = 2/3.

Stochastic ordering is defined as follows: let X be a random variable with probability distribution function F(·) and Y a random variable with probability distribution function G(·). Then Y is stochastically larger than X if F(t) ≥ G(t) for all t, with strict inequality for at least one t.³ So a graph of G(t) against F(t) would be on or below the diagonal (no part above and at least some part strictly below). H0so says that Y observations (the site measurements) do not tend to be larger than X observations (the BG measurements). H1so says that Y observations (the site measurements) do tend to be larger than X observations (the BG measurements). H1so* says that 2 out of 3 times, on average, a randomly selected site measurement will be larger than a randomly selected BG measurement. These hypotheses are more general than the shift (difference) hypotheses and make more sense for comparison to BG.

One way to calculate the power of the Mann-Whitney-Wilcoxon test for testing H0so against the alternative H1so* is to use the Two-Sample U-Statistic Theorem (Randles & Wolfe, p. 92) to find a normal approximation for the standardized distribution of W under H1so*. Under this theorem,

    √(m+n) [ (W − n(n+1)/2)/(mn) − θ ]

is approximately normally distributed with mean 0 and variance

    σ²_{m,n} = ζ_{1,0}/λ_{m,n} + ζ_{0,1}/(1 − λ_{m,n}), where λ_{m,n} = m/(m+n),

    ζ_{1,0} = P(X_1 < min(Y_1, Y_2)) − θ²,
    ζ_{0,1} = P(max(X_1, X_2) < Y_1) − θ², and        [4.2]
    θ = P(X_1 < Y_1).

³ Randles & Wolfe, pp. 130-131.
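The two forms and the identity W = mnU + n(n+1)/2 are easy to check numerically; the small datasets below are hypothetical and assumed free of ties:

```python
def rank_sum_W(x, y):
    """Wilcoxon form: sum of the ranks of the y observations when all
    m + n observations are ranked together (assumes no ties)."""
    combined = sorted(x + y)
    return sum(combined.index(v) + 1 for v in y)

def mann_whitney_U(x, y):
    """Mann-Whitney form: fraction of (x_i, y_j) pairs with x_i < y_j."""
    return sum(xi < yj for xi in x for yj in y) / (len(x) * len(y))

# Hypothetical BG (x) and site (y) measurements, free of ties:
x = [3.1, 4.0, 2.2, 5.5, 3.7]
y = [4.4, 6.1, 3.9, 7.0]
m, n = len(x), len(y)
W = rank_sum_W(x, y)      # 27
U = mann_whitney_U(x, y)  # 0.85
# The two forms satisfy W = m*n*U + n*(n + 1)/2.
```

Because the relation is linear, any test based on W is equivalent to the corresponding test based on U, and either form may be used for the stochastic ordering hypotheses.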
In order to get a reasonable estimate of σ²_{m,n} for a practical problem, one must specify distributions of X and Y that are reasonable for the problem. In nonparametric procedures, the null distributions of the test statistics do not depend on the underlying distribution of the data, but the alternative distributions typically do. Unfortunately, even for the simplest choices of distributions for X and Y under the alternative, the calculation of σ²_{m,n} is involved. For example, choosing the X distribution to be the continuous uniform distribution on the interval [0, 1] (written U[0,1]) and the Y distribution to be U[1/3, 4/3] (so that P(X < Y) = 7/9), it takes almost a page of theoretical calculation to determine that ζ_{1,0} = ζ_{0,1} = 4/81. By contrast, under H0so, ζ_{1,0} = ζ_{0,1} = 1/12. For nonsymmetric distributions (which are typical of environmental data), ζ_{1,0} and ζ_{0,1} are not equal. Often, σ²_{m,n} is estimated by Monte Carlo simulation.

Fortunately, there is another way to calculate the power of the Wilcoxon and other rank tests exactly, under a restricted family of alternative hypotheses known as the Lehmann alternatives. The Lehmann alternatives are discussed in section 4.2.

Despite the difficulties of calculating its power, the one-sided Mann-Whitney-Wilcoxon test is a very good test and in most cases a superior alternative to the two-sample t-test. It is unbiased and consistent against shift alternatives and stochastic ordering alternatives. Since it is nonparametric⁴ and distribution-free⁵, its actual Type I error level is always equal to the nominal level. When the data are from normal populations, the power of the Mann-Whitney-Wilcoxon test is almost as great as that of the two-sample t-test. For continuous distributions, the power of the Mann-Whitney-Wilcoxon test may be much better than, but is never much less than, the power of the two-sample t-test (Randles & Wolfe, pp. 117-119, 163-171).
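As an illustration of the Monte Carlo route, the quantities in [4.2] for the uniform shift example (X ~ U[0,1], Y ~ U[1/3, 4/3]) can be estimated by simulation; the sample size and seed below are arbitrary choices:

```python
import random

# Monte Carlo sketch of the quantities in [4.2] for X ~ U[0,1] and
# Y ~ U[1/3, 4/3]; sample size and seed are illustrative choices.
rng = random.Random(7)
N = 200_000

def draw_x():
    return rng.random()

def draw_y():
    return 1 / 3 + rng.random()

theta = sum(draw_x() < draw_y() for _ in range(N)) / N
p10 = sum(draw_x() < min(draw_y(), draw_y()) for _ in range(N)) / N  # P(X < min(Y1, Y2))
p01 = sum(max(draw_x(), draw_x()) < draw_y() for _ in range(N)) / N  # P(max(X1, X2) < Y1)
zeta10 = p10 - theta ** 2
zeta01 = p01 - theta ** 2
# theta comes out near 7/9, and both zeta terms near 4/81 = 0.0494,
# matching the theoretical values for this symmetric shift.
```

The same loop, with `draw_y` replaced by the site model of interest, gives a quick estimate of σ²_{m,n} for any alternative one cares to specify.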
Both the shift and stochastic ordering alternatives are important in BRA screening. Detection of site "hot spots", or high concentrations relative to BG, is also important in BRA screening. The right-shift alternative corresponds to widespread (that is, high probability) anthropogenic⁶ impact at a relatively constant level. The "hot spot" alternative corresponds to a high concentration impact over a small area (low probability). The stochastic ordering alternative encompasses both of these alternatives and intermediate cases as well. Testing for a "hot spot" alternative is essentially equivalent to testing for stochastic ordering in the right tails of the distributions. Although the Mann-Whitney-Wilcoxon test is consistent against all stochastic ordering alternatives, it was not designed to detect stochastic ordering in the right tails and has relatively low power against this sort of alternative. Consequently, another test is needed to augment the Mann-Whitney-Wilcoxon test in order to detect "hot spots". Fortunately, both the shift alternative and the stochastic-ordering-in-the-right-tail alternative can be formulated as Lehmann alternatives. Lehmann alternatives and these formulations are discussed in the next section.

4.2 Lehmann Alternatives

Consider the following model and compare it to the shift (difference) model in section 1.1:

    X_i ~ F(x), independent for i = 1, ..., m,
    Y_j ~ g(F(y)), independent for j = 1, ..., n,        [4.3]

where g(·) is a nondecreasing function on [0, 1] such that g(0) = 0 and g(1) = 1.

⁴ Nonparametric indicates testing hypotheses that are not a function of a parameter specific to a particular distributional form.
⁵ Distribution-free indicates that under the null hypothesis, the distribution of the test statistic does not depend on the underlying distribution of the data.
⁶ Anthropogenic means caused or created by man.
The hypotheses of interest for comparison to BG may be posed as

    H0: g(u) = u  versus  H1: g(u) = g*(u),        [4.4]

where g*(·) is constructed to satisfy the alternative of interest. Any g*(·) such that g*(u) ≤ u, with strict inequality for some u, specifies some version of stochastic ordering. The nice thing about the Lehmann alternatives is that, as long as F(·) is continuous, the distribution of any rank statistic⁷ under the alternative hypothesis does not depend on F(·) but only on g*(·). Thus, the distribution of any rank statistic under a Lehmann alternative is nonparametric distribution-free. Consider two cases of Lehmann alternatives:

(Case A): For the shift (difference) alternative,

    g*(u; ℓ) = (u − ℓ)/(1 − ℓ) for ℓ ≤ u ≤ 1, and g*(u; ℓ) = 0 for 0 ≤ u < ℓ, with 0 ≤ ℓ < 1,

is a reasonable construction for non-negative data. The hypotheses for the shift model become H0: ℓ = 0 versus H1: ℓ > 0. For ℓ = 0.1, for instance, g*(u; 0.1), which represents the specific alternative hypothesis H1*: ℓ = 0.1, has the graph shown in Figure 1. The dashed line is the graph of g(u) = u, which represents H0. Note that, in line with the definition of stochastic ordering, the graph of g*(u; 0.1) lies below the diagonal.

Figure 1: Example Case A Lehmann Alternative (graph of g*(u; 0.1); the dashed diagonal is g(u) = u)

If one has in mind a specified shift, say Δ*, that one wishes to detect with specified probability, and a specific distribution F(·) for BG (modeled from the data, perhaps), then ℓ* = F(Δ*) gives what is in essence a simple alternative hypothesis against which to evaluate power. In the case of the exponential distribution with mean μ, ℓ* produces a shift of Δ* = −μ ln(1 − ℓ*), a convenient multiple of the mean.

Lehmann (Lehmann 1953) used Hoeffding's Lemma to calculate the distribution of rank-based tests under Lehmann alternatives.

⁷ A rank statistic is a statistic that is a function only of the ranks of the data.
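The correspondence ℓ* = F(Δ*) is a one-line computation once a BG model is chosen. For example (hypothetical numbers), with an exponential BG model:

```python
from math import exp, log

# Sketch (hypothetical numbers): translate a target shift Delta* into the
# Case A parameter l* = F(Delta*) for an exponential BG model with mean mu.
mu = 2.0          # modeled BG mean
delta_star = 1.0  # shift to be detected

ell_star = 1 - exp(-delta_star / mu)  # l* = F(Delta*), about 0.393
delta_back = -mu * log(1 - ell_star)  # Delta* = -mu * ln(1 - l*), round trip
```

The round trip recovers Δ* exactly, which is just the inverse relationship Δ* = −μ ln(1 − ℓ*) stated above.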
The power of the Mann-Whitney-Wilcoxon test under the Case A Lehmann alternative is most easily derived by computing the mean and asymptotic variance of W under the alternative and using the normal approximation given by the Two-Sample U-Statistic Theorem. Taking F(·) to be U[0,1], which is without loss of generality since only g*(·) matters, we have

    E1(U) = θ = (1 + ℓ)/2,        [4.5]
    E1(W) = mn E1(U) + n(n+1)/2 = [n(m+n+1) + mnℓ]/2,        [4.6]
    ζ_{1,0} = (1 + 2ℓ)/3 − θ² = (1 + 2ℓ − 3ℓ²)/12, and        [4.7]
    ζ_{0,1} = (1 − ℓ)²/12.        [4.8]

Then √(m+n)[(W − n(n+1)/2)/(mn) − θ] is approximately normally distributed with mean 0 and variance

    σ²_{m,n} = ζ_{1,0}/λ_{m,n} + ζ_{0,1}/(1 − λ_{m,n}) = (1 + 2ℓ − 3ℓ²)(m+n)/(12m) + (1 − ℓ)²(m+n)/(12n).        [4.9]

So, for instance, if we take m = 15, n = 10, and ℓ = 0.5, we have

    E1(W) = [10(26) + 150(0.5)]/2 = 167.5, and
    σ²_{15,10} = (1.25/12)(25/15) + (0.25/12)(25/10) ≈ 0.2257,

so that Var(W) ≈ (mn)² σ²_{15,10}/(m+n) = 900 × 0.2257 ≈ 203.1. The 0.05 critical value of W, with m = 15 and n = 10, is 160, so we estimate the power to be

    P1(W ≥ 160) ≈ 1 − Φ((160 − 167.5)/√203.1) = 1 − Φ(−0.53) ≈ 0.70.

(Case B): For the stochastic ordering in the right tails alternative, one very reasonable construction has

    g*(u; ℓ, h) = (1 − ℓ)u + ℓu^h for 0 ≤ ℓ ≤ 1 and h > 1.

This is a mixture distribution representing BG interspersed with areas of elevated contamination; that is, moderate "hot spot" contamination. This interpretation is particularly appropriate if random or regular grid sampling has been used. For instance, ℓ = 0.5 represents the hypothesis that half of the area is impacted by moderately higher concentrations.

The graph of g*(u; ℓ, h) for the case with ℓ = 0.2 and h = 4 is shown in Figure 2. The dashed line is the graph of g(u) = u, which represents H0. Fixing ℓ, the hypotheses for the "hot spot" contamination model become

    H0: h = 1  versus  H1: h > 1,

which is very similar to the shift hypothesis case.

Figure 2: Example Case B Lehmann Alternative (graph of g*(u; 0.2, 4); the dashed diagonal is g(u) = u)

4.3 Rosenbaum's Two-Sample Location Test

A version of Rosenbaum's two-sample location test statistic is T, the number of Y values larger than the largest X value. This statistic is very useful for testing for stochastic ordering in the right tails.
The null distribution of T is

    P0(T = t) = C(m+n−t−1, n−t) / C(m+n, n), t = 0, 1, ..., n,        [4.10]

where m is the number of BG samples, n is the number of site samples, and C(a, b) denotes the number of ways that a objects can be selected b at a time. The critical value for the level-α test is the smallest integer C_α such that

    P0(T ≥ C_α) = Σ_{t=C_α}^{n} C(m+n−t−1, n−t) / C(m+n, n) ≤ α.

Sukhatme (Sukhatme 1992) has demonstrated a simpler method of computing the power of rank-based tests under Lehmann alternatives, using distribution-free exceedance statistics. Sukhatme's methodology and results were used to derive the results on power presented below. A copy of a preprint of Sukhatme's paper is attached to this report.

The power of Rosenbaum's two-sample location test under a Lehmann alternative is

    1 − β = P1(T ≥ C_α) = Σ_{t=C_α}^{n} P1(T = t),

and under the pure power alternative g*(u) = u^h each term reduces to

    P1(T = t) = (m/h) C(n, t) B(m/h + n − t, t + 1),

where B(a, b) is the Beta function, defined by B(a, b) = Γ(a)Γ(b)/Γ(a+b), and Γ(·) is the gamma function, defined by Γ(a) = ∫_0^∞ u^{a−1} e^{−u} du. Under the Case B mixture alternative, the corresponding terms expand into sums of such Beta functions with weights involving powers of ℓ and (1 − ℓ).

Under the "hot spot" model, high power can only be achieved if a substantial fraction of the area is contaminated. This is true for any rank-based test and arises from the powers of (1 − ℓ) that appear in the formula for the power against the BG-with-"hot spot"-contamination model. Nevertheless, Rosenbaum's test is certainly more sensitive to high values than the Mann-Whitney-Wilcoxon test is. However, it does not provide protection against the situation in which a relatively small fraction of the area is contaminated, but at high concentrations. This alternative can be protected against in a nonparametric and asymptotically distribution-free manner by using a probability inequality that holds for all distributions. This is discussed in the next section.

4.4 Chebyshev's Inequality

Chebyshev's inequality is probably the most familiar probability inequality. It states that, for any random variable X with mean μ and variance σ²,

    P(|X − μ| ≥ kσ) ≤ 1/k² for k > 0.

Of course, this is only useful for k > 1.
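The exact null distribution, the critical value, and the pure-power-alternative form of the power are all short computations; the sketch below uses our own function names and the pure alternative g*(u) = u^h as a stand-in for the fuller Case B expansion:

```python
from math import comb, exp, lgamma

def null_pmf(m, n, t):
    """Exact null probability P0(T = t) from [4.10]."""
    return comb(m + n - t - 1, n - t) / comb(m + n, n)

def critical_value(m, n, alpha):
    """Smallest integer C with P0(T >= C) <= alpha."""
    tail = 0.0
    for c in range(n, -1, -1):
        tail += null_pmf(m, n, c)
        if tail > alpha:
            return c + 1
    return 0

def beta_fn(a, b):
    # B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b), via log-gammas for stability.
    return exp(lgamma(a) + lgamma(b) - lgamma(a + b))

def power_pure_lehmann(m, n, h, alpha):
    """Power of the level-alpha Rosenbaum test against g*(u) = u**h
    (a sketch of the exceedance-statistic route; Case B mixtures expand
    into sums of such Beta-function terms)."""
    c = critical_value(m, n, alpha)
    return sum((m / h) * comb(n, t) * beta_fn(m / h + n - t, t + 1)
               for t in range(c, n + 1))

m, n = 15, 10
total = sum(null_pmf(m, n, t) for t in range(n + 1))  # the pmf sums to 1
c05 = critical_value(m, n, 0.05)                      # 4 for m = 15, n = 10
pw = power_pure_lehmann(m, n, h=4.0, alpha=0.05)
```

At h = 1 the "power" collapses to P0(T ≥ C_α) ≤ α, which is a convenient check that the pmf, critical value, and Beta-function pieces are mutually consistent.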
Given a sample of m BG observations X_1, ..., X_m, iid from an unknown distribution F(·) with finite variance σ², and given Y, a new independent observation from F(·), we have E(Y − X̄) = 0 and Var(Y − X̄) = ((m+1)/m) σ². From this, we obtain

    P(Y − X̄ ≥ k √((m+1)/m) σ) ≤ P(|Y − X̄| ≥ k √((m+1)/m) σ) ≤ 1/k² for k > 0.

Setting k = 10 gives

    P(Y ≥ X̄ + 10 √((m+1)/m) σ) ≤ 0.01,

which holds for all probability distributions with finite variance. In this class of distributions, X̄ is a consistent estimator of μ, and S = √[(1/(m−1)) Σ_{i=1}^m (X_i − X̄)²] is a consistent estimator of σ. Then

    C_n = X̄ + 10 √((m+1)/m) S

is a consistent estimator of μ + 10σ, and P(Y > C_n) converges in probability to P(Y > μ + 10σ) by Slutsky's Theorem. Therefore, C_n may be taken as an estimated upper bound to a 99% confidence Upper Prediction Limit (UPL) for BG observations. It provides a nonparametric and asymptotically distribution-free upper limit with protection against extreme "hot spots" that rank-based procedures cannot provide. This works because the limiting inequality,

    P(Y > μ + 10σ) = α_F ≤ 0.01 (where α_F depends on the BG distribution F),

holds for all BG distributions F. To choose k to achieve a specified (conservative) level α, set k = 1/√α. For instance, 10 = 1/√0.01, 6.32 ≈ 1/√0.025, and 4.47 ≈ 1/√0.05.

Applying this limit to each of n independent site observations, the overall false positive rate is

    α_o = 1 − (1 − α_F)^n ≤ 1 − (1 − α)^n ≤ nα.

For n = 10 and α = 0.01, α_o ≤ 0.096 < 0.1. Therefore, (1 − α)^n ≥ 1 − nα is a lower bound for the confidence coefficient of C_n as a UPL for n additional BG observations.

Other conservative Upper Prediction Limits for BG observations can be derived from other probability inequalities. In particular, one based on Markov's inequality, using the third absolute central moment of the BG distribution, gives limits that are a little tighter than C_n. However, this requires estimation of the third moment of the BG population, which cannot be estimated from small datasets nearly as accurately as the standard deviation.
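The limit itself is a one-line computation from the BG sample; the data below are hypothetical:

```python
from math import sqrt

def chebyshev_upl(bg, alpha=0.01):
    """Estimated conservative upper prediction limit
       C = xbar + k * sqrt((m + 1)/m) * S, with k = 1/sqrt(alpha);
    a sketch of the Chebyshev-based limit described above."""
    m = len(bg)
    xbar = sum(bg) / m
    s = sqrt(sum((v - xbar) ** 2 for v in bg) / (m - 1))
    return xbar + (1 / sqrt(alpha)) * sqrt((m + 1) / m) * s

# Hypothetical BG measurements (mg/kg):
bg = [12.0, 15.5, 9.8, 14.2, 11.7, 13.3, 10.9, 16.1]
upl99 = chebyshev_upl(bg, alpha=0.01)  # k = 10
upl95 = chebyshev_upl(bg, alpha=0.05)  # k ~ 4.47
```

The k = 10 multiplier makes the 99% limit far above the BG data themselves, which is the price of a bound that holds for every finite-variance distribution; it is meant to catch only extreme "hot spot" values.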
5.0 Recommendations

As with all statistical analyses, the first step should be to plot the data in various ways in order to become acquainted with its large-scale features. This is also good insurance against data entry errors and against errors in applying or interpreting statistical tests. A couple of example data plots appear on the pages following the text of this paper; they were created using S-PLUS 4.5. From the first plot, it is evident that chromium (Cr) and manganese (Mn) concentrations are more closely related in the site dataset than they are in the BG dataset. This could be explored using various multivariate statistical techniques, cross-variogram analysis (a spatial statistics technique), geochemistry (if a mineralogical analysis of the soils has been done), and comparison to chemical analyses of the wastes that may have impacted the site. Another noteworthy feature of the data plots is that, except for the two site locations that are clearly higher than BG, the site data show substantially less spread for Cr and Mn, and a tighter association between Cr and Mn, than do the BG data. It is often the case that BG sample locations tend to be more widely spaced than site sample locations. Spatially correlated regional variation, combined with larger average sample location spacing for BG, can easily explain this phenomenon. This phenomenon may invalidate the shift model, but not the stochastic ordering model.

The second step is to apply statistical screening. We recommend a three-pronged approach to screening, testing simultaneously with:

- the Mann-Whitney-Wilcoxon test, to detect widespread low-level anthropogenic contamination (the shift alternative or, more generally, stochastic ordering),
- Rosenbaum's two-sample location test, to detect localized higher-concentration anthropogenic contamination (stochastic ordering in the right tails), and
- a UPL for n additional observations based on Chebyshev's inequality, to protect against extreme "hot spots".
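A minimal driver for this three-pronged screen might look like the following sketch. All function names and the data are illustrative, the Bonferroni split of 0.02/0.02/0.01 follows the recommendation in this section, the Mann-Whitney-Wilcoxon test uses its large-sample normal approximation, and ties are assumed absent:

```python
from math import comb, sqrt
from statistics import NormalDist

def mww_reject(bg, site, alpha):
    """One-sided Mann-Whitney-Wilcoxon test via the large-sample normal
    approximation (assumes no ties)."""
    m, n = len(bg), len(site)
    ranked = sorted(bg + site)
    w = sum(ranked.index(v) + 1 for v in site)
    mean = n * (m + n + 1) / 2
    sd = sqrt(m * n * (m + n + 1) / 12)
    return (w - mean) / sd > NormalDist().inv_cdf(1 - alpha)

def rosenbaum_reject(bg, site, alpha):
    """Exact one-sided test based on T = #{site values > max(bg)}."""
    m, n = len(bg), len(site)
    t_obs = sum(v > max(bg) for v in site)
    tail = sum(comb(m + n - t - 1, n - t)
               for t in range(t_obs, n + 1)) / comb(m + n, n)
    return tail <= alpha

def chebyshev_flag(bg, site, alpha):
    """Flag any site value above the Chebyshev-based UPL."""
    m = len(bg)
    xbar = sum(bg) / m
    s = sqrt(sum((v - xbar) ** 2 for v in bg) / (m - 1))
    upl = xbar + (1 / sqrt(alpha)) * sqrt((m + 1) / m) * s
    return any(v > upl for v in site)

def screen(bg, site):
    # Bonferroni split 0.02 + 0.02 + 0.01 = 0.05 overall.
    return (mww_reject(bg, site, 0.02)
            or rosenbaum_reject(bg, site, 0.02)
            or chebyshev_flag(bg, site, 0.01))

# Hypothetical data: a clearly shifted site should be flagged;
# one that matches BG should not.
bg = [10.1, 11.3, 9.7, 12.0, 10.8, 11.6, 9.9, 10.4, 11.0, 10.6,
      12.3, 9.5, 10.9, 11.8, 10.2]
site_hi = [14.2, 15.1, 13.8, 16.0, 14.9, 15.5, 13.6, 14.4, 15.8, 14.7]
site_ok = [10.15, 10.55, 9.85, 11.25, 10.75, 11.45, 9.65, 10.35, 11.05, 10.45]
```

An analyte is retained as a COPC if any prong fires; the disjunction is what makes the Bonferroni accounting of the individual levels appropriate.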
We recommend using the Bonferroni inequality to set the level of the combined procedure. For an overall level of α_comb = 0.05, we recommend using α = 0.02 for the Mann-Whitney-Wilcoxon test, α = 0.02 for Rosenbaum's two-sample location test, and α = 0.01 for a (1 − α)100% = 99% UPL based on Chebyshev's inequality. The levels of the individual tests then sum to 0.05, giving an approximate overall level of 0.05.

6.0 References

Cochran, W.G. (1977). Sampling Techniques, Third Edition. New York, John Wiley & Sons, 428 pp.

Cressie, N.A.C. (1991). Statistics for Spatial Data. New York, John Wiley & Sons, 900 pp.

de Gruijter, J.J. and ter Braak, C.J.F. (1990). Model-Free Estimation from Spatial Samples: A Reappraisal of Classical Sampling Theory. Mathematical Geology, Vol. 22, pp. 407-415.

Isaaks, E. and Srivastava, R.M. (1989). An Introduction to Applied Geostatistics. Oxford University Press, New York, 561 pp.

Journel, Andre (1983). Nonparametric Estimation of Spatial Distributions. Mathematical Geology, Vol. 15, pp. 445-467.

Lehmann, Erich (1953). The Power of Rank Tests. Annals of Mathematical Statistics, Vol. 24, pp. 23-43.

Randles, R.H. and Wolfe, D.A. (1991). Introduction to the Theory of Nonparametric Statistics. Krieger Publishing Co., Malabar, Florida, 450 pp.

Rosenbaum, S. (1954). Tables for a Nonparametric Test of Location. Annals of Mathematical Statistics, Vol. 25, pp. 146-150.

Sukhatme, Shashikala (1992). Powers of Two-Sample Rank Tests Under the Lehmann Alternatives. The American Statistician, Vol. 46, pp. 212-214.

Figure 3: Example Site/BG Scatterplot (Cr vs. Mn, mg/kg, site and BG data)

Figure 4: Example Site/BG Boxplots (Mn and Cr by location, mg/kg, site and BG data)