# EES Lehmann Alternatives-5-14-20

```DRAFT
Nonparametric Statistical Testing for Baseline Risk Assessment Screening
1.0 Introduction
The purpose of baseline risk assessment (BRA) screening at a possibly contaminated site is to
focus investigatory resources. An essential aspect of this is to determine which analytes merit
further study in the risk assessment process; that is, to identify the chemicals of potential concern (COPC) for the site. The COPC list begins as a very large list. It is whittled down based
on the results of BRA screening. Analytes that are not detected in any samples are dropped
from further consideration. Analytes that are known to be common laboratory contaminants
and for which there is evidence that all apparent detections may be due to laboratory contamination are considered for removal from the list of COPCs. Analytes that are known to be naturally occurring and appear not to have elevated site concentrations relative to background
(BG) are removed from the COPC list. This is the main point of application of statistical testing to the problem of BRA screening, although one could certainly argue that statistical methodology should be applied to the problem of screening common laboratory contaminants.
Two important features of statistical testing are the false positive rate and the false negative rate.
The false positive rate $\alpha$ is the probability of making a false positive (Type I) error. In the setting of BRA screening, site concentrations are assumed to come
from the same population as BG unless there is sufficient evidence to the contrary. A false
positive error is the false conclusion that site concentrations of the analyte tend to be higher
than BG concentrations. The false positive rate is the client’s risk of having to perform unnecessary laboratory testing (and perhaps cleanup). The false negative rate $\beta$ is the probability of
making a false negative (Type II) error. In the setting of BRA screening, a false negative error
is the false conclusion that site concentrations of the analyte do not tend to be higher than BG
concentrations and that, consequently, the analyte does not merit further consideration in risk
assessment or remediation.
1.1 Some Notation
This section briefly discusses some concepts and notation needed.
The shift or difference model is:
$X_i \sim F(x)$, iid for $i = 1, \ldots, m$,
$Y_j \sim F(y - \Delta)$, iid for $j = 1, \ldots, n$, and
$X_i$ is independent of $Y_j$ for all $i$ and $j$.    [1.1]
F() is a probability distribution function1. The symbol “~” in the first line above denotes that
F() is the common probability distribution function of the Xi. Also, “iid” means “independent
and identically distributed”.
A null hypothesis is a statement regarded as true unless there is sufficient evidence to the contrary. It is typically denoted as $H_0$. The alternative hypothesis, $H_1$, is the statement accepted
when $H_0$ is rejected. The size or false positive rate ($\alpha$) of a test is the probability of rejecting
$H_0$ when $H_0$ is actually true. The false negative rate ($\beta$) of a test is the probability of failing to
reject $H_0$ when $H_0$ is actually false. The Mann-Whitney-Wilcoxon Test is often used under
the shift model to test
$H_0: \Delta = 0$ versus
$H_1: \Delta > 0$.    [1.2]
The power of a test is the probability that $H_0$ is rejected given that a specific simple alternative
$H_1^*$ is true, which of course means that $H_0$ is false. The power of a test is a function of the parameter value in a simple alternative hypothesis. The power of a test is $1 - \beta$, where $\beta$ is understood to depend on the alternative parameter value.
The alternative hypothesis here is a composite alternative; more than one value of $\Delta$ is consistent with $H_1$. By contrast, a simple alternative hypothesis specifies a single value of the parameter. An example of a simple alternative hypothesis is $H_1^*: \Delta = 1$, where the distribution
F() is known. If F() is unknown under the shift model, the alternative hypothesis $H_1^*$ is no
longer simple; since the space of distribution functions is infinite dimensional, the parameter
space represented by $\Delta = 1$ with F() unknown is actually infinite dimensional with F() as an
infinite dimensional nuisance parameter.
2.0 What Kinds of Tests Are Appropriate?
The observations of site or BG data are often positively spatially dependent: results from
nearby sample locations (at the same depth) tend to be more similar than results from distant
sample locations. Near a “hot spot”, observations of site data may instead be negatively spatially dependent over a short distance, so that results from nearby sample locations (at the same
depth) vary widely. Generally, environmental data (site or BG data) are neither independent nor identically distributed. However, under certain sampling designs and for certain
purposes we may justify the use of the iid model.
Let F() represent the probability distribution of measurements obtained at random locations in
a designated area A. Then F() is known as the global distribution function on the area A
(Isaaks & Srivastava, chapter 18) or the spatial distribution (Journel). It is the mixture of the
probability distribution functions at each point in area A. The results of sampling at random
locations in area A, when location is ignored, behave as iid observations from the global distribution F(). The global distribution is a mixture distribution. This is apparent if the local distributions at each point in area A are not identical. The distribution of measurements obtained
at random locations in a designated area A is also a mixture distribution. Therefore, for testing
global hypotheses concerning the area A or estimation of global parameters of the area A, the
ordinary statistical procedures used in the iid case can be used correctly, although they may
not be very efficient compared to other procedures.
Footnote 1: A probability distribution function is a nondecreasing, right-continuous function such that $F(-\infty) = 0$ and $F(\infty) = 1$.
If the quantity of interest has spatial dependence over area A, then
• estimation based on sampling with a randomized regular grid is more efficient (gives estimates with smaller variance) than estimation based on simple random selection
of sampling locations (Cochran, pp. 227-228) and
• an improved estimate of the global distribution function may be obtained from data collected
by sampling at random locations by using spatial declustering weights (Isaaks & Srivastava, chapter 18).
For sampling on a randomized regular grid, the spatial declustering weights are all equal (see footnote 2), like
the weights in the ordinary empirical distribution function. This argues that the empirical distribution function based on observations from a randomized regular grid converges more
quickly to the global distribution function than does the empirical distribution function based
on sampling at random locations. Here we are not considering averaging over an infinite
number of realizations of a spatial random process but are only considering the realization at
hand. The observations are indeed random variables in this case because the sampling locations are randomized and the observations are subject to random collection and measurement
errors. They are also spatially dependent because their means are a function (albeit an unknown function) of location.
It is well known that in the iid case, the empirical distribution function converges to the true
distribution with probability 1 (wp1) as the number of observations goes to infinity. It is also
the case that the empirical distribution function based on sampling at random locations in area
A converges to the global distribution function wp1 as the number of observations goes to infinity. Since the area A is fixed and bounded, this is convergence under infill asymptotics. Because
sampling on a randomized regular grid is more efficient than sampling at random locations, it
must also be true that the empirical distribution function based on sampling from a randomized regular grid over area A converges to the global distribution function at an even faster
rate. This means it behaves like the empirical distribution function based on sampling at a
larger number of random locations in area A. This is certainly worth further investigation.
Therefore sampling on randomized regular grids is virtually always to be preferred over sampling at random locations. Note that this behavior (of approximating the global distribution)
may be quite different under increasing domain asymptotics or considering multiple realizations of the spatial random process, where ergodicity of the process comes into play. However, in these cases also, sampling on randomized regular grids is the preferred method of sampling.
Footnote 2: This is true for global kriging weights and for polygon declustering weights, except perhaps for points on the convex hull of the sample grid. Cell declustering weights for a regular grid should be very close to equal.
Statistics based on sample moments or ranks are functions of the empirical distribution function of the sample. Hence, tests based on such statistics should behave as if they were computed from a larger number of observations. Therefore their actual significance level ($\alpha$)
should be slightly lower than nominal and their actual power should be slightly higher than
nominal. This by no means excludes the construction of more powerful tests and more efficient estimators by explicitly using the spatial nature of the data.
On the other hand, the entire population of possible samples on a fixed site is finite, although it
is quite large. Most common statistical procedures, classical, Bayesian or geostatistical, assume an infinite population model, which is a good approximation to the real situation. Proponents of the nonparametric design-based approach to sampling (Cochran; de Gruijter & ter
Braak) argue that the classical sampling theory for finite populations, as developed by Neyman, does not require independence. Viewing the realization of area A as fixed and finite,
tests and estimates appropriate under finite sampling theory should be serviceable for comparison to BG. The improvement due to sampling on a randomized regular grid is presented by
Cochran as equivalent to the improvement due to cluster sampling.
3.0 Power of Statistical Tests
3.1 Sample Size Planning
Before any sampling is conducted, sample size (that is, number of sample locations) determinations must be made, both for BG and for the sites being studied. Desired power against
specified alternatives is an essential element of sample size determination for comparison to
BG. We will use sample size determination for the two-sample t-test as an example.
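Let
$X_1, \ldots, X_m \sim \text{iid } N(\mu_0, \sigma^2)$ and
$Y_1, \ldots, Y_n \sim \text{iid } N(\mu_1, \sigma^2)$.    [3.1]
Then the sample means are distributed as:
$\bar{X}_m = \frac{1}{m}\sum_{i=1}^{m} X_i \sim N(\mu_0, \sigma^2/m)$ and
$\bar{Y}_n = \frac{1}{n}\sum_{j=1}^{n} Y_j \sim N(\mu_1, \sigma^2/n)$.    [3.2]
Also, $\bar{X}_m$ is independent of $\bar{Y}_n$. Consequently,
$\bar{Y}_n - \bar{X}_m \sim N\left(\Delta, \left(\tfrac{1}{m} + \tfrac{1}{n}\right)\sigma^2\right)$, where $\Delta = \mu_1 - \mu_0$.    [3.3]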
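Then the shift hypotheses are:
$H_0: \Delta = 0 \;\; (\mu_1 = \mu_0)$ versus
$H_1: \Delta > 0 \;\; (\mu_1 - \mu_0 > 0)$.
Define $T = \dfrac{\bar{Y}_n - \bar{X}_m}{\sigma\sqrt{\frac{1}{m} + \frac{1}{n}}} = \sqrt{\dfrac{mn}{m+n}}\,\dfrac{\bar{Y}_n - \bar{X}_m}{\sigma}$. Then $T \sim N\left(\sqrt{\tfrac{mn}{m+n}}\,\Delta/\sigma,\; 1\right)$, and
under $H_0$, $T \sim N(0, 1)$. To test $H_0$ versus $H_1$, we reject $H_0$ for large values of $T$.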
The significance level (Type I error rate) of the test is determined by the equation
$P(T > z_{1-\alpha} \mid H_0) = \alpha$. For a specified $\Delta^* > 0$, let $H_1^*$ denote the hypothesis that $\Delta = \Delta^*$. The
power of the test of $H_0$ versus $H_1^*$ is determined by the equation:
$P(T > z_{1-\alpha} \mid H_1^*) = P\left(Z + \sqrt{\tfrac{mn}{m+n}}\,\delta > z_{1-\alpha}\right) = P\left(Z > z_{1-\alpha} - \sqrt{\tfrac{mn}{m+n}}\,\delta\right) = 1 - \Phi\left(z_{1-\alpha} - \sqrt{\tfrac{mn}{m+n}}\,\delta\right) = 1 - \beta$,
where $Z$ is a standard normal random variable, $\Phi$ is its distribution function, and $\delta = \Delta^*/\sigma$.
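Then
$\Phi\left(z_{1-\alpha} - \sqrt{\tfrac{mn}{m+n}}\,\delta\right) = \beta \;\Rightarrow\; z_{1-\alpha} - \sqrt{\tfrac{mn}{m+n}}\,\delta = z_{\beta} = -z_{1-\beta} \;\Rightarrow\; \sqrt{\tfrac{mn}{m+n}}\,\delta = z_{1-\alpha} + z_{1-\beta} \;\Rightarrow\; \dfrac{mn}{m+n} = \dfrac{(z_{1-\alpha} + z_{1-\beta})^2}{\delta^2}$.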
Actually, we do not know $\sigma$ and must estimate it with $S$. An approximate but accurate
correction for this is given in EPA QA/G4, page 63. Using this correction, we get
$\dfrac{mn}{m+n} = \dfrac{(z_{1-\alpha} + z_{1-\beta})^2}{\delta^2} + 0.5\,z_{1-\alpha}^2$.
Note that $\delta = \Delta/\sigma = (\Delta/\mu_0)/(\sigma/\mu_0) = f/\mathrm{CV}$, where $f$ is the fraction increase over BG that
we wish to detect with probability $1-\beta$, and CV is the coefficient of variation (relative standard
deviation) for BG. Then we have
$\dfrac{mn}{m+n} = \dfrac{\mathrm{CV}^2 (z_{1-\alpha} + z_{1-\beta})^2}{f^2} + 0.5\,z_{1-\alpha}^2$.
For instance, we may want to plan to be able to detect a 50% increase above the BG mean with a probability
(level of confidence) of at least 0.8. Then we have $\beta = 0.2$ and $\delta = 0.5/\mathrm{CV}$.
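Now we may wish to use equal sample sizes for site and BG; that is, $m = n$. Then we have
$\dfrac{mn}{m+n} = \dfrac{n}{2} = \dfrac{\mathrm{CV}^2 (z_{1-\alpha} + z_{1-\beta})^2}{f^2} + 0.5\,z_{1-\alpha}^2 \;\Rightarrow\; m = n = \dfrac{2\,\mathrm{CV}^2 (z_{1-\alpha} + z_{1-\beta})^2}{f^2} + z_{1-\alpha}^2$.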
On the other hand, if we fix the number of BG samples at $m = M$, we must also fix a lower
bound for $n$, say $n_0 = 7$. Then we have
$\dfrac{Mn}{M+n} = \dfrac{\mathrm{CV}^2 (z_{1-\alpha} + z_{1-\beta})^2}{f^2} + 0.5\,z_{1-\alpha}^2$.
For convenience, set
$\nu(\alpha, \beta) = \dfrac{\mathrm{CV}^2 (z_{1-\alpha} + z_{1-\beta})^2}{f^2} + 0.5\,z_{1-\alpha}^2$.
Then we get
$\dfrac{Mn}{M+n} = \nu(\alpha, \beta) \;\Rightarrow\; n = \dfrac{M\,\nu(\alpha, \beta)}{M - \nu(\alpha, \beta)}$, provided $M > \nu(\alpha, \beta)$.
Finally, we estimate the number of site samples by $n^* = \max\left\{n_0, \left\lceil \dfrac{M\,\nu(\alpha, \beta)}{M - \nu(\alpha, \beta)} \right\rceil\right\}$,
where $\lceil x \rceil$ means the smallest integer that is greater than or equal to $x$. By the Central Limit Theorem,
the distribution of the mean of our data will be asymptotically normal. So, even if the data are
non-normal, these sample size calculations should be at least approximately correct.
For purposes of planning when the Mann-Whitney-Wilcoxon test is to be used for comparison
to BG, we make use of the asymptotic relative efficiency (ARE) of the Mann-Whitney-Wilcoxon test relative to the two-sample t-test. When the underlying distribution is normal,
this ARE is about 0.955 ($3/\pi$ exactly). So we need about 20 samples for every 19 that we calculate, or $n_w = \lceil 1.05\,n^* \rceil$.
In summary, the power of a statistical test is something that absolutely must be planned up
front as part of the sampling design and incorporated into the sampling plan (Field Sampling
Plan, Data Acquisition Plan or QAPP). This requires decisions about
• the level of confidence we wish to have against false positive errors ($1-\alpha$),
• the specific alternatives which we wish to protect against ($H_1^*$) and
• the power or level of confidence we wish to have against false negative errors ($1-\beta$).
Finally, the decisions listed above must be balanced against the costs of sampling and analysis
and the costs of decision errors.
3.2 Retrospective
After the data have been collected and analyzed, the statistical tests performed, and the decisions made, it is possible to go back to the data to estimate the power achieved by the statistical tests. Although this sort of retrospective look at power is controversial among statisticians,
it has been presented by USEPA as part of verifying the achievement of the Data Quality Objectives (DQO) goals for the project. This is demonstrated for the two-sample t-test in EPA
QA/G9, pages 3 through 6 of section 3.3. Verifying that the correct statistical tests have been
used is also part of this process.
The DQO process is a planning process developed by USEPA and mandated for use in all
projects in which environmental measurements are collected. For further information, refer to
the relevant USEPA documents (EPA Order 5360.1, EPA QA/G4, EPA QA/G9).
4.0 Statistical Procedures
4.1 Properties of the Mann-Whitney-Wilcoxon Test
The Mann-Whitney-Wilcoxon test has two equivalent forms of the test statistic. The Wilcoxon form is the W statistic that is widely used. W is the sum of the ranks of the Y observations
when the ranking is done over all the data.
The Mann-Whitney form is
$U = \dfrac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} I(X_i < Y_j)$ = (fraction of time that an $X_i$ is less than a $Y_j$), where
$I(A) = 0$ if the event $A$ does not occur and $I(A) = 1$ if the event $A$ occurs; that is, $I$ is the indicator function.
The expected value of $U$ is $E[U] = P(X < Y)$, the probability that a randomly selected value
from the X population is less than a randomly selected value from the Y population. W and U
are related by $W = mnU + \dfrac{n(n+1)}{2}$.
We can therefore use the Mann-Whitney-Wilcoxon statistic to test the stochastic ordering hypotheses
$H_{0so}: P(X < Y) = 1/2$ versus
$H_{1so}: P(X < Y) > 1/2$ or versus
$H_{1so}^*: P(X < Y) = 2/3$, for instance.    [4.1]
Stochastic ordering is defined by:
Let X be a random variable with a probability distribution function F() and Y a random
variable with a probability distribution function G(). Then Y is stochastically larger than
X if $F(t) \ge G(t)$ for all t, with strict inequality for at least one t (see footnote 3).
So a graph of G(t) against F(t) would be on or below (no part above and at least some part
strictly below) the diagonal. H0so says that Y observations (the site measurements) do not tend
to be larger than X observations (the BG measurements). H1so says that Y observations (the
site measurements) do tend to be larger than X observations (the BG measurements). H1so*
says that 2 out of 3 times (on average) a randomly selected site measurement will be larger
than a randomly selected BG measurement. These hypotheses are more general than the shift
(difference) hypotheses and make more sense for comparison to BG.
One way to calculate the power of the Mann-Whitney-Wilcoxon test for testing $H_{0so}$ against
the alternative $H_{1so}^*$ is to use the Two Sample U-Statistic Theorem (Randles & Wolfe, p. 92)
to find a normal approximation for the standardized distribution of W under $H_{1so}^*$. Under this
theorem, $\dfrac{\sqrt{m+n}}{mn}\left(W - mn\gamma - \dfrac{n(n+1)}{2}\right)$ is approximately normally distributed with mean 0 and variance
$\sigma_{m,n}^2 = \dfrac{\xi_{1,0}}{\lambda_{m,n}} + \dfrac{\xi_{0,1}}{1 - \lambda_{m,n}}$, where $\lambda_{m,n} = \dfrac{m}{m+n}$,
$\xi_{1,0} = P(X_1 < \min(Y_1, Y_2)) - \gamma^2$,
$\xi_{0,1} = P(\max(X_1, X_2) < Y_1) - \gamma^2$, and
$\gamma = P(X_1 < Y_1)$.    [4.2]
Footnote 3: Randles & Wolfe, pp. 130-131.
In order to get a reasonable estimate of $\sigma_{m,n}^2$ for a practical problem, one must specify distributions of X and Y that are reasonable for the problem. In nonparametric procedures, the null
distributions of the test statistics do not depend on the underlying distribution of the data, but
the alternative distributions typically do. Unfortunately, even for the simplest choices of distributions for X and Y under the alternative $H_{1so}^*$, the calculation of $\sigma_{m,n}^2$ is involved. For example, choosing the X distribution to be the continuous uniform distribution on the interval
[0,1] (written as U[0,1]) and choosing the Y distribution to be U[1/3, 4/3], so that P(X<Y) =
2/3, it takes almost a page of theoretical calculations to determine that $\xi_{1,0} = \xi_{0,1} = 2/81$.
By contrast, under $H_{0so}$, $\xi_{1,0} = \xi_{0,1} = 1/12$. For nonsymmetric distributions (which are typical of
environmental data) $\xi_{1,0}$ and $\xi_{0,1}$ are not equal. Often, $\sigma_{m,n}^2$ is estimated by Monte Carlo
simulation. Fortunately, there is another way to calculate the power of the Wilcoxon and other
rank tests exactly, under a restricted family of alternative hypotheses known as the Lehmann
alternatives. The Lehmann alternatives are discussed in section 4.2.
Despite the difficulties of calculating its power, the one-sided Mann-Whitney-Wilcoxon test is
a very good test and in most cases a superior alternative to the two-sample t-test. It is unbiased and consistent against shift alternatives and stochastic ordering alternatives. Since it is
nonparametric (see footnote 4) and distribution-free (see footnote 5), the actual Type I error level is always equal to the nominal
level. When the data are from normal populations, the power of the Mann-Whitney-Wilcoxon
test is almost as great as that of the two-sample t-test. For continuous distributions, the power
of the Mann-Whitney-Wilcoxon test may be much better than but is never much less than the
power of the two-sample t-test (Randles & Wolfe, pp. 117-119, 163-171).
Both the shift and stochastic ordering alternatives are important in BRA screening. Detection
of site “hot spots” or high concentrations (relative to BG) is also important in BRA screening.
The right-shift alternative corresponds to widespread (that is, high probability) anthropogenic (see footnote 6)
impact at a relatively constant level. The “hot spot” alternative corresponds to a high concentration impact over a small area (low probability). The stochastic ordering alternative encompasses both of these alternatives and intermediate cases as well. Testing for a “hot spot” alternative is essentially equivalent to testing for stochastic ordering in the right tails of the distributions.
Although the Mann-Whitney-Wilcoxon test is consistent against all stochastic ordering alternatives, it was not designed to detect stochastic ordering in the right tails and has relatively
low power against this sort of alternative. Consequently, another test is needed to augment the
Mann-Whitney-Wilcoxon test in order to detect “hot spots”.
Footnote 4: Nonparametric indicates testing hypotheses that are not a function of a parameter specific to a particular distributional form.
Footnote 5: Distribution-free indicates that under the null hypothesis, the distribution of the test statistic does not depend on
the underlying distribution of the data.
Footnote 6: Anthropogenic means caused or created by man.
Fortunately, both the shift and stochastic ordering in the right tail alternatives can be formulated as Lehmann alternatives. Lehmann alternatives and these formulations are discussed in
the next section.
4.2 Lehmann Alternatives
Consider the following model and compare it to the shift (difference) model in section 1.1:
$X_i \sim F(x)$, independent for $i = 1, \ldots, m$,
$Y_j \sim g(F(y))$, independent for $j = 1, \ldots, n$,    [4.3]
where g() is a nondecreasing function on [0,1] such that g(0) = 0 and g(1) = 1. The hypotheses of interest for comparison to BG may be posed as
$H_0: g(u) = u$ versus
$H_1: g(u) = g^*(u)$,    [4.4]
where g*() is constructed to satisfy the alternative of interest. Any g*() such that $g^*(u) \le u$,
with strict inequality for some u, specifies some version of stochastic ordering.
The nice thing about the Lehmann alternatives is that, as long as F() is continuous, the distribution of any rank statistic (see footnote 7) under the alternative hypothesis does not depend on F() but only
on g*(). Thus, the distribution of any rank statistic under a Lehmann alternative is nonparametric and distribution-free. Consider two cases of Lehmann alternatives:
(Case A): For the shift (difference) alternative, $g^*(u; \Delta) = \dfrac{u - \Delta}{1 - \Delta}$, $0 \le \Delta \le u \le 1$, is a reasonable
construction for non-negative data. The hypotheses for the shift model become
$H_0: \Delta = 0$ versus
$H_1: \Delta > 0$.
For $\Delta = 0.1$, for instance, $g^*(u; 0.1)$, which represents the
specific alternative hypothesis $H_1^*: \Delta = 0.1$, has the graph
shown in Figure 1. The dashed line is the graph of g(u) = u,
which represents $H_0$. Note that, in line with the definition of
stochastic ordering, the graph of $g^*(u; 0.1)$ lies below the
diagonal.
[Figure 1: Example Case A Lehmann Alternative. Both axes run from 0 to 1.]
If one has in mind a specified shift, say $\delta^*$, that one wishes
to detect with specified probability, and a specific distribution F() for BG (modeled from the data perhaps), then $\Delta^* = F(\delta^*)$ gives what is in essence a
simple alternative hypothesis against which to evaluate power. In the case of the exponential
distribution with mean $\mu$, $\Delta$ produces a shift of $\delta = -\mu\ln(1-\Delta)$, which is a convenient multiple
of the mean.
Footnote 7: A rank statistic is a statistic that is a function only of the ranks of the data.
Lehmann (Lehmann 1953) used Hoeffding’s Lemma to calculate the distribution of rank-based tests under Lehmann alternatives. The power of the Mann-Whitney-Wilcoxon test under
the Case A Lehmann alternative is most easily derived by computing the mean and asymptotic
variance of W under the alternative and using the normal approximation given by the Two
Sample U-Statistic Theorem. We have
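$E_1[U] = \gamma = (1 + \Delta)/2$,    [4.5]
$E_1[W] = mn\gamma + \dfrac{n(n+1)}{2} = \dfrac{n(m+n+1) + mn\Delta}{2}$,    [4.6]
$\xi_{1,0} = \Delta + \tfrac{1}{3} - \gamma^2 = \tfrac{1}{12}\left(1 + 6\Delta - 3\Delta^2\right)$, and    [4.7]
$\xi_{0,1} = \Delta^2 + \tfrac{1}{3}(1 - \Delta)^2 + \Delta - \gamma^2 = \tfrac{1}{12}\left(1 - 2\Delta + 13\Delta^2\right)$.    [4.8]
Then $\dfrac{\sqrt{m+n}}{mn}\left(W - \dfrac{n(m+n+1) + mn\Delta}{2}\right)$ is approximately normally distributed with mean 0 and variance
$\sigma_{m,n}^2 = \dfrac{\xi_{1,0}}{\lambda_{m,n}} + \dfrac{\xi_{0,1}}{1 - \lambda_{m,n}} = (m+n)\left[\dfrac{1 + 6\Delta - 3\Delta^2}{12m} + \dfrac{1 - 2\Delta + 13\Delta^2}{12n}\right]$.    [4.9]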
So, for instance, if we take $m = 15$, $n = 10$, and $\Delta = 0.5$, we have
$E_1[W] = \dfrac{n(m+n+1) + mn\Delta}{2} = \dfrac{10(15+10+1) + 15 \cdot 10 \cdot 0.5}{2} = 167.5$, and
$\sigma_{m,n}^2 = 25\left[\dfrac{1 + 6(0.5) - 3(0.25)}{12 \cdot 15} + \dfrac{1 - 2(0.5) + 13(0.25)}{12 \cdot 10}\right] = 0.8285$.
The 0.05 critical value of W, with m = 15 and n = 10, is 160, so we estimate the power to be
$P_1(W \ge 160) = 1 - \Phi\left(\dfrac{\sqrt{15+10}}{15 \cdot 10} \cdot \dfrac{160 - 167.5}{\sqrt{0.8285}}\right) = 1 - \Phi(-0.275) = 0.608$.
(Case B): For the stochastic ordering in the right tails alternative, one very reasonable construction has $g^*(u; \varepsilon) = (1 - \varepsilon)u + \varepsilon u^h$ for $0 \le \varepsilon \le 1$ and $h > 1$. This is a mixture distribution representing BG interspersed with areas of elevated contamination; that is, moderate “hot
spot” contamination. This interpretation is particularly appropriate if random or regular grid
sampling has been used. For instance, $\varepsilon = 0.5$ represents the hypothesis that half of the area is
impacted by moderately higher concentrations.
The graph of $g^*(u; \varepsilon)$ for the case with $\varepsilon = 0.2$ and $h = 4$ is
shown in Figure 2. The dashed line is the graph of g(u) = u,
which represents $H_0$. Fixing $\varepsilon$, and noting that $g^*(u; \varepsilon)$ reduces to $u$ when $h = 1$, the hypotheses for the “hot
spot” contamination model become
$H_0: h = 1$ versus
$H_1: h > 1$,
which is very similar to the shift hypothesis case.
[Figure 2: Example Case B Lehmann Alternative. Both axes run from 0 to 1.]
4.3 Rosenbaum’s Two-Sample Location Test
A version of Rosenbaum’s two-sample location test statistic is T, the number of Y values larger than the largest X value. This statistic is very useful for testing for stochastic ordering in the
right tails. The null distribution of T is
$P_0(T = t) = \dbinom{m+n-t-1}{n-t} \Big/ \dbinom{m+n}{n}$, $t = 0, 1, \ldots, n$,    [4.10]
where m is the number of BG samples, n is the number of site samples, and $\binom{m}{t}$ denotes the
number of ways that m objects can be selected t at a time. The critical value for the level-$\alpha$
test is the smallest integer C such that $P_0(T \ge C) = \sum_{t=C}^{n} P_0(T = t) = \sum_{t=C}^{n} \dbinom{m+n-t-1}{n-t} \Big/ \dbinom{m+n}{n} \le \alpha$.
Sukhatme (Sukhatme 1992) has demonstrated a simpler method of computing the power of
rank-based tests under Lehmann alternatives, using distribution-free exceedance statistics. Sukhatme’s methodology and results were used to derive the results on power presented below.
A copy of a preprint of Sukhatme’s paper is attached to this report.
The power of Rosenbaum’s two-sample location test under the Case B Lehmann alternative is
given by $1 - \beta = P_1(T \ge C) = \sum_{t=C}^{n} P_1(T = t) = (1 - \varepsilon)\sum_{t=C}^{n} hm\dbinom{n}{t} B(hm + t,\; n - t + 1)$, where
B(a,b) is the Beta function, defined by $B(a, b) = \Gamma(a)\,\Gamma(b)/\Gamma(a + b)$, and $\Gamma()$ is the gamma
function, defined by $\Gamma(a) = \int_0^{\infty} u^{a-1} e^{-u}\,du$.
Under the “hot spot” model, high power can only be achieved if a substantial fraction of the
area is contaminated. This is true for any rank-based test and arises from the term $(1 - \varepsilon)$,
which will appear in the formula for the power against the BG with “hot spot” contamination
model. Nevertheless, Rosenbaum’s test is certainly more sensitive to high values than the
Mann-Whitney-Wilcoxon test is. However, it does not provide protection against the situation
in which a relatively small fraction of the area is contaminated but at high concentrations.
This alternative can be protected against in a nonparametric and asymptotically distribution-free manner by using a probability inequality that holds for all distributions. This is discussed
in the next section.
4.4 Chebyshev’s Inequality
Chebyshev’s inequality is probably the most familiar probability inequality. It states that, for
any random variable X with mean $\mu$ and variance $\sigma^2$, $P(|X - \mu| \ge \kappa\sigma) \le 1/\kappa^2$ for $\kappa > 0$. Of
course this is only useful for $\kappa > 1$. Given a sample of m BG observations $X_1, \ldots, X_m$ iid from
an unknown distribution F() with finite variance $\sigma^2$ and given Y, a new independent observation from F(), we have $E[Y - \bar{X}] = 0$ and $\mathrm{Var}[Y - \bar{X}] = \frac{m+1}{m}\sigma^2$. From this, we obtain
$P\left(|Y - \bar{X}| \ge \kappa\sqrt{\tfrac{m+1}{m}}\,\sigma\right) \le 1/\kappa^2$ for $\kappa > 0$. Setting $\kappa = 10$ gives
$P\left(Y - \bar{X} \ge 10\sqrt{\tfrac{m+1}{m}}\,\sigma\right) \le P\left(|Y - \bar{X}| \ge 10\sqrt{\tfrac{m+1}{m}}\,\sigma\right) \le 0.01$, which holds for all probability distributions with finite variance.
In this class of distributions, $\bar{X}$ is a consistent estimator of $\mu$, and $S = \sqrt{\frac{1}{m-1}\sum_{i=1}^{m}(X_i - \bar{X})^2}$ is
a consistent estimator of $\sigma$. Then $C_n = \bar{X} + 10\sqrt{\tfrac{m+1}{m}}\,S$ is a consistent estimator of $\mu + 10\sigma$, and
$P\left(X > \bar{X} + 10\sqrt{\tfrac{m+1}{m}}\,S\right)$ converges in probability to $P(X > \mu + 10\sigma)$, by Slutsky’s Theorem.
Therefore, $C_n$ may be taken as an estimated upper bound to a 99% confidence Upper Prediction Limit (UPL) for BG observations. It provides a nonparametric and asymptotically distribution-free upper limit with protection against extreme “hot spots” that rank-based procedures
cannot provide. This works because the limiting inequality,
$P\left(Y > \bar{X} + 10\sqrt{\tfrac{m+1}{m}}\,S\right) \xrightarrow{P} P(X > \mu + 10\sigma) = \alpha_F \le 0.01$ as $m \to \infty$ (where $\alpha_F$ depends on the BG distribution F),
holds for all BG distributions F. To set the multiplier $\kappa$ (10 in $C_n$) to achieve a specified (conservative)
level $\alpha$, set $\kappa = 1/\sqrt{\alpha}$. For instance, $10 = 1/\sqrt{0.01}$, $6.32 = 1/\sqrt{0.025}$, and $4.47 = 1/\sqrt{0.05}$.
Applying this limit to each of n independent site observations, the overall false positive rate is
$\alpha_o = 1 - (1 - \alpha_F)^n \le 1 - (1 - \alpha)^n$. For $n = 10$ and $\alpha = 0.01$, $\alpha_o \le 0.096 \le 0.1$. Therefore,
$1 - n\alpha \le (1 - \alpha)^n$ is a lower bound for the confidence coefficient of $C_n$ as a UPL for n additional BG observations.
Other conservative 95% Upper Prediction Limits for BG observations can be derived from
other probability inequalities. In particular, one based on Markov’s inequality, using the third
absolute central moment of the BG distribution, gives limits that are a little tighter than Cn.
However, this requires estimation of the third moment of the BG population, which cannot be
estimated from small datasets nearly as accurately as the standard deviation.
5.0 Recommendations
As with all statistical analyses, the first step should be to plot the data in various ways in order
to become acquainted with its large-scale features. This is also good insurance against data
entry errors and against errors in applying or interpreting statistical tests. A couple of example
data plots appear on the pages following the text of this paper. They were created using
SPLUS 4.5. From the first plot, it is evident that chromium (Cr) and manganese (Mn) concentrations are more closely related in the site dataset than they are in the BG dataset. This could
be explored using various multivariate statistical techniques, cross-variogram analysis (a spatial statistics technique), geochemistry (if a mineralogical analysis of the soils has been done),
and comparison to chemical analysis of the wastes that may have impacted the site.
Another noteworthy feature of the data plots is that, except for the two site locations which are
clearly higher than BG, the site data shows substantially less spread for Cr and Mn and a tighter association between Cr and Mn than does the BG data. It is often the case that BG sample
locations tend to be more widely spaced than site sample locations. Spatially correlated regional variation combined with larger average sample location spacing for BG can easily explain this phenomenon. This phenomenon may invalidate the shift model but not the stochastic ordering model.
The second step is to apply statistical screening. We recommend a three-pronged approach to
screening, with simultaneous testing using:
• the Mann-Whitney-Wilcoxon test to detect widespread low-level anthropogenic contamination (the shift alternative or, more generally, stochastic ordering),
• Rosenbaum’s two-sample location test to detect localized higher concentration anthropogenic contamination (stochastic ordering in the right tails), and
• a UPL for n additional observations based on Chebyshev’s inequality to protect against
extreme “hot spots”.
We recommend using the Bonferroni inequality to set the level of the combined procedure
based on using the rank tests. For an overall level of $\alpha_{comb} = 0.05$, we recommend using $\alpha = 0.02$
for the Mann-Whitney-Wilcoxon test and for Rosenbaum’s two-sample location test and
using $\alpha = 0.01$ for a $(1-\alpha)100\% = 99\%$ UPL based on Chebyshev’s inequality. The levels of
the individual tests then sum to 0.05 to give an approximate overall level of 0.05.
6.0 References
Cochran, W.G. (1977). Sampling Techniques: Third Edition. New York, John Wiley & Sons, 428 pp.
Cressie, N.A.C. (1991). Statistics for Spatial Data. New York, John Wiley & Sons, 900 pp.
De Gruijter, J.J. and ter Braak, C.J.F. (1990). Model-Free Estimation from Spatial Samples: A Reappraisal of Classical Sampling Theory. Mathematical Geology. Vol. 22. pp. 407-415.
Isaaks, E. and Srivastava, R.M. (1989). An Introduction to Applied Geostatistics. Oxford University Press, New York. 561 pp.
Journel, Andre (1983). Nonparametric Estimation of Spatial Distributions. Mathematical Geology. Vol. 15. pp. 445-467.
Lehmann, Erich (1953). The Power of Rank Tests. Annals of Mathematical Statistics. Vol. 24. pp. 23-43.
Randles, R.H. and D.A. Wolfe (1991). Introduction to the Theory of Nonparametric Statistics. Krieger Publishing Co., Malabar, Florida. 450 pp.
Rosenbaum, S. (1954). Tables for a Nonparametric Test of Location. Annals of Mathematical Statistics. Vol. 25. pp. 146-150.
Sukhatme, Shashikala (1992). Powers of Two-Sample Rank Tests Under the Lehmann Alternatives. The American Statistician. Vol. 46. pp. 212-214.
[Figure 3: Example Site/BG Scatterplot. Axes: Cr (mg/kg) and Mn (mg/kg); BG and Site samples are plotted separately.]
[Figure 4: Example Site/BG Boxplots. Panels: Mn by Location and Cr by Location (mg/kg), each comparing BG and Site.]
```
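The planning and screening calculations in sections 3.1, 4.3 and 4.4 are simple enough to script. The following is a minimal sketch rather than a validated implementation: the function names (`equal_sample_size`, `site_sample_size`, `rosenbaum_critical_value`, `chebyshev_upl`) are hypothetical, and the code takes the formulas as written in the text, namely the sample-size equations with the `0.5 z^2` correction term, the null distribution [4.10] of Rosenbaum's statistic T, and the Chebyshev-based limit with multiplier `1/sqrt(alpha)`. It uses only the Python standard library.

```python
from fractions import Fraction
from math import ceil, comb, sqrt
from statistics import NormalDist, mean, stdev


def equal_sample_size(cv, f, alpha=0.05, beta=0.20):
    """Section 3.1 with m = n: 2*CV^2*(z_{1-a} + z_{1-b})^2 / f^2 + z_{1-a}^2."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(1 - beta)
    return ceil(2 * cv**2 * (z_a + z_b) ** 2 / f**2 + z_a**2)


def site_sample_size(M, cv, f, alpha=0.05, beta=0.20, n0=7):
    """Section 3.1 with the BG count fixed at m = M and a floor of n0 site samples."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(1 - beta)
    # Right-hand side of the planning equation Mn/(M+n) = ...
    rhs = cv**2 * (z_a + z_b) ** 2 / f**2 + 0.5 * z_a**2
    if M <= rhs:
        raise ValueError("M is too small: no finite n reaches the requested power")
    return max(n0, ceil(M * rhs / (M - rhs)))


def rosenbaum_null_pmf(m, n):
    """Exact P0(T = t) for t = 0..n, following equation [4.10]."""
    total = comb(m + n, n)
    return [Fraction(comb(m + n - t - 1, n - t), total) for t in range(n + 1)]


def rosenbaum_critical_value(m, n, alpha):
    """Smallest C with P0(T >= C) <= alpha; None if even C = n is too liberal."""
    pmf = rosenbaum_null_pmf(m, n)
    tail = Fraction(0)
    critical = None
    for t in range(n, -1, -1):  # accumulate the upper tail from t = n downward
        tail += pmf[t]
        if tail > alpha:
            break
        critical = t
    return critical


def chebyshev_upl(bg, alpha=0.01):
    """Xbar + (1/sqrt(alpha)) * sqrt((m+1)/m) * S for a list of BG results."""
    m = len(bg)
    multiplier = 1 / sqrt(alpha)  # 10 for alpha = 0.01
    return mean(bg) + multiplier * sqrt((m + 1) / m) * stdev(bg)


if __name__ == "__main__":
    # Illustrative inputs only; CV, f and the BG values are made up.
    print(equal_sample_size(cv=1.0, f=0.5))
    print(site_sample_size(M=50, cv=1.0, f=0.5))
    print(rosenbaum_critical_value(m=15, n=10, alpha=0.05))
    print(chebyshev_upl([1.0, 2.0, 3.0, 4.0, 5.0]))
```

With m = 15 BG samples and n = 10 site samples, for example, the exact tail probabilities are P0(T ≥ 3) = 6/115 ≈ 0.052 and P0(T ≥ 4) ≈ 0.017, so the sketch returns a critical value of 4 at both the 0.05 level and the 0.02 level recommended in section 5.0. Exact rational arithmetic (`Fraction`) is used for the null distribution so that the comparison with the nominal level is not affected by rounding.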