When is the Wilcoxon-Mann-Whitney Procedure a Test of Location?

Scott Parker¹, Robert W. Jernigan², and Joshua M. Lansky²
¹ Department of Psychology, American University
² Department of Mathematics and Statistics, American University
May 16, 2013
Abstract
We detail difficulties with the usual discussion of the Wilcoxon-Mann-Whitney nonparametric procedure as a test of location. Although the procedure is invariant under
monotone transformations of the variables tested, its use as a test of location or shift
is said not to be so. The thinking is that the shift model requires the assumption of
parallel cdfs. A result presented here shows that this condition is abundantly fulfilled,
for there are an infinite number of monotone transformations of the variable under
study that produce parallel cdfs, so long as the original cdfs intersect nowhere or
everywhere. This gives rise to an infinite number of shift parameters, invalidating the
notions that there is one true shift parameter and that we can estimate it. Comparing
the cdfs using the Probability of Superiority is not prone to this difficulty.
Imagine that two experimenters, Bert and Ernie, want to investigate whether listening
to music affects the visual perception of size. Suppose that they get two groups of people
and ask each person to draw a circle the size of a quarter, some people doing this while
listening to music and others doing it in silence. The null hypothesis is, of course, that there
is no difference between the sizes of circles drawn in silence or with music. In particular,
Bert and Ernie want to test for a difference in location between the two sorts of circles.
The statistical analysis compares the two groups of circles’ sizes.
To guard against problems engendered by non-normality of population distributions
Bert and Ernie decide to use a Mann-Whitney (MW) test rather than a t-test. (Our
argument will apply equally well to Wilcoxon’s equivalent test based on rank-sums.) Each
of them takes the data home and analyzes it overnight. Bert uses the circles’ diameters as
the measure of circle size and calculates the value of the test statistic, $U_X$. Let the two
sets of sample diameters be denoted $x_i$ ($i = 1, \ldots, m$) and $y_j$ ($j = 1, \ldots, n$), as realizations of random variables $X$ and
$Y$. Then $U_X$ is the number of pairs with $x_i > y_j$ plus one half the number of pairs with $x_i = y_j$.
($U_Y$ is defined analogously, and clearly $U_X + U_Y = mn$.)
Ernie, however, measures the circles’ areas rather than their diameters and gets the value of
$U_X$ based on those numerically different sets of $x_i$ and $y_j$. In the morning, Bert and Ernie
compare notes and find (as expected) that their $U$-values are equal, the MW procedure
depending only on ordinal relations among the two samples’ data sets. But can they come
to altogether the same conclusions concerning their data analyses? Distressingly, the usual
approach to the MW procedure suggests that they cannot.
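As a minimal sketch of the computation Bert and Ernie each carry out, the following Python fragment computes $U_X$ with the half-count for ties, once on diameters and once on areas. The sample values and the function name `mann_whitney_u` are invented for illustration; they are not data from any experiment described here.

```python
import math

def mann_whitney_u(xs, ys):
    """Count pairs with x > y, plus one half for each tied pair."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical diameters (illustrative numbers only).
diam_x = [2.0, 2.4, 2.6]        # e.g. circles drawn with music
diam_y = [2.4, 2.2, 2.5, 2.8]   # e.g. circles drawn in silence

def area(d):
    """Area of a circle of diameter d; a monotone function of d for d > 0."""
    return math.pi * d * d / 4.0

area_x = [area(d) for d in diam_x]
area_y = [area(d) for d in diam_y]

u_bert = mann_whitney_u(diam_x, diam_y)    # Bert: diameters
u_ernie = mann_whitney_u(area_x, area_y)   # Ernie: areas
u_y = mann_whitney_u(diam_y, diam_x)       # U_Y on the diameter scale

print(u_bert, u_ernie, u_bert + u_y)       # 4.5 4.5 12.0
```

Both analysts obtain the same value because only the ordering of the pooled observations enters the count, and $U_X + U_Y$ equals $mn = 12$ for these toy samples.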
The null distribution of the MW $U$-statistic arises from the hypothesis that the distributions of circle sizes (those drawn with music and those drawn in silence) are identical.
Usually this is stated in terms of the cdf’s, $F$ and $G$, for the two populations from which
the samples come. Denoting by $X$ the random variable we are using (diameter or area),
the null hypothesis is $H_0: F(x) = G(x)$ (this is the statement of $H_0$ from Conover (1999),
p. 273; Gibbons & Chakraborti (2011), p. 262; Higgins (2004), p. 28; Hollander & Wolfe
(1999), p. 106; Lehmann (1986), p. 314; Neuhäuser (2012), p. 5; an equivalent but not formal
statement is in Sprent & Smeeton (2007), p. 152).
Our experimenters, Bert and Ernie, are interested in determining whether or not circles
drawn in one circumstance tend to be bigger than circles drawn in the other, and if so by
how much. The MW is not fully distribution-free; it requires some presumed distributional
restriction in order to be a “pure” test of location (a point not made by all textbooks). The
usual restriction, which is indeed sufficient, is to require that the cdf’s, F and G, be parallel
with a constant horizontal separation, that is, that $F(x) = G(x + \Delta)$ for all $x$. Hollander & Wolfe
(1999) and Neuhäuser (2012) refer to this as the location-shift model or the translation
model, where $\Delta$ is the location shift; Lehmann (1998) and Gibbons & Chakraborti (2011)
call it the shift model. In this setting, the null hypothesis is $H_0: \Delta = 0$. It appears as
Assumption 4 in Conover (1999, p. 276). It permits the Mann-Whitney procedure to compare
population medians (as in, e.g., Hettmansperger & McKean (2011), p. 85) or means (as in,
e.g., Hollander & Wolfe (1999), p. 107), or both (as in, e.g., Higgins (2004), p. 29). We note
that although the original Mann and Whitney (1947) alternative $H_1: F > G$ entails the
inference that both the means and medians differ, that is not the usual textbook approach
to the issue, nor is it what most users of the MW test want to know.
But note that in the experiment on circle sizes, Bert and Ernie have a problem. The
location-shift model cannot apply to both of the two variables, diameter and area. Representing diameter by $d$ and area by $a$, unless the shift is zero we cannot have both $F(d) = G(d + \Delta_d)$
and $F(a) = G(a + \Delta_a)$. The nonlinear relation of $d$ to $a$ ensures that, for example, unless
$\Delta_d = \Delta_a = 0$, the variances of $F$ and $G$ cannot be equal for both variables, $d$ and $a$. Thus,
using the location-shift model as $H_1$, the MW test is a pure and exact test of location for
at most one of those cases. That is, Bert can validly compare the means or medians of
the two groups’ circles only if Ernie cannot, and vice versa. This seems problematic and
unsatisfying for the following reason.
The MW statistic counts the results of exhaustive ordinal comparisons between the
sample from F and the sample from G. The value of the MW statistic is altogether
insensitive to monotone transformations of the variable under study. The values of the
MW statistic computed for diameter, area, and for even a newly invented third variable
such as $(ad)^{1/2}$ must all be equal. In addition, the null distributions of the MW statistic
for the two samples must be equivalent to each other, irrespective of which of the three
variables we use to measure circle size. So if a test for any one of them is a test of location,
the tests on all three of them should likewise be tests of location.
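A short worked equation may help show why the shift model cannot hold on both scales even though the statistic is identical on each. Suppose, purely as an illustrative assumption, that the shift model holds exactly for diameter, $F(d) = G(d + \Delta_d)$. Write $q_F(p)$ and $q_G(p)$ for the $p$-th quantiles of diameter under $F$ and $G$, and $q_F^{(a)}(p)$, $q_G^{(a)}(p)$ for the corresponding quantiles of area $a = \pi d^2/4$ (this quantile notation is introduced here only for the example). Because quantiles are carried along by an increasing transformation:

```latex
\begin{align*}
q_G(p) &= q_F(p) + \Delta_d, \\
q_G^{(a)}(p) - q_F^{(a)}(p)
  &= \frac{\pi}{4}\bigl(q_F(p) + \Delta_d\bigr)^2 - \frac{\pi}{4}\, q_F(p)^2
   = \frac{\pi}{4}\,\Delta_d\,\bigl(2\,q_F(p) + \Delta_d\bigr).
\end{align*}
```

The horizontal gap between the two area cdf's therefore varies with $p$ unless $\Delta_d = 0$, so the area cdf's are not parallel when the diameter cdf's are.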
In order for that to be the case, it suffices that there exists a monotone transformation,
$\varphi$, such that the location-shift model holds on the transformed scale, i.e., $F(\varphi(x)) = G(\varphi(x + \Delta))$ for all $x$. This
would relax the requirements for the MW test’s being a test of location, generalizing the
valid applicability of the test to cases where, for all $x$, a) $F(x) > G(x)$, b) $F(x) = G(x)$, or c)
$F(x) < G(x)$. The null hypothesis remains $H_0: \Delta = 0$. Obviously, this would make the
MW a test of location more generally than in cases where the location shift model on
X itself holds. It would extend the MW test’s validity throughout the obvious family of
variables for which the test statistic is invariant.
And, fortunately, if F and G are strictly increasing and intersect everywhere or nowhere,
i.e., if a) or b) or c) above holds, then there is indeed such a monotone transform ϕ. A proof
(due to JML) is in the Appendix. In its constructive development, it bears resemblance to
a similar approach taken in Levine (1970).
Thus the Wilcoxon-Mann-Whitney procedure is more nearly distribution-free than it is
usually said to be. The only requirement on F and G is that they be either a) everywhere
separate or b) identical. We no longer need require that they be parallel over the untransformed scale in which the data are reported. The null hypothesis of the test remains H0 :
∆ = 0 in a transformed (and unknown) variable that renders F and G parallel, i.e., in
any variable for which the location-shift model holds. And the original Mann-Whitney
alternative H1 : F > G entails that the shift model holds.
But although this reveals the generality of the circumstances in which the MW is a valid
test of location, it also leads to a problem in estimating the magnitude of the difference
between F and G. One natural choice has been to estimate the shift parameter, ∆, and
a popular estimate is that due to Hodges and Lehmann (1963). But the result in the
Appendix shows it to have a severe shortcoming.
Let us denote by $x_i$ and $y_j$ the elements of the samples from $F$ and $G$. The Hodges-Lehmann estimate of $\Delta$ is the median of all the differences $(x_i - y_j)$. The problem is that
those differences are calculated in the untransformed scale of the original data, a scale in
which the shift model may well not hold. In addition, note that for nonparallel F and G
there are infinitely many monotone transformations that render F and G parallel because
there are infinitely many choices for the value of $a_0$ in the proof in the Appendix. Therefore
there are infinitely many shift parameters that could imaginably be estimated, even though
none of the appropriate rescalings are known. This suggests that estimating “the” shift
parameter is not a useful activity.
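The scale dependence of the Hodges-Lehmann estimate can be illustrated with a small Python sketch. The data are toy values, and the power maps $t \mapsto t^p$ are arbitrary monotone rescalings chosen for convenience; they are not the transformations constructed in the Appendix, and the function names are hypothetical.

```python
from statistics import median

def hodges_lehmann(xs, ys):
    """Median of all pairwise differences x_i - y_j."""
    return median(x - y for x in xs for y in ys)

def u_statistic(xs, ys):
    """Number of pairs with x > y (the toy data below contain no ties)."""
    return sum(1 for x in xs for y in ys if x > y)

# Toy positive data, chosen only for illustration.
xs = [2.0, 3.0, 5.0]
ys = [1.0, 4.0]

# A family of monotone rescalings t -> t**p on positive values.
results = {}
for p in (1, 2, 3):
    tx = [x ** p for x in xs]
    ty = [y ** p for y in ys]
    results[p] = (u_statistic(tx, ty), hodges_lehmann(tx, ty))
    print(p, results[p])
```

For these values the count stays at 4 on every scale, while the median pairwise difference is 1.0, 5.5, and 16.5 for $p = 1, 2, 3$. Each scale yields its own candidate "shift", and the MW statistic gives no reason to prefer one over another.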
There is an approach to estimating the difference between F and G based on what
Grissom and Kim (2012) call the Probability of Superiority, $P(X_i > Y_j)$. Cliff (1996) used
this approach, defining $\delta = P(X_i > Y_j) - P(X_i < Y_j)$, a quantity he calls dominance (see
also Agresti (2010)). Clearly this quantity ranges from $-1$ to $+1$ and is invariant over all
monotone transformations of the variable in which $X$ and $Y$ are measured. Furthermore,
it is readily estimated by $\hat{\delta} = (U_X - U_Y)/(mn)$. Confidence intervals for it are discussed in
del Rosal, San Luis and Sanchez-Bruno (2003). Confidence intervals for the Probability of
Superiority are discussed in Newcombe (2006a and 2006b). The virtue of adopting one of
these as the effect size is that they are immune to any single monotone transformation of
both X and Y .
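A minimal Python sketch of the two estimates follows. It uses toy data and hypothetical function names, and it checks invariance under just one monotone map (the logarithm), which is an illustration and not a proof. Averaging the sign of $x_i - y_j$ over all pairs is the same as $(U_X - U_Y)/(mn)$, since tied pairs contribute one half to each of $U_X$ and $U_Y$ and cancel.

```python
import math

def dominance(xs, ys):
    """Estimate delta = P(X > Y) - P(X < Y) by averaging sign(x - y) over pairs."""
    total = 0
    for x in xs:
        for y in ys:
            total += (x > y) - (x < y)   # +1, 0, or -1
    return total / (len(xs) * len(ys))

def prob_superiority(xs, ys):
    """Estimate P(X > Y), counting ties as one half, i.e. U_X / (mn)."""
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    return u / (len(xs) * len(ys))

# Toy positive data, chosen only for illustration.
xs = [2.0, 3.0, 5.0]
ys = [1.0, 4.0]

# One monotone rescaling of both samples.
log_xs = [math.log(x) for x in xs]
log_ys = [math.log(y) for y in ys]

print(dominance(xs, ys), dominance(log_xs, log_ys))
print(prob_superiority(xs, ys), prob_superiority(log_xs, log_ys))
```

Here $U_X = 4$ and $U_Y = 2$ with $mn = 6$, so the dominance estimate is $1/3$ and the estimated Probability of Superiority is $2/3$ on both scales.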
Discussion of the power of the MW test should refer to power against non-null values of
one of these quantities but not to differences between means or medians. Different choices
of the monotonic transformation that renders F and G parallel can change the latter two
effect size measures. The Probability of Superiority and the dominance, however, remain
stable under all such transformations.
We note also that the sort of logic we apply here to generalizing the conditions under
which the Wilcoxon-Mann-Whitney procedure is valid can be extended to the Kruskal-Wallis test for three or more groups. However, the specification of the sufficient conditions
on the cdf’s that permits them to be rendered parallel becomes quite complex when the
number of cdf’s grows beyond two (Levine, 1970).
Appendix: Transformations Rendering Two Curves Parallel
Let $F$ and $G$ be increasing continuous functions on $\mathbb{R}$ with range $(0, 1)$, and suppose $G < F$.
We want to show there is a homeomorphism $\varphi: \mathbb{R} \to \mathbb{R}$ such that the graphs of $F \circ \varphi$
and $G \circ \varphi$ are horizontal translates of each other, i.e., such that for all $x$, $(F \circ \varphi)(x) =
(G \circ \varphi)(x + \Delta)$ for some constant $\Delta$. We will define $\varphi$ as the limit of an inductively defined
sequence of homeomorphisms.
Define a sequence of nonnegative real numbers as follows. Fix a real number $a_0$. Assuming $a_i$ has been defined, let $a_{i+1} = G^{-1}(F(a_i))$. Then $\{a_i\}$ is increasing since $G < F$.
Moreover, we claim $\lim_{i\to\infty} a_i = \infty$. For if not, $\{a_i\}$ is bounded and hence has a limit $a$
by the completeness axiom. Then, since $F$ and $G$ are continuous,
\[
F(a) = \lim_{i\to\infty} F(a_i) = \lim_{i\to\infty} G(a_{i+1}) = G(a),
\]
contradicting $F > G$.
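The behaviour of the sequence $\{a_i\}$ can be explored numerically. The sketch below assumes, purely for illustration, that $F$ is the standard logistic cdf and $G = F^2$, which gives $G < F$ everywhere with non-parallel curves. Neither choice comes from the argument above, and the starting value $a_0 = 0$ is likewise an arbitrary choice.

```python
import math

def F(x):
    """Standard logistic cdf (an illustrative choice of F)."""
    return 1.0 / (1.0 + math.exp(-x))

def F_inv(p):
    """Inverse of the logistic cdf."""
    return math.log(p / (1.0 - p))

def G(x):
    """G = F**2, so G < F everywhere because 0 < F < 1."""
    return F(x) ** 2

def G_inv(p):
    """Inverse of G: solve F(x)**2 = p."""
    return F_inv(math.sqrt(p))

a = [0.0]                          # a_0 = 0 (arbitrary starting point)
for _ in range(20):
    a.append(G_inv(F(a[-1])))      # a_{i+1} = G^{-1}(F(a_i))

# Successive increments a_{i+1} - a_i.
steps = [later - earlier for earlier, later in zip(a, a[1:])]
print(a[:4])
print(min(steps), a[-1])
```

For this particular pair the increments work out to $\ln(1 + F(a_i)^{-1/2}) > \ln 2$, so the sequence increases without bound, in line with the general argument.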
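Let $\varphi_0: \mathbb{R} \to \mathbb{R}$ be the identity transformation. Define a map $\varphi_1: \mathbb{R} \to \mathbb{R}$ by
\[
\varphi_1(x) =
\begin{cases}
x, & \text{if } x < a_1, \\
G^{-1}(F(x - a_1)), & \text{if } a_1 \le x \le 2a_1, \\
x + a_2 - 2a_1, & \text{if } x > 2a_1.
\end{cases}
\]
Note that on each of the individual intervals above, $\varphi_1$ is continuous and increasing. See
Figures 1, 2, and 3. (The construction is written here for $a_0 = 0$, so that $a_1 = G^{-1}(F(0))$.) Also,
\begin{align*}
G^{-1}(F(a_1 - a_1)) &= G^{-1}(F(0)) = a_1, \\
G^{-1}(F(2a_1 - a_1)) &= G^{-1}(F(a_1)) = a_2 = 2a_1 + (a_2 - 2a_1).
\end{align*}
Thus $\varphi_1$ is continuous at $a_1$ and $2a_1$. Hence $\varphi_1$ is an increasing homeomorphism $\mathbb{R} \to \mathbb{R}$.
Finally, if $x \in [0, a_1]$, then
\[
F(\varphi_1(x)) = F(x) = G(G^{-1}(F(x))) = G(\varphi_1(x + a_1)).
\]
Thus the graph of $G \circ \varphi_1$ on $[a_1, 2a_1]$ is a translate (by $a_1$) of the graph of $F \circ \varphi_1$ on $[0, a_1]$.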
Now suppose we’ve defined an increasing homeomorphism $\varphi_n: \mathbb{R} \to \mathbb{R}$ with the properties that
1. $\varphi_n(k a_1) = a_k$ for $k = 1, 2, \ldots, n + 1$.
2. $F(\varphi_n(x)) = G(\varphi_n(x + a_1))$ for $x \in [0, n a_1]$.
Observe that $\varphi_1$ satisfies these properties with $n = 1$.
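Now define $\varphi_{n+1}: \mathbb{R} \to \mathbb{R}$ by
\[
\varphi_{n+1}(x) =
\begin{cases}
\varphi_n(x), & \text{if } x < (n + 1)a_1, \\
G^{-1}(F(\varphi_n(x - a_1))), & \text{if } (n + 1)a_1 \le x \le (n + 2)a_1, \\
x + a_{n+2} - (n + 2)a_1, & \text{if } x > (n + 2)a_1.
\end{cases}
\]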
Note that on each of the individual intervals above, $\varphi_{n+1}$ is continuous and increasing.
Also,
\begin{align*}
\varphi_{n+1}((n + 1)a_1) &= G^{-1}(F(\varphi_n((n + 1)a_1 - a_1))) \\
&= G^{-1}(F(\varphi_n(n a_1))) \\
&= G^{-1}(F(a_n)) \\
&= a_{n+1} \\
&= \varphi_n((n + 1)a_1)
\end{align*}
and
\begin{align*}
\varphi_{n+1}((n + 2)a_1) &= G^{-1}(F(\varphi_n((n + 2)a_1 - a_1))) \\
&= G^{-1}(F(\varphi_n((n + 1)a_1))) \\
&= G^{-1}(F(a_{n+1})) \\
&= a_{n+2} \\
&= (n + 2)a_1 + [a_{n+2} - (n + 2)a_1].
\end{align*}
This shows that $\varphi_{n+1}$ is continuous at $(n + 1)a_1$ and $(n + 2)a_1$, hence is an increasing
homeomorphism $\mathbb{R} \to \mathbb{R}$. Since $\varphi_n$ satisfies property 1 above and $\varphi_{n+1}$ agrees with $\varphi_n$ on
$(-\infty, (n + 1)a_1]$, the second calculation shows that $\varphi_{n+1}$ also satisfies property 1 with $n + 1$
replacing $n$.
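Similarly, since $\varphi_{n+1}$ and $\varphi_n$ agree on $(-\infty, (n + 1)a_1]$, and since $\varphi_n$ satisfies property
2, we have that
\[
F(\varphi_{n+1}(x)) = G(\varphi_{n+1}(x + a_1))
\]
for $x \in [0, n a_1]$.
Now suppose that $x \in [n a_1, (n + 1)a_1]$. Then
\begin{align*}
F(\varphi_{n+1}(x)) &= F(\varphi_n(x)) && \text{(by the definition of } \varphi_{n+1}\text{)} \\
&= G(G^{-1}(F(\varphi_n(x)))) \\
&= G(\varphi_{n+1}(x + a_1)) && \text{(by the definition of } \varphi_{n+1}\text{).}
\end{align*}
Thus $\varphi_{n+1}$ satisfies property 2 with $n + 1$ replacing $n$.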
By induction, we have therefore constructed an infinite sequence $\{\varphi_n\}$ of increasing
homeomorphisms with the above properties. Define $\varphi^{+}: \mathbb{R} \to \mathbb{R}$ by
\[
\varphi^{+}(x) = \lim_{n\to\infty} \varphi_n(x).
\]
For any $x$, this limit exists since the terms of the sequence are actually equal for $n$ sufficiently large. Moreover, $\varphi^{+}$ is continuous since $\varphi^{+}$ agrees with the continuous function $\varphi_n$
on a neighborhood of $x$ for $n$ sufficiently large. It follows that $\varphi^{+}: \mathbb{R} \to \mathbb{R}$ is an increasing
homeomorphism. Moreover, it follows from property 2 that $F(\varphi^{+}(x)) = G(\varphi^{+}(x + a_1))$
for $x \in [0, \infty)$.
A similar argument proceeding in the negative direction will produce a homeomorphism
$\varphi: \mathbb{R} \to \mathbb{R}$ such that $F(\varphi(x)) = G(\varphi(x + a_1))$ for $x \in (-\infty, \infty)$.
We note that it is easily seen that different choices of $a_0$ will, in general, give rise to
different homeomorphisms $\varphi$.
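The construction on $[0, \infty)$ can be sketched numerically. Unrolling the recursion gives $\varphi^{+}(x) = H^{k}(x - k a_1)$ for $x \in [k a_1, (k + 1)a_1]$, where $H = G^{-1} \circ F$ is shorthand introduced here. The sketch assumes, for illustration only, a logistic $F$, $G = F^2$, and $a_0 = 0$; it does not implement the extension to negative $x$.

```python
import math

def F(x):
    """Standard logistic cdf (an illustrative choice of F)."""
    return 1.0 / (1.0 + math.exp(-x))

def F_inv(p):
    return math.log(p / (1.0 - p))

def G(x):
    """G = F**2, so G < F everywhere."""
    return F(x) ** 2

def G_inv(p):
    return F_inv(math.sqrt(p))

def H(x):
    """H = G^{-1} o F, so that a_{i+1} = H(a_i)."""
    return G_inv(F(x))

A1 = H(0.0)   # a_1, taking a_0 = 0

def phi(x):
    """Limit map on x >= 0: identity on [0, a_1), then H once per full step of a_1."""
    if x < 0:
        raise ValueError("this sketch covers only x >= 0")
    k = int(x // A1)      # number of whole steps of length a_1 below x
    t = x - k * A1        # remainder, lying in [0, a_1) up to rounding
    for _ in range(k):
        t = H(t)
    return t

grid = [0.05 * i for i in range(81)]   # x from 0 to 4
values = [phi(x) for x in grid]

# Largest violation of F(phi(x)) = G(phi(x + a_1)) over the grid.
max_gap = max(abs(F(phi(x)) - G(phi(x + A1))) for x in grid)
print(A1, max_gap)
```

On this grid the two transformed curves agree up to floating-point error after a horizontal translation by $a_1$, and the computed $\varphi^{+}$ is increasing.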
References
Agresti, A. (2010), Analysis of Ordinal Categorical Data (2nd ed.), New York: Wiley.
Cliff, N. (1993), Dominance Statistics: Ordinal Analyses to Answer Ordinal Questions,
Psychological Bulletin, 114, 494–509.
Conover, W. J. (1999), Practical Nonparametric Statistics (3rd ed.), New York: Wiley.
del Rosal, A. B., San Luis, C., & Sanchez-Bruno, A. (2003), Dominance Statistics: A
Simulation Study on the d Statistic, Quality & Quantity, 37, 303–316.
Gibbons, J. D., & Chakraborti, S. (2011), Nonparametric Statistical Inference (5th ed.),
New York: Marcel Dekker.
Grissom, R. J., & Kim, J. J. (2012), Effect Sizes for Research (2nd ed.), New York:
Routledge/Taylor & Francis.
Hettmansperger, T. P., & McKean, J. W. (2011), Robust Nonparametric Statistical Methods (2nd ed.), Boca Raton, FL: Chapman & Hall/CRC.
Higgins, J. J. (2004), Introduction to Modern Nonparametric Statistics, Belmont CA:
Duxbury.
Hodges, J. L., & Lehmann, E. L. (1963), Estimates of Location Based on Rank Tests,
Annals of Mathematical Statistics, 34, 598–611.
Hollander, M., & Wolfe, D. A. (1999), Nonparametric Statistical Methods (2nd ed.), New
York: Wiley.
Lehmann, E. L. (1986), Testing Statistical Hypotheses (2nd ed.), New York: Wiley.
Lehmann, E. L. (1998), Nonparametrics: Statistical Methods Based on Ranks, Upper
Saddle River NJ: Prentice-Hall.
Levine, M. V. (1970), Transformations That Render Curves Parallel, Journal of Mathematical Psychology, 7, 410–443.
Mann, H. B., & Whitney, D. R. (1947), On a Test of Whether One of Two Random Variables
is Stochastically Larger Than the Other, Annals of Mathematical Statistics, 18, 50–60.
Neuhäuser, M. (2012), Nonparametric Statistical Tests: A Computational Approach, Boca
Raton, FL: Chapman & Hall/CRC.
Newcombe, R. G. (2006a), Confidence Intervals for an Effect Size Measure Based on the
Mann-Whitney statistic. Part I: General Issues and Tail-area-Based Methods, Statistics
in Medicine, 25, 543–557.
Newcombe, R. G. (2006b), Confidence Intervals for an Effect Size Measure Based on the
Mann-Whitney statistic. Part 2: Asymptotic Methods and Evaluation, Statistics in
Medicine, 25, 559–573.
Sprent, P., & Smeeton, N. C. (2007), Applied Nonparametric Statistical Methods (4th ed.),
Boca Raton, FL: Chapman & Hall/CRC.
Figure Captions
Figure 1
Consider two continuous, monotonic curves $0 \le G(x) < F(x) \le 1$ on the real line. Define
$a_1 = G^{-1}(F(0))$ and $a_2 = G^{-1}(F(a_1))$.
Figure 2
The interval $[a_1, a_2]$ is first transformed by $\varphi_1$ so that $G$ (in yellow) on the interval $[a_1, 2a_1]$
is a translation of $F$ (in red) on the interval $[0, a_1]$. Note $F$ (in green) on this interval,
above $F(a_1)$, is also affected.
Figure 3
The transformed functions are now continuous only up to $2a_1$. To restore full continuity,
$F$ and $G$ are redefined on the interval $[2a_1, \infty)$ as $F(x + (a_2 - 2a_1))$ and $G(x + (a_2 - 2a_1))$,
producing $F \circ \varphi_1$ and $G \circ \varphi_1$. See text for more details.