When is the Wilcoxon-Mann-Whitney Procedure a Test of Location?

Scott Parker (1), Robert W. Jernigan (2), and Joshua M. Lansky (2)

(1) Department of Psychology, American University
(2) Department of Mathematics and Statistics, American University

May 16, 2013

Abstract

We detail difficulties with the usual discussion of the Wilcoxon-Mann-Whitney nonparametric procedure as a test of location. Although the procedure is invariant under monotone transformations of the variables tested, its use as a test of location or shift is said not to be so. The thinking is that the shift model requires the assumption of parallel cdfs. A result presented here shows that this condition is abundantly fulfilled, for there are an infinite number of monotone transformations of the variable under study that produce parallel cdfs, so long as the original cdfs intersect nowhere or everywhere. This gives rise to an infinite number of shift parameters, invalidating the notions that there is one true shift parameter and that we can estimate it. Comparing the cdfs using the Probability of Superiority is not prone to this difficulty.

Imagine that two experimenters, Bert and Ernie, want to investigate whether listening to music affects the visual perception of size. Suppose that they get two groups of people and ask each person to draw a circle the size of a quarter, some people doing this while listening to music and others doing it in silence. The null hypothesis is, of course, that there is no difference between the sizes of circles drawn in silence or with music. In particular, Bert and Ernie want to test for a difference in location between the two sorts of circles. The statistical analysis compares the circle sizes of the two groups. To guard against problems engendered by non-normality of population distributions, Bert and Ernie decide to use a Mann-Whitney (MW) test rather than a t-test. (Our argument will apply equally well to Wilcoxon's equivalent test based on rank-sums.)
Each of them takes the data home and analyzes it overnight. Bert uses the circles' diameters as the measure of circle size and calculates the value of the test statistic, $U_X$. Let the two sets of sample diameters be denoted $x_i$ and $y_j$, as realizations of random variables $X$ and $Y$, where $i = 1, \ldots, m$ and $j = 1, \ldots, n$. Then $U_X$ is the count of all occasions when $x_i > y_j$ plus $\tfrac{1}{2}$ the count of occasions when $x_i = y_j$. ($U_Y$ is defined analogously, and clearly $U_X + U_Y = mn$.) Ernie, however, measures the circles' areas rather than their diameters and gets the value of $U_X$ based on those numerically different sets of $x_i$ and $y_j$. In the morning, Bert and Ernie compare notes and find (as expected) that their $U$-values are equal, the MW procedure depending only on ordinal relations among the two samples' data sets. But can they come to altogether the same conclusions concerning their data analyses? Distressingly, the usual approach to the MW procedure suggests that they cannot.

The null distribution of the MW $U$-statistic arises from the hypothesis that the distributions of circle sizes (those drawn with music and those drawn in silence) are identical. Usually this is stated in terms of the cdf's, $F$ and $G$, for the two populations from which the samples come. Denoting by $X$ the random variable we are using (diameter or area), the null hypothesis is $H_0 : F(x) = G(x)$ (this is the statement of $H_0$ from Conover (1999), p.273; Gibbons & Chakraborti (2011), p.262; Higgins (2004), p.28; Hollander & Wolfe (1999), p.106; Lehmann (1986), p.314; Neuhäuser (2012), p.5; an equivalent but not formal statement is in Sprent & Smeeton (2007), p.152).

Our experimenters, Bert and Ernie, are interested in determining whether or not circles drawn in one circumstance tend to be bigger than circles drawn in the other, and if so by how much. The MW is not fully distribution-free; it requires some presumed distributional restriction in order to be a "pure" test of location (a point not made by all textbooks).
The usual restriction, which is indeed sufficient, is to require that the cdf's, $F$ and $G$, be parallel with a constant horizontal separation, that is, that $F(x) = G(x + \Delta)$. Hollander & Wolfe (1999) and Neuhäuser (2012) refer to this as the location-shift model or the translation model, where $\Delta$ is the location shift; Lehmann (1998) and Gibbons & Chakraborti (2011) call it the shift model. In this setting, the null hypothesis is $H_0 : \Delta = 0$. It appears as Assumption 4 in Conover (1999, p.276). It permits the Mann-Whitney procedure to compare population medians (as in, e.g., Hettmansperger & McKean (2012), p.85) or means (as in, e.g., Hollander & Wolfe (1999), p.107), or both (as in, e.g., Higgins (2004), p.29). We note that although the original Mann and Whitney (1947) alternative $H_1 : F > G$ entails the inference that both the means and medians differ, that is not the usual textbook approach to the issue, nor is it what most users of the MW test want to know.

But note that in the experiment on circle sizes, Bert and Ernie have a problem. The location-shift model cannot apply to both of the two variables, diameter and area. Representing diameter by $d$ and area by $a$, unless $\Delta = 0$, we cannot have both $F(d) = G(d + \Delta_d)$ and $F(a) = G(a + \Delta_a)$. The nonlinear relation of $d$ to $a$ ensures that, for example, unless $\Delta_d = \Delta_a = 0$, the variances of $F$ and $G$ cannot be equal for both variables, $d$ and $a$. Thus, using the location-shift model as $H_1$, the MW test is a pure and exact test of location for at most one of those cases. That is, Bert can validly compare the means or medians of the two groups' circles only if Ernie cannot, and vice versa.

This seems problematic and unsatisfying for the following reason. The MW statistic counts the results of exhaustive ordinal comparisons between the sample from $F$ and the sample from $G$. The value of the MW statistic is altogether insensitive to monotone transformations of the variable under study.
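The invariance just described can be checked directly on a small example. The following is a minimal Python sketch that counts pairwise comparisons as in the definition of $U_X$ and applies the count first to diameters and then to the corresponding areas. The data values and the helper names `mann_whitney_u` and `circle_area` are hypothetical, chosen only for illustration.

```python
import math


def mann_whitney_u(xs, ys):
    """Count pairs with x > y, plus one half for each pair with x == y."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def circle_area(diameter):
    """Area of a circle; a strictly increasing function of a positive diameter."""
    return math.pi * (diameter / 2.0) ** 2


# Hypothetical diameters (in cm) for the two groups, including one tie.
diam_music = [2.1, 2.4, 2.6, 3.0]
diam_silence = [2.0, 2.4, 2.5]

# Ernie's version of the same data, measured as areas.
area_music = [circle_area(d) for d in diam_music]
area_silence = [circle_area(d) for d in diam_silence]

u_diameter = mann_whitney_u(diam_music, diam_silence)
u_area = mann_whitney_u(area_music, area_silence)

print(u_diameter, u_area)  # the two values agree
```

Because only the ordering of each pair enters the count, any strictly increasing rescaling of the measurements would leave the result unchanged in the same way.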
The values of the MW statistic computed for diameter, area, and for even a newly invented third variable such as $(ad)^{1/2}$ must all be equal. In addition, the null distributions of the MW statistic for the two samples must be equivalent to each other, irrespective of which of the three variables we use to measure circle size. So if a test for any one of them is a test of location, the tests on all three of them should likewise be tests of location. In order for that to be the case, it suffices that there exists a monotone transformation, $\varphi$, such that the location-shift model holds on the scale defined by $\varphi$, i.e., $F(\varphi(x)) = G(\varphi(x + \Delta))$ for all $x$, as in the Appendix. This would relax the requirements for the MW test's being a test of location, generalizing the valid applicability of the test to cases where, for all $x$, a) $F(x) > G(x)$, b) $F(x) = G(x)$, or c) $F(x) < G(x)$. The null hypothesis remains $H_0 : \Delta = 0$.

Obviously, this would make the MW a test of location more generally than in cases where the location-shift model on $X$ itself holds. It would extend the MW test's validity throughout the obvious family of variables for which the test statistic is invariant. And, fortunately, if $F$ and $G$ are strictly increasing and intersect everywhere or nowhere, i.e., if a) or b) or c) above holds, then there is indeed such a monotone transform $\varphi$. A proof (due to JML) is in the Appendix. In its constructive development, it bears resemblance to a similar approach taken in Levine (1970).

Thus the Wilcoxon-Mann-Whitney procedure is more nearly distribution-free than it is usually said to be. The only requirement on $F$ and $G$ is that they be either a) everywhere separate or b) identical. We no longer need require that they be parallel over the untransformed scale in which the data are reported. The null hypothesis of the test remains $H_0 : \Delta = 0$ in a transformed (and unknown) variable that renders $F$ and $G$ parallel, i.e., in any variable for which the location-shift model holds.
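As an illustration of a transformation that renders two non-parallel cdfs parallel, consider a hypothetical case (not taken from any data in this paper) in which log-diameters are normally distributed with a common standard deviation in both groups. The sketch below measures the horizontal separation between the two cdfs at several probability levels, on the raw diameter scale, on the log-diameter scale, and on the log-area scale. The distribution parameters and the helper name `quantile_gaps` are assumptions made only for this example.

```python
import math
from statistics import NormalDist

# Assumed model: log-diameter is normal with the same spread in both groups.
silence = NormalDist(mu=0.0, sigma=0.5)
music = NormalDist(mu=0.3, sigma=0.5)


def quantile_gaps(transform, probs):
    """Horizontal distance between the two cdfs at each probability level,
    measured on the scale given by `transform` applied to the diameter."""
    gaps = []
    for p in probs:
        d_silence = math.exp(silence.inv_cdf(p))
        d_music = math.exp(music.inv_cdf(p))
        gaps.append(transform(d_music) - transform(d_silence))
    return gaps


probs = [0.1, 0.25, 0.5, 0.75, 0.9]

# Raw diameters: the separation changes with the probability level.
raw_gaps = quantile_gaps(lambda d: d, probs)
# Log-diameter: constant separation, so the cdfs are parallel.
log_gaps = quantile_gaps(math.log, probs)
# Log-area: also parallel, but with a different constant separation.
log_area_gaps = quantile_gaps(lambda d: math.log(math.pi * d * d / 4.0), probs)

print(raw_gaps)
print(log_gaps)
print(log_area_gaps)
```

Under these assumptions the cdfs are parallel on both logarithmic scales, yet the size of the shift depends on which of the two transformations is used.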
And the original Mann-Whitney alternative $H_1 : F > G$ entails that the shift model holds on some transformed scale.

But although this reveals the generality of the circumstances in which the MW is a valid test of location, it also leads to a problem in estimating the magnitude of the difference between $F$ and $G$. One natural choice has been to estimate the shift parameter, $\Delta$, and a popular estimate is that due to Hodges and Lehmann (1963). But the result in the Appendix shows it to have a severe shortcoming.

Let us denote by $x_i$ and $y_j$ the elements of the samples from $F$ and $G$. The Hodges-Lehmann estimate of $\Delta$ is the median of all the differences $(x_i - y_j)$. The problem is that those differences are calculated in the untransformed scale of the original data, a scale in which the shift model may well not hold. In addition, note that for nonparallel $F$ and $G$ there are infinitely many monotone transformations that render $F$ and $G$ parallel, because there are infinitely many choices for the value of $a_0$ in the proof in the Appendix. Therefore there are infinitely many shift parameters that could imaginably be estimated, even though none of the appropriate rescalings are known. This suggests that estimating "the" shift parameter is not a useful activity.

There is an approach to estimating the difference between $F$ and $G$ based on what Grissom and Kim (2012) call the Probability of Superiority, $P(X_i > Y_j)$. Cliff (1996) used this approach, defining $\delta = P(X_i > Y_j) - P(X_i < Y_j)$, a quantity he calls dominance (see also Agresti (2010)). Clearly this quantity ranges from $-1$ to $+1$ and is invariant over all monotone transformations of the variable in which $X$ and $Y$ are measured. Furthermore it is readily estimated by $\hat{\delta} = (U_X - U_Y)/mn$. Confidence intervals for it are discussed in del Rosal, San Luis and Sanchez-Bruno (2003). Confidence intervals for the Probability of Superiority are discussed in Newcombe (2006a and 2006b).
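The contrast between the two kinds of effect size can be seen on a small example. The sketch below computes the Hodges-Lehmann estimate (the median of all pairwise differences) and the dominance estimate $\hat{\delta}$ for a set of diameters and for the same circles measured as areas. The data and the function names `hodges_lehmann` and `dominance` are hypothetical. The dominance is computed by direct counting of pairs, which gives the same value as $(U_X - U_Y)/mn$.

```python
import math
from statistics import median


def hodges_lehmann(xs, ys):
    """Median of all pairwise differences x - y."""
    return median([x - y for x in xs for y in ys])


def dominance(xs, ys):
    """Proportion of pairs with x > y minus proportion with x < y."""
    wins = sum(1 for x in xs for y in ys if x > y)
    losses = sum(1 for x in xs for y in ys if x < y)
    return (wins - losses) / (len(xs) * len(ys))


# Hypothetical diameters (in cm) and the corresponding areas.
diam_music = [2.1, 2.4, 2.6, 3.0]
diam_silence = [2.0, 2.4, 2.5]
area_music = [math.pi * d * d / 4.0 for d in diam_music]
area_silence = [math.pi * d * d / 4.0 for d in diam_silence]

hl_diameter = hodges_lehmann(diam_music, diam_silence)
hl_area = hodges_lehmann(area_music, area_silence)
dom_diameter = dominance(diam_music, diam_silence)
dom_area = dominance(area_music, area_silence)

print(hl_diameter, hl_area)    # the shift estimate depends on the scale
print(dom_diameter, dom_area)  # the dominance estimate does not
```

The shift estimate is tied to the scale on which the differences are formed, whereas the dominance estimate depends only on the ordering of the pairs.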
The virtue of adopting one of these as the effect size is that they are unaffected by any monotone transformation applied to both $X$ and $Y$. Discussion of the power of the MW test should refer to power against non-null values of one of these quantities but not to differences between means or medians. Different choices of the monotonic transformation that renders $F$ and $G$ parallel can change the latter two effect size measures. The Probability of Superiority and the dominance, however, remain stable under all such transformations.

We note also that the sort of logic we apply here to generalizing the conditions under which the Wilcoxon-Mann-Whitney procedure is valid can be extended to the Kruskal-Wallis test for three or more groups. However, the specification of the sufficient conditions on the cdf's that permits them to be rendered parallel becomes quite complex when the number of cdf's grows beyond two (Levine, 1970).

Appendix: Transformations Rendering Two Curves Parallel

Let $F$ and $G$ be increasing continuous functions on $\mathbb{R}$ with range $(0, 1)$, and suppose $G < F$. We want to show there is a homeomorphism $\varphi : \mathbb{R} \to \mathbb{R}$ such that the graphs of $F \circ \varphi$ and $G \circ \varphi$ are horizontal translates of each other, i.e., such that for all $x$,

$$(F \circ \varphi)(x) = (G \circ \varphi)(x + \Delta)$$

for some constant $\Delta$. We will define $\varphi$ as the limit of an inductively defined sequence of homeomorphisms.

Define a sequence of nonnegative real numbers as follows. Fix a real number $a_0$. (The formulas below are written out for $a_0 = 0$, so that $a_1 = G^{-1}(F(0))$, as in Figure 1.) Assuming $a_i$ has been defined, let $a_{i+1} = G^{-1}(F(a_i))$. Then $\{a_i\}$ is increasing since $G < F$. Moreover, we claim $\lim_{i \to \infty} a_i = \infty$. For if not, $\{a_i\}$ is bounded and hence has a limit $a$ by the completeness axiom. Then, since $F$ and $G$ are continuous,

$$F(a) = \lim_{i \to \infty} F(a_i) = \lim_{i \to \infty} G(a_{i+1}) = G(a),$$

contradicting $F > G$.

Let $\varphi_0 : \mathbb{R} \to \mathbb{R}$ be the identity transformation. Define a map $\varphi_1 : \mathbb{R} \to \mathbb{R}$ by

$$\varphi_1(x) = \begin{cases} x, & \text{if } x < a_1, \\ G^{-1}(F(x - a_1)), & \text{if } a_1 \le x \le 2a_1, \\ x + a_2 - 2a_1, & \text{if } x > 2a_1. \end{cases}$$
Note that on each of the individual intervals above, $\varphi_1$ is continuous and increasing. See Figures 1, 2, and 3. Also,

$$G^{-1}(F(a_1 - a_1)) = G^{-1}(F(0)) = a_1$$

and

$$G^{-1}(F(2a_1 - a_1)) = G^{-1}(F(a_1)) = a_2 = 2a_1 + (a_2 - 2a_1).$$

Thus $\varphi_1$ is continuous at $a_1$ and $2a_1$. Hence $\varphi_1$ is an increasing homeomorphism $\mathbb{R} \to \mathbb{R}$. Finally, if $x \in [0, a_1]$, then

$$F(\varphi_1(x)) = F(x) = G(G^{-1}(F(x))) = G(\varphi_1(x + a_1)).$$

Thus the graph of $G \circ \varphi_1$ on $[a_1, 2a_1]$ is a translate (by $a_1$) of the graph of $F \circ \varphi_1$ on $[0, a_1]$.

Now suppose we've defined an increasing homeomorphism $\varphi_n : \mathbb{R} \to \mathbb{R}$ with the properties that

1. $\varphi_n(k a_1) = a_k$ for $k = 1, 2, \ldots, n + 1$.
2. $F(\varphi_n(x)) = G(\varphi_n(x + a_1))$ for $x \in [0, n a_1]$.

Observe that $\varphi_1$ satisfies these properties with $n = 1$. Now define $\varphi_{n+1} : \mathbb{R} \to \mathbb{R}$ by

$$\varphi_{n+1}(x) = \begin{cases} \varphi_n(x), & \text{if } x < (n + 1)a_1, \\ G^{-1}(F(\varphi_n(x - a_1))), & \text{if } (n + 1)a_1 \le x \le (n + 2)a_1, \\ x + a_{n+2} - (n + 2)a_1, & \text{if } x > (n + 2)a_1. \end{cases}$$

Note that on each of the individual intervals above, $\varphi_{n+1}$ is continuous and increasing. Also,

$$\begin{aligned} \varphi_{n+1}((n + 1)a_1) &= G^{-1}(F(\varphi_n((n + 1)a_1 - a_1))) \\ &= G^{-1}(F(\varphi_n(n a_1))) \\ &= G^{-1}(F(a_n)) \\ &= a_{n+1} = \varphi_n((n + 1)a_1) \end{aligned}$$

and

$$\begin{aligned} \varphi_{n+1}((n + 2)a_1) &= G^{-1}(F(\varphi_n((n + 2)a_1 - a_1))) \\ &= G^{-1}(F(\varphi_n((n + 1)a_1))) \\ &= G^{-1}(F(a_{n+1})) \\ &= a_{n+2} = (n + 2)a_1 + [a_{n+2} - (n + 2)a_1]. \end{aligned}$$

This shows that $\varphi_{n+1}$ is continuous at $(n + 1)a_1$ and $(n + 2)a_1$, hence is an increasing homeomorphism $\mathbb{R} \to \mathbb{R}$. Since $\varphi_n$ satisfies property 1 above and $\varphi_{n+1}$ agrees with $\varphi_n$ on $(-\infty, (n + 1)a_1]$, the second calculation shows that $\varphi_{n+1}$ also satisfies property 1 with $n + 1$ replacing $n$. Similarly, since $\varphi_{n+1}$ and $\varphi_n$ agree on $(-\infty, (n + 1)a_1]$, and since $\varphi_n$ satisfies property 2, we have that

$$F(\varphi_{n+1}(x)) = G(\varphi_{n+1}(x + a_1))$$

for $x \in [0, n a_1]$. Now suppose that $x \in [n a_1, (n + 1)a_1]$. Then

$$\begin{aligned} F(\varphi_{n+1}(x)) &= F(\varphi_n(x)) && \text{(by the definition of } \varphi_{n+1}\text{)} \\ &= G(G^{-1}(F(\varphi_n(x)))) \\ &= G(\varphi_{n+1}(x + a_1)) && \text{(by the definition of } \varphi_{n+1}\text{)}. \end{aligned}$$

Thus $\varphi_{n+1}$ satisfies property 2 with $n + 1$ replacing $n$.
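The inductive step can be traced numerically. On each interval $[n a_1, (n + 1)a_1]$ the maps defined so far amount to applying $x \mapsto G^{-1}(F(x))$ once for every completed interval of length $a_1$. The sketch below does this for one particular pair of curves, chosen here purely for convenience and not taken from the paper: $G$ is the logistic cdf and $F = G \circ k$ with $k(x) = x + 1 + \tfrac{1}{2}\tanh(x)$, so that $G^{-1}(F(x)) = k(x)$ in closed form, $G < F$, and the curves are not parallel. It takes $a_0 = 0$ and covers only $x \ge 0$. The names `k`, `A1`, and `phi` are hypothetical.

```python
import math


def G(x):
    """Logistic cdf (an assumed choice of G for this illustration)."""
    return 1.0 / (1.0 + math.exp(-x))


def k(x):
    """Plays the role of G^{-1}(F(x)); strictly increasing with k(x) > x."""
    return x + 1.0 + 0.5 * math.tanh(x)


def F(x):
    """F = G composed with k, so that G < F everywhere."""
    return G(k(x))


A1 = k(0.0)  # a_1 = G^{-1}(F(a_0)) with a_0 = 0


def phi(x):
    """Value shared by all phi_n with (n + 1) * a_1 >= x, for x >= 0."""
    if x < 0:
        raise ValueError("this sketch only covers x >= 0")
    n = int(x // A1)        # number of completed intervals of length a_1
    value = x - n * A1      # position within [0, a_1)
    for _ in range(n):
        value = k(value)    # one application of G^{-1}(F(.)) per interval
    return value


grid = [0.0, 0.3, 1.0, 1.7, 2.5, 4.2]
# Largest violation of F(phi(x)) = G(phi(x + a_1)) on the grid.
max_gap = max(abs(F(phi(x)) - G(phi(x + A1))) for x in grid)
print(max_gap)  # zero up to floating-point rounding
```

In this example `phi(2 * A1)` returns $k(a_1) = a_2$, in line with property 1, and the translation identity of property 2 holds on the grid up to rounding error.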
By induction, we have therefore constructed an infinite sequence $\{\varphi_n\}$ of increasing homeomorphisms with the above properties. Define $\varphi_+ : \mathbb{R} \to \mathbb{R}$ by

$$\varphi_+(x) = \lim_{n \to \infty} \varphi_n(x).$$

For any $x$, this limit exists since the terms of the sequence are actually equal for $n$ sufficiently large. Moreover, $\varphi_+$ is continuous since $\varphi_+$ agrees with the continuous function $\varphi_n$ on a neighborhood of $x$ for $n$ sufficiently large. It follows that $\varphi_+ : \mathbb{R} \to \mathbb{R}$ is an increasing homeomorphism. Moreover, it follows from property 2 that

$$F(\varphi_+(x)) = G(\varphi_+(x + a_1))$$

for $x \in [0, \infty)$. A similar argument proceeding in the negative direction will produce a homeomorphism $\varphi : \mathbb{R} \to \mathbb{R}$ such that

$$F(\varphi(x)) = G(\varphi(x + a_1))$$

for $x \in (-\infty, \infty)$, that is, the required property holds with $\Delta = a_1$.

Different choices of $a_0$ will, in general, give rise to different homeomorphisms $\varphi$.

References

Agresti, A. (2010), Analysis of Ordinal Categorical Data (2nd ed.), New York: Wiley.
Cliff, N. (1993), Dominance Statistics: Ordinal Analyses to Answer Ordinal Questions, Psychological Bulletin, 114, 494–509.
Conover, W. J. (1999), Practical Nonparametric Statistics (3rd ed.), New York: Wiley.
del Rosal, A. B., San Luis, C., & Sanchez-Bruno, A. (2003), Dominance Statistics: A Simulation Study on the d Statistic, Quality & Quantity, 37, 303–316.
Gibbons, J. D., & Chakraborti, S. (2011), Nonparametric Statistical Inference (5th ed.), New York: Marcel Dekker.
Grissom, R. J., & Kim, J. J. (2005), Effect Sizes for Research (2nd ed.), New York: Routledge/Taylor & Francis.
Hettmansperger, T. P., & McKean, J. W. (2011), Robust Nonparametric Statistical Methods (2nd ed.), Boca Raton, FL: Chapman & Hall/CRC.
Higgins, J. J. (2004), Introduction to Modern Nonparametric Statistics, Belmont, CA: Duxbury.
Hodges, J. L., & Lehmann, E. L. (1963), Estimates of Location Based on Rank Tests, Annals of Mathematical Statistics, 34, 598–611.
Hollander, M., & Wolfe, D. A. (1999), Nonparametric Statistical Methods (2nd ed.), New York: Wiley.
Lehmann, E. L. (1986), Testing Statistical Hypotheses (2nd ed.), New York: Wiley.
Lehmann, E. L. (1998), Nonparametrics: Statistical Methods Based on Ranks, Upper Saddle River, NJ: Prentice-Hall.
Levine, M. V. (1970), Transformations That Render Curves Parallel, Journal of Mathematical Psychology, 7, 410–443.
Mann, H. B., & Whitney, D. R. (1947), On a Test of Whether One of Two Random Variables is Stochastically Larger Than the Other, Annals of Mathematical Statistics, 18, 50–60.
Neuhäuser, M. (2012), Nonparametric Statistical Tests: A Computational Approach, Boca Raton, FL: Chapman & Hall/CRC.
Newcombe, R. G. (2006a), Confidence Intervals for an Effect Size Measure Based on the Mann-Whitney Statistic. Part 1: General Issues and Tail-Area-Based Methods, Statistics in Medicine, 25, 543–557.
Newcombe, R. G. (2006b), Confidence Intervals for an Effect Size Measure Based on the Mann-Whitney Statistic. Part 2: Asymptotic Methods and Evaluation, Statistics in Medicine, 25, 559–573.
Sprent, P., & Smeeton, N. C. (2007), Applied Nonparametric Statistical Methods (4th ed.), Boca Raton, FL: Chapman & Hall/CRC.

Figure Captions

Figure 1. Consider two continuous, monotonic curves $0 \le G(x) < F(x) \le 1$ on the real line. Define $a_1 = G^{-1}(F(0))$ and $a_2 = G^{-1}(F(a_1))$.

Figure 2. The interval $[a_1, a_2]$ is first transformed by $\varphi_1$ so that $G$ (in yellow) on the interval $[a_1, 2a_1]$ is a translation of $F$ (in red) on the interval $[0, a_1]$. Note $F$ (in green) on this interval, above $F(a_1)$, is also affected.

Figure 3. The transformed functions are now continuous only up to $2a_1$. To restore full continuity, $F$ and $G$ are redefined on the interval $[2a_1, \infty)$ as $F(x + (a_2 - 2a_1))$ and $G(x + (a_2 - 2a_1))$, producing $F \circ \varphi_1$ and $G \circ \varphi_1$. See text for more details.