Full Text: Copyright American Statistical Association Feb 2000

[Headnote] The Wilcoxon-Mann-Whitney test enjoys great popularity among scientists comparing two groups of observations, especially when measurements made on a continuous scale are non-normally distributed. Triggered by different results for the procedure from two statistics programs, we compared the outcomes from 11 PC-based statistics packages. We found that the reported p values ranged from significant to nonsignificant at the 5% level, depending on whether a large-sample approximation or an exact permutation form of the test was used and, in the former case, whether or not a correction for continuity was used and whether or not a correction for ties was made. Some packages also produced pseudo-exact p values, based on the null distribution under the assumption of no ties. A further crucial point is that the variant of the algorithm used for computation by the packages is rarely indicated in the output or documented in the Help facility and the manuals. We conclude that the only accurate form of the Wilcoxon-Mann-Whitney procedure is one in which the exact permutation null distribution is compiled for the actual data.

KEY WORDS: Asymptotic; Continuity correction; Exact permutation test; Statistical software; Ties.

1. INTRODUCTION

The Wilcoxon-Mann-Whitney (WMW) test is very popular in the applied sciences, especially in the life and social sciences, and specifically in the biomedical sciences (Ludbrook and Dudley 1998). It is frequently used as the nonparametric analog of the Student's t test to compare two sets of observations measured on an interval scale when it is supposed that the data are non-normally distributed. It is also used, especially in the social sciences, when the original measurements were made on an ordinal scale. There have been reviews of how accurately microcomputer statistics packages perform the WMW test (Bernhard, Alle, Herbold, and Meyers 1988; Ludbrook 1995).
The purpose of this review is to report our experience of using 11 commercial statistical packages to execute the Wilcoxon-Mann-Whitney test on a genuine set of experimental data. We focus on PC-based software packages for MS-Windows, which are mainly menu driven and widely used by applied scientists from many disciplines. The starting point was a real dataset from a pharmacological experiment with a well-established paradigm. When we discovered that two commercial statistical packages gave very different outcomes from the Wilcoxon-Mann-Whitney test, we sought an explanation. This led to our examining other statistical software packages.

2. THE WILCOXON-MANN-WHITNEY TEST

The literature refers to equivalent tests, formulated in different ways, as the Wilcoxon rank-sum test and the Mann-Whitney U test. These were developed independently by Wilcoxon (1945) and Mann and Whitney (1947). A detailed theoretical treatment of the tests was given by Lehmann (1998). The Wilcoxon rank-sum procedure, including formulations of its variants, was described by Siegel and Castellan (1988) in terms that are comprehensible to nonstatisticians. We have adopted the widely used convention of combining the two versions as the Wilcoxon-Mann-Whitney (WMW) test. It is worth noting that the majority of texts on nonparametric statistics favor the Wilcoxon version of the test, in which the rank sums of the groups are calculated, over the Mann-Whitney version, in which a U statistic is used.

2.1 Nature of the Hypothesis Tested by the WMW Procedure

The nature of the hypothesis depends on whether one adheres to the classical (population) model of inference or to the randomization model. This, in turn, depends on the sampling technique employed. Under the classical (population) model the null hypothesis is that the two populations have the same response distribution, against the alternative that they are different. The WMW test is used to detect "shifted alternatives".
That is, the two population distributions have the same general shape (including dispersions), but one of them is shifted relative to the other by a constant amount under the alternative hypothesis. The statistical inference is generalizable to future, similar experiments in which random sampling is employed.

Under the randomization model, which is the norm in biomedical research (Ludbrook and Dudley 1998), the WMW procedure tests specifically whether there is a difference between randomized groups in terms of their mean ranks. The statistical inference applies only to the actual experiment performed.

Whichever model of inference is appropriate to the experimental design, biomedical investigators are encouraged to postulate a nonspecific alternative hypothesis. That is, they look to two-sided p values rather than one-sided. We concur with this approach.

Wilcoxon described his rank-order test as based on exact permutation, but permutation of differences between mean ranks (or, exactly equivalently, rank sums) so as to reduce the computational difficulties presented by R. A. Fisher's permutation procedures for differences between means (Wilcoxon 1945). However, it is a common misapprehension that the WMW procedure tests for equality of group medians. This is wrong. It tests for equality of group mean ranks and, because of the ranking system employed, mean ranks do not correspond to medians.

2.2 Variants of the WMW Procedure

At least three choices, each with two options, have to be distinguished when calculating the distribution of the WMW test statistic. These were described lucidly by Siegel and Castellan (1988) and in greater depth by Lehmann (1998).

* Large-sample (asymptotic) approximation (normal or χ² distribution) versus exact form (exact permutation distribution)
* Continuity: large-sample approximation with or without correction for continuity
* Ties: large-sample approximation with or without correction for ties

Exact form.
In most cases the exact permutation form of the WMW test should be the preferred method. This is especially so if the experimental groups are small and, whatever their sizes, if they are constructed by randomization rather than random sampling (see Ludbrook and Dudley 1998). Under the assumption of no ties, the form of the exact null distribution of the Wilcoxon rank-sum statistic depends only on the total number of observations in the two groups (N = n1 + n2). It is therefore relatively easy to tabulate the exact distribution for small group sizes, say up to n = 10. If there are tied observations, however, this no longer holds true, and for a given value of N the exact null distribution of the Wilcoxon rank-sum statistic depends also on the number and pattern of tied values within and between the groups (Lehmann 1998). Though there are published tables of exact p values for small groups (for instance, n <= 10; Siegel and Castellan 1988; Lehmann 1998), these are correct only if there are no ties. Software is now available, however, with which the exact permutation version of the WMW test, or a Monte Carlo sampled exact version, can be executed for groups of any size and regardless of the number and pattern of ties.

Large-sample (asymptotic) approximation. What should be regarded as a large sample is quite vague. In view of the restricted tables for the exact permutation version (see above), most investigators are accustomed to using an asymptotic approximation when group sizes exceed 10. The latter is based either on the normal distribution (usually) or on the χ² distribution (sometimes). The only difference is that the χ² distribution gives an intrinsically two-sided outcome, whereas the normal distribution can be employed in a one- or two-sided fashion.

Continuity correction.
When the normal approximation is used, this correction allows for the fact that the normal distribution is continuous whereas the distribution of the WMW statistic is discrete (see Siegel and Castellan 1988). This correction reduces the value of the numerator of the z statistic and therefore renders the outcome more conservative (i.e., gives larger p values).

Ties and corrections for ties. When observations made on an interval scale are transformed into their corresponding ranks, equal (tied) values are assigned the mean rank across the tie. When ties occur only within the first or within the second group, they do not affect the outcome of the large-sample approximation. However, when there is at least one observation in the first group and at least one from the second that share a common rank, the asymptotic version of the WMW procedure is rendered too conservative. A correction for ties reduces the value of the denominator of the z statistic, and so renders the outcome of the WMW procedure less conservative (i.e., gives smaller p values) (Siegel and Castellan 1988).

3. THE EXPERIMENT

3.1 Experimental Protocol

Animals. Male rats (strain: Sprague Dawley; BRL, Switzerland; body weight: 100-140 grams) were housed in groups of four animals (type IV cages) in a temperature-controlled room under artificial illumination (6:00 a.m.-6:00 p.m., lights on) with free access to water and food.

Rotarod test. The rotarod apparatus consists of a rotating cylinder, which is divided into four available rat positions, each six centimeters (cm) in diameter. The cylinder is rotated at a speed of 12 rotations per minute (rpm). The rats are placed singly on the cylinder. One day before the experiment the animals were trained to stay on the rotarod for 300 seconds. Rats that failed to learn the test were excluded from the study. During the test phase the length of time each rat remains on the cylinder (the endurance time) is measured, up to a maximum of 300 seconds.

Treatment.
The animals received a fixed oral dose of a centrally acting muscle relaxant (treatment) or a saline solvent (control). In each experiment, 24 rats were assigned to control or treated groups by restricted randomization. The rotarod endurance time was measured three hours after administration of the active agent or saline, because the maximal effect of the compound was observed at this timepoint.

3.2 Results of the Experiment

These are displayed in Table 1. It is obvious that the data resulting from the experiment could not be analyzed by the Student's t test, whatever transformation were to be employed. It was decided, therefore, to use the WMW procedure to test whether the mean ranks of the two groups were equal. Note that whereas the mean values and mean ranks for endurance times of the two groups are different, the group medians are identical.

[Table 1]

4. RESULTS OF APPLYING THE WMW TEST

Originally, the analysis was done using SigmaStat 2.03 (SPSS Inc., Chicago) and then compared with the results from SYSTAT 9 (SPSS Inc., Chicago). Because the outcomes were entirely different, the analysis was repeated with the following packages: JMP 3.2.5 (SAS Institute Inc., Cary, NC); S-Plus 2000 (MathSoft, Inc., Seattle); STATISTICA 99 Ed. Rel. 5.5 (StatSoft, Inc., Tulsa, OK); UNISTAT 4.53b (Unistat Ltd., London); SPSS 8.0 (SPSS Inc., Chicago); Arcus Quickstat, Biomedical Version 1.2 (Research Solutions, Cambridge, UK); Stata 6.0 (Stata Corporation, College Station, TX); and SAS 6.12 (SAS Institute Inc., Cary, NC). StatXact 4.0 (Cytel Software Corporation, Cambridge, MA) was used to perform the WMW test by exact permutation, the outcome of which was used as the ultimate benchmark of accuracy.
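To make concrete what compiling the exact permutation distribution for the actual data involves, the following Python sketch counts every way of splitting the pooled observations into two groups and compares the result with the distribution that assumes no ties. Table 1 is not reproduced in this text, so the dataset below is an assumption: 12 rats per group, all control times and seven treated times at the 300-second ceiling, and five distinct shorter treated times whose values (20, 60, 100, 160, 270) are hypothetical placeholders. Only their ranks matter, and this tie pattern gives the rank sum of 120 that the article reports. The function names are illustrative and are not taken from any of the reviewed packages.

```python
from fractions import Fraction
from math import comb


def midranks_doubled(values):
    """Return 2 * midrank for each value; doubling keeps tied ranks integral."""
    ordered = sorted(values)
    first, count = {}, {}
    for pos, v in enumerate(ordered, start=1):
        first.setdefault(v, pos)          # first rank position of this value
        count[v] = count.get(v, 0) + 1    # size of the tie block
    # midrank = first + (count - 1) / 2, hence doubled = 2 * first + count - 1
    return [2 * first[v] + count[v] - 1 for v in values]


def exact_two_sided_p(scores, n1, observed):
    """Share of all size-n1 subsets of `scores` whose sum lies at least as far
    from the null mean as `observed` (two-sided exact permutation p value)."""
    n, total = len(scores), sum(scores)
    # ways[k][s] = number of size-k subsets with score sum s
    ways = [[0] * (total + 1) for _ in range(n1 + 1)]
    ways[0][0] = 1
    for r in scores:
        for k in range(n1, 0, -1):        # descending k: each score used once
            for s in range(total, r - 1, -1):
                ways[k][s] += ways[k - 1][s - r]
    # Null mean is n1 * total / n; multiply through by n to stay in integers.
    dev = abs(n * observed - n1 * total)
    extreme = sum(c for s, c in enumerate(ways[n1])
                  if abs(n * s - n1 * total) >= dev)
    return Fraction(extreme, comb(n, n1))


# ASSUMED data (Table 1 is not available here); the five short times are hypothetical.
treated = [20, 60, 100, 160, 270] + [300] * 7
control = [300] * 12

scores = midranks_doubled(treated + control)
observed = sum(scores[:len(treated)])     # doubled rank sum of the treated group

# Exact distribution built from the actual midranks (ties respected).
p_exact = exact_two_sided_p(scores, len(treated), observed)

# "Pseudo-exact" distribution: same rank sum, but ranks 1..N as if no ties existed.
untied = [2 * r for r in range(1, len(scores) + 1)]
p_pseudo = exact_two_sided_p(untied, len(treated), observed)

print(f"rank sum (treated) = {observed / 2}")
print(f"exact p, ties respected = {float(p_exact):.4f}")
print(f"exact p, ties ignored   = {float(p_pseudo):.4f}")
```

Under these assumptions the first p value agrees with the two-sided exact value of .0373 quoted in this review, and the second should fall close to the "exact" values near .089 that some packages report. The subset-counting recursion is used here only because it avoids listing all 2,704,156 splits one by one; it is a convenience of this sketch, not a description of how StatXact or any other package computes the distribution.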
However, before embarking on the comparative study of statistical packages, we carried out the WMW test on the experimental data, using the normal approximation with and without corrections for ties and continuity, performing the ranking by hand, using a hand-held calculator to execute the various formulations given by Siegel and Castellan (1988), and referring the z statistic to computer tables of the normal distribution (StaTable 1.0, Cytel Software Corporation, Cambridge, MA). The results of the study are summarized in Tables 2 and 3.

Calculation by hand: The results were as follows: sum of ranks 120, total rank sum 300.

SigmaStat: The medians were printed for both groups as the essential parameter under investigation, accompanied by a P value of .085. The following text was displayed to interpret the findings: "The differences in the median values among the two groups are not great enough to exclude the possibility that the difference is due to random sampling variability; there is not a statistically significant difference (p = 0.085)".

Comment. There is no reference in the Help file or manual about the method used to execute the test. Moreover, p = .085 does not correspond to any of the hand-worked outcomes.

SYSTAT: The WMW test is handled as a special, two-group case of the Kruskal-Wallis test. The output included the rank sum for each group, the Mann-Whitney U statistic = 42, the chi-square approximation = 5.948 with 1 df, and a corresponding probability of .01473.

Comment. There is no mention in the Help file or manual about corrections for ties or continuity. We infer from the hand-worked outcomes that a correction for ties, but not for continuity, was used.

JMP: The output included the rank sum and rank mean for each group (denoted as score sum and mean), a "2-sample normal approximation" for the rank sum z = 2.398, p = .0165 and a "1-way Test Chi-Square approximation" of 5.948 with 1 df and probability of .0147.

Comment.
It is not explained why the outcomes of using the normal and chi-squared approximations are different. We infer from the hand-worked results that p = .0165 was obtained by making corrections for both continuity and ties, p = .0147 by making a correction only for ties.

S-Plus: Two options are provided in the starting window for the WMW test: "Continuity Correction" and "Use Exact Distribution". If "Continuity Correction" was selected, the output was: rank-sum normal statistic without correction z = 2.4389, p-value = .0147; rank-sum normal statistic with correction z = 2.3983, p-value = .0165.

Comment. On the basis of the hand-worked outcomes it appears that the p values are also corrected for ties, but this is not mentioned explicitly in the output. However, a detailed description of how the WMW procedure is implemented is given in the Help file.

If "Use Exact Distribution" was selected, the message "Warning messages: cannot compute exact p value with ties in: wil.rank.sum(x, y, alternative, exact, correct)" appeared, which may mean that the Dineen-Blakesley (1973) algorithm is used.

Comment. This implies that the S-Plus algorithm is less versatile in performing exact permutation than that of StatXact.

STATISTICA: The rank sum for each group was given, and as the primary result z = 1.732 and p = .083274 were reported. Then an "adjusted" z = 2.4389 with p = .014737 is displayed. Finally, an "exact" p = .088734 is provided.

Comment. In the Help menu it is explained that the "adjusted" z statistic was adjusted for ties, and that for small group sizes exact probabilities are "based on the enumeration of all possible values of the Mann-Whitney U statistic (unadjusted for ties), given the number of observations in the two samples (see Dineen and Blakesley 1973)". Later this is paraphrased as: "To reiterate, the computations for this probability value (for small to moderate sized samples) are based on the assumption of no ties in the data (ranks).
Note that this limitation usually leads to only a small underestimation of the statistical significance of the respective effects (see Siegel 1956)". Our example shows that the "underestimation of the statistical significance" can be considerable. Nothing is mentioned about continuity correction. From the hand-worked output, it appears that the three p values are, respectively, the outcome with no corrections, the outcome after a correction for ties, and the outcome of a simplified "exact" algorithm that is valid only when there are no ties.

UNISTAT: The output included group rank sums and mean ranks, the WMW test statistic and the statistic labeled as corrected for ties, z = -2.4389 with p = .0147, and an "exact" probability of 0.0887. "Difference between Medians = 0" and a 95% confidence interval for differences between medians based on normal approximation was reported as 0 to 137.

Comment. In the Help file it is stated that for a total sample size N < 30 an "exact" significance level is reported (referenced as employing the algorithm by Dineen and Blakesley 1973, though this cannot cope with ties).

[Table 2]

SPSS: The output included rank sums and mean ranks for each group, the WMW test statistic with an associated z = -2.439 and a corresponding asymptotic significance level of p = .015. An "exact" significance probability of .089, marked as not corrected for ties, was also displayed. There is now available an optional module "SPSS Exact Tests", which was developed by the Cytel Software Corporation, the vendor of StatXact, so that the statistical methods provided are very similar. If the "SPSS Exact Tests" module is included in the SPSS license, an "exact" option is offered; when it was chosen, an additional exact (2-tailed) p = .037 was reported.

Comment. The Help menu describes the Mann-Whitney test as: "A nonparametric equivalent to the t test.
Tests whether two independent samples are from the same population" and that "the average rank (is) assigned in the case of ties." But nothing is mentioned about corrections for continuity or exact probabilities. From the hand-worked outcomes, it appears that the first p value results from a correction for ties but not continuity. The second, "exact" p value presumably results from using the Dineen and Blakesley (1973) algorithm, even though this is invalid when there are ties. However, the SPSS Exact Tests module provides the correct outcome.

Arcus Quickstat: The group medians and rank sums, together with the WMW statistic, were displayed. Exact probabilities (one- and two-sided and adjusted for ties) were reported as .0186 and .0373, and a "95.5% confidence interval for differences between medians or means" was reported as 0 to 137 with a "Median difference = 0".

Comment. The Help file explains that the sampling distribution of the WMW statistic is used to calculate exact probabilities and that this can take a long time if there are tied data. Nevertheless, the "exact" probability does result from listing the permutation distribution and so is genuinely exact. It is not indicated how the confidence interval is arrived at, and it contradicts the p values since it includes zero.

Stata: For the WMW test, the rank sums and the "adjusted for ties variance" are reported, and the outcome as z = 2.439 and p = .0147. If the Kruskal-Wallis procedure is used for two groups, a "chi-squared = 3.000 with 1 d.f., p = .0833" is reported.

Comment. There is nothing in the Help file about the statistical algorithm used in the WMW test. In the manuals the normal approximation is described, which includes the case that there are ties. With reference to the Kruskal-Wallis procedure, under Methods and Formulas in the manual it is stated that "Tied values are assigned the average ranks."
However, to judge from the hand-worked outcomes, the p value resulting from the chi-square approximation is not corrected for ties or for continuity. In the manual, the general description of the WMW procedure says correctly that the hypothesis is tested "that two independent samples . . . are from populations with the same distribution . . ." but the example given describes (incorrectly) the outcome of the WMW procedure as that "The results indicate that the medians are not statistically different at . . .".

SAS 6.12: PROC NPAR1WAY reports the mean ranks (denoted as scores) and a message that "Average Scores Were Used for Ties". Then the "normal approximation (with continuity correction of 0.5)" is given with z = 2.39826, p = 0.0165. The "Kruskal-Wallis Test (Chi-Square Approximation)" with p = 0.0147 is also printed. In PROC NPAR1WAY there is also available an "exact" option, which gave a two-sided exact p value of .0373. Furthermore, there is an additional procedure, PROC STATXACT, provided by the Cytel Software Corporation, which is equivalent to StatXact and also results in p = .0373.

Comment. The user may be confused by the many options provided in SAS 6.12, even though these are fully documented in the comprehensive manuals. Note that the "exact" option in PROC NPAR1WAY and PROC STATXACT give identical outcomes, both being based on the permutation distribution.

StatXact: Our dataset was sufficiently small to allow the exact permutation procedure to be followed, rather than a Monte Carlo sampled permutation distribution. Exact one- and two-sided inferences were reported as p = .0186 and p = .0373, respectively. The asymptotic outcome reports the WMW statistic, a standardized z value of 2.439, and a two-sided p = .0147.

Comment. In the Help file there is little information about the statistical methods, but these are described in fine detail in the manual.
By reference to the hand-worked outcomes, a correction for ties is employed in the asymptotic outcome, though this is not made clear in the manual.

Modification of the Dataset: We repeated the calculation for some modifications of our original dataset to illustrate the influence of ties:

* Changing one observation in the control group from 300 to 299: SYSTAT calculated an asymptotic probability of .044 corrected for ties but not for continuity. StatXact reported an exact p value of .0373, which is identical to the value for the unchanged data. This shows very well how much the large-sample approximation is influenced by ties, and the advantage of the exact procedure.
* The seven "300" values of the original data in the treatment group were modified slightly to 297, 298, 299, 301, 302, 303, 304 so that no ties are present: SYSTAT gave an asymptotic probability corrected for ties but not for continuity of .13867 and StatXact an exact p = .1461.

5. GENERAL COMMENTARY

It seemed to us that the WMW procedure was ideally suited to the analysis of our somewhat unusual dataset (Table 1). But from the point of view of nonstatisticians (the majority of bioscientists and biomedical investigators), the results of our empirical study are quite dismaying.

The WMW procedure tests for equality of group mean ranks, not of group medians. This is evident from our experimental data (Table 1). However, by providing group medians or their difference in their outputs, statistics packages such as SigmaStat, Unistat, Stata, and even Arcus QuickStat may mislead investigators into supposing that the p values refer to the hypothesis that group medians are equal. This common misapprehension is not unique to statistics packages. It appears in Siegel and Castellan (1988) and many other elementary texts on statistics.

On theoretical grounds, it is clear that the only infallible way of executing the WMW test is to compile the null distribution of the rank-sum statistic by exact permutation.
This was, in effect, Wilcoxon's (1945) thesis and it provided the theoretical basis for his test. The specialized statistics package, StatXact, executed the WMW procedure in this way. Of the general packages we reviewed, Arcus Quickstat and SAS executed the WMW test in this way and, in the more recent versions of SAS and SPSS, modules are available that are based on StatXact and execute the WMW test by exact permutation. In all these cases, the two-sided outcome was p = .0373.

To change tack, in the case of our experiment, the exact permutation procedure for equality of group means also resulted in p = .0373 (StatXact), though this is not always so. It is up to investigators to decide whether a test for equality of group mean ranks (but not of group medians) is more informative than one for equality of group means.

Three packages claimed to execute the WMW procedure in an "exact" fashion: STATISTICA, UNISTAT, and SPSS. In each case the result was p = .0887 (or .089). The packages refer to Dineen and Blakesley (1973) for the algorithm used to calculate their "exact" form of the WMW. A closer look at calculating the exact distribution (Lehmann 1998) shows that this algorithm relies only on the sizes of the two groups, which is correct only for untied data since the number of ties is not taken into consideration. In our view, to output such "exact" p values in the obvious case of tied data is dangerously misleading and results in no more than pseudo-exact outcomes. It should also be noted that published tables of exact outcomes of the WMW procedure (Siegel and Castellan 1988; Lehmann 1998) are invalid when ties are present.

The several p values provided within the packages are likely to confuse rather than instruct the biomedical investigator (and even the unwary statistician), especially since the formulations of the WMW test which result in the different p values are not clearly defined.
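The four large-sample formulations behind the differing p values can be written out explicitly. The Python sketch below applies the usual textbook normal approximation for the rank sum, with and without the tie correction to the variance and with and without the 0.5 continuity correction. As in any reworking of this example, the data are an assumption, since Table 1 is not reproduced here: 12 rats per group, every control time and seven treated times at 300 seconds, and five distinct shorter treated times whose values are hypothetical placeholders chosen only to give the reported rank sum of 120. The function name and the layout of the result are illustrative.

```python
from collections import Counter
from math import erfc, sqrt


def wmw_normal_variants(group1, group2):
    """Return {(tie_corrected, continuity_corrected): (z, two_sided_p)}
    for the normal approximation to the rank sum of group1."""
    pooled = list(group1) + list(group2)
    n1, n2, n = len(group1), len(group2), len(pooled)

    # Midranks: tied values share the mean of the ranks they occupy.
    ordered = sorted(pooled)
    tie_sizes = Counter(ordered)
    first_pos = {}
    for pos, v in enumerate(ordered, start=1):
        first_pos.setdefault(v, pos)
    midrank = {v: first_pos[v] + (t - 1) / 2 for v, t in tie_sizes.items()}

    w = sum(midrank[v] for v in group1)          # observed rank sum
    mean_w = n1 * (n + 1) / 2                    # null expectation
    var_plain = n1 * n2 * (n + 1) / 12           # variance assuming no ties
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    var_ties = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))

    results = {}
    for ties in (False, True):
        for continuity in (False, True):
            # Continuity correction shrinks the numerator by 0.5.
            numerator = abs(w - mean_w) - (0.5 if continuity else 0.0)
            numerator = max(numerator, 0.0)
            z = numerator / sqrt(var_ties if ties else var_plain)
            p = erfc(z / sqrt(2))                # two-sided normal tail area
            results[(ties, continuity)] = (z, p)
    return results


# ASSUMED data (Table 1 is not available here); the five short times are hypothetical.
treated = [20, 60, 100, 160, 270] + [300] * 7
control = [300] * 12

results = wmw_normal_variants(treated, control)
for (ties, continuity), (z, p) in sorted(results.items()):
    print(f"ties={ties!s:5} continuity={continuity!s:5} z={z:.4f} p={p:.4f}")
```

Under these assumptions the uncorrected variant gives z of about 1.732, the tie-corrected variant z of about 2.439, and the variant with both corrections z of about 2.398, which are the values the packages printed without saying which formulation they had used. None of the four matches the exact permutation result of p = .0373.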
Scientists tend to look for "significant" results from their experiments, so that some may be inclined to select p = .0147 (which results from using the normal approximation with a correction for ties but not for continuity).

[Table 3]

A survey of the type of p values produced by all the reviewed packages is given in Table 2.

6. CONCLUSIONS

We summarize our investigation in the following points and conclude with some recommendations.

1. In general, microcomputer statistics packages provide very inadequate documentation in their manuals and Help files of precisely how the WMW test is executed (see Table 3). As a consequence, the results can be dangerously misleading. It is essential that explicit documentation be given. The user must be in no doubt about which formulations of the WMW test are used.
2. The output of the results should be clear, fully explained, and comprehensible to nonstatisticians.
3. Different microcomputer statistics packages can give very different outcomes for the WMW test (see Table 2).
4. Investigators cannot rely on the popular, general-purpose, microcomputer statistics programs which are reviewed here to provide an accurate outcome from the WMW test. This is because the programs usually use one or more versions of the large-sample (asymptotic) approximation. There are exceptions to this statement: Arcus Quickstat, SAS, and also additional new modules for SPSS and SAS, execute the test by exact permutation. The specialized package StatXact always uses exact permutation.
5. If investigators use a statistics package that we have not reviewed here to execute the WMW test, we strongly recommend that they first analyze their data by hand, or use the example we give here in Table 1, to establish which variant of the test is executed (see Table 2).
6.
If the original data are in ranked form, the WMW procedure is the best available, provided the test is executed by exact permutation (e.g., StatXact, Arcus Quickstat, SAS and special modules in SAS and SPSS).
7. If the original data are in continuous (interval scale) form, but are clearly non-normally distributed or have been acquired by randomization rather than random sampling, a permutation (randomization) test for equality of group means may be a better option than the WMW test for equality of mean ranks. This can be executed by StatXact.

[Received March 1999. Revised October 1999.]

[Reference] REFERENCES

Bernhard, G., Alle, M., Herbold, M., and Meyers, W. (1988), "Investigation on the Reliability of Some Elementary Nonparametric Methods in Statistical Analysis Systems," Statistical Software Newsletter, 14, 19-26.
Dineen, L. C., and Blakesley, B. C. (1973), "Algorithm AS62: A Generator for the Sampling Distribution of the Mann-Whitney U Statistic," Applied Statistics, 22, 269-273.
Lehmann, E. L. (1998), Nonparametrics: Statistical Methods Based on Ranks (revised 1st ed.), Upper Saddle River, NJ: Prentice Hall.
Ludbrook, J. (1995), "Microcomputer Statistics Packages for Biomedical Scientists," Clinical and Experimental Pharmacology and Physiology, 22, 976-986.
Ludbrook, J., and Dudley, H. (1998), "Why Permutation Tests are Superior to t and F Tests in Biomedical Research," The American Statistician, 52, 127-132.
Mann, H. B., and Whitney, D. R. (1947), "On a Test of Whether One of Two Random Variables is Stochastically Larger than the Other," Annals of Mathematical Statistics, 18, 50-60.
Siegel, S., and Castellan, N. J. (1988), Nonparametric Statistics for the Behavioral Sciences (2nd ed.), New York: McGraw-Hill.
Wilcoxon, F. (1945), "Individual Comparison by Ranking Methods," Biometrics, 1, 80-83.

[Author note] Reinhard Bergmann and Will Spooren are Scientists with Novartis Pharma, Department of Research, P.O.
Box CH-4002 Basel, Switzerland (E-mail: reinhard.bergmann@pharma.Novartis.com). John Ludbrook is Professorial Fellow with the University of Melbourne, Department of Surgery, Royal Melbourne Hospital, Parkville, Victoria 3050, Australia. The authors thank SPSS, Inc., Switzerland (SPSS 8.0); Research Solutions, Cambridge, UK (Arcus Quickstat, Biomedical Version 1.2); and Stata Corporation, College Station, TX (Stata 6.0) for providing copies of the software for evaluation.