NONPARAMETRIC TESTS OF INDEPENDENCE AND DISPERSION
EXTENSIONS TO A COLLECTION OF SAS MACROS FOR EXACT NONPARAMETRICS

Lars Pralle (1), Thomas Bregenzer (2), Olaf Gefeller (1), Edgar Brunner (1)

(1) Abteilung Medizinische Statistik, Universität Göttingen
(2) Institut für Biometrie und Epidemiologie, Tierärztliche Hochschule Hannover

1 Introduction

Nonparametric methods are frequently used in many fields of statistical analysis (e.g. biostatistics). In the NPAR1WAY and UNIVARIATE procedures, SAS provides a selection of useful nonparametric tests with decision rules based on approximations for large sample sizes (mostly the normal approximation). In practice, however, small sample sizes are often encountered and the validity of the approximations may be doubtful, especially when ties occur in small samples. In these situations, knowledge of the exact distribution of the nonparametric test statistic is essential to ensure the validity of the statistical analysis.

At the 1992 SEUGI conference in Vienna, Bregenzer, Gefeller and Brunner [2] presented an easy-to-handle SAS macro package covering many standard testing problems in one and two sample location designs. The exact distributions of the test statistics are computed by shift algorithms (see e.g. [9], [14]). These algorithms keep the required computing time and computer resources reasonable as sample sizes grow. The design of this macro collection (now called NPAR) allows it to be extended to other statistical settings.

The existing collection provided tests for one and two sample location designs, namely the Sign test and the Wilcoxon Signed Rank test for one sample problems, the Wilcoxon-Mann-Whitney test (extended to repeated measurements per case) for independent two sample location problems, and, for paired two sample problems, the Wilcoxon Signed Rank test (extended to repeated measurements per case) and a two sample test with repeated measurements in an unbalanced design.
In the extended version of NPAR, the following new situations are covered:

• correlation of two variables
  - Kendall's $\tau$ and a test of association based on it
• two sample scale problems
  - Ansari-Bradley test
  - Siegel-Tukey test

The new macros are embedded within the existing architecture of NPAR and use, as far as possible, the same symbolic names. The user interface was designed to make the macros easy to work with for anyone familiar with SAS procedures. In the following, we briefly recall the features of the macro collection and describe the syntax of the new macros.

2 Notation

• $X_i$, $Y_i$ are random variables (rv); $F$, $G$ denote the corresponding cumulative distribution functions (cdf).
• Limiting processes, for example $X_n \to X$, are always meant as $n \to \infty$.
• $T_n \xrightarrow{d} N(0,1)$ means convergence in distribution under $H_0$ to $N(0,1)$.
• iid means independent identically distributed.
• w.l.o.g. means without loss of generality.
• $\#\{\ldots\}$ denotes the number of elements of a set.

3 Old Macros

The spectrum of nonparametric methods covered by the former version of NPAR is outlined below.

• One Sample Problems
  - Sign test (module %SIGN1)
  - Wilcoxon Signed Rank test (module %WILC1)
• Independent Two Sample Problems
  - Conventional Wilcoxon-Mann-Whitney test (module %WILC21)
  - Extension of the Wilcoxon-Mann-Whitney test to deal with repeated observations per case [3, 5] (module %WILC21)
• Paired Two Sample Problems
  - Extended Wilcoxon rank sum test with repeated measurements per case (module %WILC20)
  - Two sample test with repeated measurements in an unbalanced design [4] (module %BROE20)

4 New Macros

4.1 Two-Sample Dispersion Problems

4.1.1 Ansari-Bradley Test

• Model
\[
X_{1i} = \sigma_1 \varepsilon_i + \mu, \quad i = 1, \ldots, m, \qquad
X_{2j} = \sigma_2 \varepsilon_{j+m} + \mu, \quad j = 1, \ldots, n,
\]
  $\varepsilon_j$ iid rv with median 0. Note that the two groups of rv are assumed to have the same location parameter $\mu$.
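• Effect parameter
\[
\gamma := \frac{\sigma_2}{\sigma_1}
\]
  which is to be interpreted as follows:
    $\gamma < 1$: the distribution of the $X_{1i}$ is broader,
    $\gamma = 1$: both distributions have the same dispersion,
    $\gamma > 1$: the distribution of the $X_{2j}$ is broader.

• Hypothesis
  $H_0: \gamma = 1$

• Statistic and Distribution
  To compute the Ansari-Bradley statistic, special scores are attributed to the observations. Let $Z = (X_{11}, \ldots, X_{1m}, X_{21}, \ldots, X_{2n})$ denote the combined samples and $Z_{(1)} \le \ldots \le Z_{(N)}$ the ordered combined sample. The scores $A_i$ are

\[
\begin{array}{l|cccccccc}
\text{Sample } Z_{(i)} & Z_{(1)} & Z_{(2)} & \cdots & Z_{(N/2)} & Z_{(N/2+1)} & \cdots & Z_{(N-1)} & Z_{(N)} \\
\text{Scores } A_i & 1 & 2 & \cdots & N/2 & N/2 & \cdots & 2 & 1
\end{array}
\]

  if $N = m + n$ is even, and

\[
\begin{array}{l|ccccccccc}
\text{Sample } Z_{(i)} & Z_{(1)} & Z_{(2)} & \cdots & Z_{((N-1)/2)} & Z_{((N+1)/2)} & Z_{((N+3)/2)} & \cdots & Z_{(N-1)} & Z_{(N)} \\
\text{Scores } A_i & 1 & 2 & \cdots & (N-1)/2 & (N+1)/2 & (N-1)/2 & \cdots & 2 & 1
\end{array}
\]

  if $N = m + n$ is odd.

  The test statistic of the Ansari-Bradley test is defined as the sum of the scores in one (w.l.o.g. the first) sample [1]:
\[
W = \sum_{i=1}^{m} A_{1i}
\]

  The exact distribution of $W$ can be obtained from the following recurrence relation for the number $r_{m,n}(w)$ of arrangements with $W = w$:
\[
r_{m,n}(w) = r_{m-1,n}(w - k) + r_{m,n-1}(w)
\quad \text{with} \quad k = \left[ \frac{N + 1}{2} \right],
\]
  $[\,\cdot\,]$ denoting the integer function. The probability of $W = w$, given $m$ observations in the first sample and $n$ observations in the second sample, is then
\[
P_{m,n}(W = w) = \frac{r_{m,n}(w)}{\binom{m+n}{m}}
\]

  For large samples the following normal approximation is used:
\[
\frac{W - E(W)}{\sqrt{Var(W)}} \xrightarrow{d} N(0,1),
\]
  where
\[
E(W) =
\begin{cases}
\dfrac{m(m+n+2)}{4} & \text{if } m+n \text{ is even} \\[2ex]
\dfrac{m(m+n+1)^2}{4(m+n)} & \text{if } m+n \text{ is odd}
\end{cases}
\]
  and
\[
Var(W) =
\begin{cases}
\dfrac{mn(m+n+2)(m+n-2)}{48(m+n-1)} & \text{if } m+n \text{ is even} \\[2ex]
\dfrac{mn(m+n+1)\left[3 + (m+n)^2\right]}{48(m+n)^2} & \text{if } m+n \text{ is odd}
\end{cases}
\]
  if $F$ is continuous, and
\[
Var(W) =
\begin{cases}
\dfrac{mn\left[16 \sum_{j=1}^{g} t_j r_j^2 - (m+n)(m+n+2)^2\right]}{16(m+n)(m+n-1)} & \text{if } m+n \text{ is even} \\[2ex]
\dfrac{mn\left[16(m+n) \sum_{j=1}^{g} t_j r_j^2 - (m+n+1)^4\right]}{16(m+n)^2(m+n-1)} & \text{if } m+n \text{ is odd}
\end{cases}
\]
  if ties are present. In the latter case $g$ denotes the number of tied groups of observations in the combined sample, the $j$-th group having length $t_j$ and being assigned the average score $r_j$. Untied observations are considered as ties of length 1.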
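The recurrence and the moment formulas above can be checked numerically. The following is a minimal Python sketch, not the NPAR implementation (which is written in SAS/IML and the SAS macro language). The function names `ab_count`, `ab_pmf` and `ab_moments` are hypothetical, and ties are assumed to be absent.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def ab_count(m, n, w):
    """r_{m,n}(w): number of arrangements whose first-sample score sum is w."""
    if m < 0 or n < 0 or w < 0:
        return 0
    if m == 0 and n == 0:
        return 1 if w == 0 else 0
    k = (m + n + 1) // 2  # k = [(N + 1) / 2], the largest score for N = m + n
    return ab_count(m - 1, n, w - k) + ab_count(m, n - 1, w)


def ab_pmf(m, n):
    """Exact null distribution of W as a dict {w: P(W = w)} (no ties assumed)."""
    total = comb(m + n, m)
    w_max = m * ((m + n + 1) // 2)  # crude upper bound for W
    pmf = {}
    for w in range(w_max + 1):
        count = ab_count(m, n, w)
        if count:
            pmf[w] = Fraction(count, total)
    return pmf


def ab_moments(m, n):
    """E(W) and Var(W) under H0 for continuous F (no ties)."""
    N = m + n
    if N % 2 == 0:
        mean = Fraction(m * (N + 2), 4)
        var = Fraction(m * n * (N + 2) * (N - 2), 48 * (N - 1))
    else:
        mean = Fraction(m * (N + 1) ** 2, 4 * N)
        var = Fraction(m * n * (N + 1) * (3 + N ** 2), 48 * N ** 2)
    return mean, var
```

For m = 1 and n = 2 the scores are 1, 2, 1, so `ab_pmf(1, 2)` gives P(W = 1) = 2/3 and P(W = 2) = 1/3. Exact rational arithmetic is used so that the mean and variance of the exact distribution can be compared with `ab_moments` without rounding error.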
• Program Call

    %ANSBRA ( <DATA = dataset,>
              VAR = variable
              <, BY = by variable>
              <, MIN_ASYM = lower bound for asymptotics>
              <, MAX_EXAC = upper bound for exact calc> );

  where
    dataset                     = names the SAS data set containing the data to be analysed, default is _LAST_ (as in SAS)
    variable                    = names the response variable to be tested (as in SAS)
    by variable                 = leads to separate analyses on observations in groups defined by the specified by variable (as in SAS)
    lower bound for asymptotics = smallest sample size for asymptotic results, default is 10
    upper bound for exact calc  = largest sample size for exact results, default is 100

4.1.2 Siegel-Tukey Test

• Model
\[
X_{1i} = \sigma_1 \varepsilon_i + \mu, \quad i = 1, \ldots, m, \qquad
X_{2j} = \sigma_2 \varepsilon_{j+m} + \mu, \quad j = 1, \ldots, n,
\]
  $\varepsilon_j$ iid rv with median 0.

• Effect parameter
\[
\gamma := \frac{\sigma_2}{\sigma_1}
\]
  which is to be interpreted as follows:
    $\gamma < 1$: the distribution of the $X_{1i}$ is broader,
    $\gamma = 1$: both distributions have the same dispersion,
    $\gamma > 1$: the distribution of the $X_{2j}$ is broader.

• Hypothesis
  $H_0: \gamma = 1$

• Statistic and Distribution
  Let $Z = (X_{11}, \ldots, X_{1m}, X_{21}, \ldots, X_{2n})$ denote the combined samples and $Z_{(1)} \le \ldots \le Z_{(N)}$ the ordered combined sample. The scores for the Siegel-Tukey test are assigned according to the following scheme:

\[
\begin{array}{l|ccccccccc}
\text{Sample } Z_{(i)} & Z_{(1)} & Z_{(2)} & Z_{(3)} & Z_{(4)} & \cdots & Z_{(N-3)} & Z_{(N-2)} & Z_{(N-1)} & Z_{(N)} \\
\text{Scores } S_i & 1 & 4 & 5 & 8 & \cdots & 7 & 6 & 3 & 2
\end{array}
\]

  if $N = m + n$ is even. If $N$ is odd, the median of the combined sample is dropped and the same procedure is applied to the reduced sample.

  The test statistic of the Siegel-Tukey test is defined as the sum of the scores in one (w.l.o.g. the first) sample [12]:
\[
S = \sum_{i=1}^{m} S_{1i}
\]
  As this test is based on a permutation of the Wilcoxon scores, $S$ has the same distribution under $H_0$ as the statistic of the Wilcoxon-Mann-Whitney test, as implemented in the module %WILC21 (cf. [6]).
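The scoring scheme above can be illustrated with a short Python sketch. It is not the %SIETUK macro itself; the function names `siegel_tukey_scores` and `siegel_tukey_statistic` are hypothetical, and the sketch assumes there are no ties in the combined sample.

```python
def siegel_tukey_scores(N):
    """Siegel-Tukey scores for the ordered combined sample, N even."""
    if N % 2:
        raise ValueError("N must be even; drop the median first if N is odd")
    scores = [0] * N
    if N == 0:
        return scores
    lo, hi = 0, N - 1
    scores[lo] = 1          # score 1 goes to the smallest observation
    lo += 1
    rank = 2
    from_top = True         # then two from the top, two from the bottom, ...
    while rank <= N:
        for _ in range(2):
            if rank > N:
                break
            if from_top:
                scores[hi] = rank
                hi -= 1
            else:
                scores[lo] = rank
                lo += 1
            rank += 1
        from_top = not from_top
    return scores


def siegel_tukey_statistic(x1, x2):
    """Sum of Siegel-Tukey scores in the first sample (no ties assumed)."""
    pooled = sorted([(v, 1) for v in x1] + [(v, 2) for v in x2])
    if len(pooled) % 2:
        # N odd: the median of the combined sample is dropped
        del pooled[len(pooled) // 2]
    scores = siegel_tukey_scores(len(pooled))
    return sum(s for (_, group), s in zip(pooled, scores) if group == 1)
```

For N = 8 the function reproduces the pattern 1, 4, 5, 8, 7, 6, 3, 2. Because low scores go to the extremes, a small value of S indicates that the first sample is the more dispersed one.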
• Program Call

    %SIETUK ( <DATA = dataset,>
              VAR = variable
              <, BY = by variable>
              <, MIN_ASYM = lower bound for asymptotics>
              <, MAX_EXAC = upper bound for exact calc> );

  where
    dataset                     = names the SAS data set containing the data to be analysed, default is _LAST_ (as in SAS)
    variable                    = names the response variable to be tested (as in SAS)
    by variable                 = leads to separate analyses on observations in groups defined by the specified by variable (as in SAS)
    lower bound for asymptotics = smallest sample size for asymptotic results, default is 10
    upper bound for exact calc  = largest sample size for exact results, default is 100

4.2 Two-Sample Correlation Problems

4.2.1 Kendall's $\tau$

• Model
  $(X_i, Y_i)$ iid random vectors

• Effect parameter
  $\tau$, which is to be interpreted as follows:
    $\tau > 0$: the random variables $X$ and $Y$ are positively correlated,
    $\tau = 0$: the random variables $X$ and $Y$ are uncorrelated,
    $\tau < 0$: the random variables $X$ and $Y$ are negatively correlated.

• Hypothesis
  $H_0: \tau = 0$

• Statistic and Distribution
  $\tau$ is a measure of association between two rv. It can be visualised by the number of increasing and decreasing lines between two pairs of observations $(X_i, Y_i)$, $(X_j, Y_j)$. Thus it is related to the number
\[
S = \#\{(i,j): i < j,\ (X_i - X_j)(Y_i - Y_j) > 0\} - \#\{(i,j): i < j,\ (X_i - X_j)(Y_i - Y_j) < 0\},
\]
  i.e. the difference between the number of "concordant pairs" and the number of "discordant pairs". Kendall's $\tau$ can then be written as
\[
\hat{\tau} = \frac{2S}{n(n-1)}
\]
  $\hat{\tau}$ and $S$ are equivalent statistics (see e.g. [7]). The exact distribution of $S$ (cf. [13]) is given by
\[
P(S = s) = \frac{\pi_n(s)}{n!},
\]
  where the $\pi_n$ can be computed successively:
\[
\pi_{n+1}(s) = \pi_n(s - n) + \pi_n(s - n + 2) + \pi_n(s - n + 4) + \ldots + \pi_n(s + n - 2) + \pi_n(s + n)
\]
  with $\pi_1(0) = 1$, $\pi_1(k) = 0$ for $k \ne 0$.

  For large samples one may use the following normal approximation:
\[
P(S \le s) \approx \Phi\left(\frac{s}{\sigma}\right),
\]
  where $\Phi$ is the cdf of the standard normal distribution and
\[
\sigma = \sqrt{Var(S)} = \sqrt{\tfrac{1}{18}\, n(n-1)(2n+5)}
\]
  When ties of length $t$ in one sample and of length $u$ in the other sample occur, the following modified expression is used (cf. [8]):
\[
\begin{aligned}
\sigma^2 ={}& \frac{1}{18} \Big\{ n(n-1)(2n+5) - \sum t(t-1)(2t+5) - \sum u(u-1)(2u+5) \Big\} \\
&+ \frac{1}{9n(n-1)(n-2)} \Big\{ \sum t(t-1)(t-2) \Big\} \Big\{ \sum u(u-1)(u-2) \Big\} \\
&+ \frac{1}{2n(n-1)} \Big\{ \sum t(t-1) \Big\} \Big\{ \sum u(u-1) \Big\}
\end{aligned}
\]

• Program Call

    %KENTAU ( <DATA = dataset,>
              VAR1 = first variable,
              VAR2 = second variable
              <, BY = by variable>
              <, MIN_ASYM = lower bound for asymptotics>
              <, MAX_EXAC = upper bound for exact calc> );

  where
    dataset                     = names the SAS data set containing the data to be analysed, default is _LAST_ (as in SAS)
    first variable              = names the first response variable to be tested
    second variable             = names the second response variable to be tested
    by variable                 = leads to separate analyses on observations in groups defined by the specified by variable (as in SAS)
    lower bound for asymptotics = smallest sample size for asymptotic results, default is 10
    upper bound for exact calc  = largest sample size for exact results, default is 100

5 Closing Remarks

The increasing popularity of exact nonparametric statistical methods has not yet led to their integration into SAS. Prior to the development of NPAR, the only way of compensating for this deficiency was to use other statistical software for exact nonparametrics (e.g. StatXact). NPAR now allows the user to perform the exact versions of the most commonly used nonparametric tests within SAS. The main advantage of this integrated software solution is that it avoids all problems connected with data transfer to other statistical software systems. The convenient user interface makes NPAR easy to handle for an ordinary SAS user and makes it unnecessary to learn a new software package.
In addition, the implementation of NPAR in SAS/IML and the SAS macro language allows further extensions of the collection, which are planned for the future.

References

[1] A. R. Ansari and R. A. Bradley. Rank-sum tests for dispersions. Ann. Math. Statist., 31:1174-1189, 1960.
[2] T. Bregenzer, O. Gefeller, and E. Brunner. SAS macros for exact methods in non-parametrical statistics. In Proceedings of the SAS European Users Group International Conference, pages 724-735. SAS Institute Inc., Heidelberg, 1992.
[3] E. Brunner and D. Compagnone. Two sample rank tests for repeated observations for small sample sizes. Statistical Software Newsletter, 14:36-42, 1988.
[4] E. Brunner and H. Dette. Rank procedures for the two-factor mixed model. JASA, 87:884-888, 1992.
[5] E. Brunner and N. Neumann. Two-sample rank tests in general models. Biom. J., 28:395-402, 1986.
[6] J. D. Gibbons. Nonparametric Statistical Inference. McGraw-Hill, New York, 1971.
[7] J. Hajek and Z. Sidak. Theory of Rank Tests. Academic Press, New York, 1967.
[8] M. G. Kendall. Rank Correlation Methods. Charles Griffin, London, fourth edition, 1975.
[9] N. Neumann. Some procedures for calculating the distributions of elementary nonparametric test statistics. Statistical Software Newsletter, 14:120-126, 1988.
[10] R. H. Randles and D. A. Wolfe. Introduction to the Theory of Nonparametric Statistics. John Wiley, New York, 1979.
[11] SAS Institute Inc., Cary, NC, USA. SAS/STAT User's Guide, Version 6, 4th edition.
[12] S. Siegel and J. W. Tukey. A nonparametric sum of ranks procedure for relative spread in unpaired samples. JASA, 55:429-445, 1960. Corrected: JASA, 56:1005, 1961.
[13] G. P. Sillitto. The distribution of Kendall's $\tau$ coefficient of rank correlation in rankings containing ties. Biometrika, 34:36-40, 1947.
[14] B. Streitberg and J. Rohmel. Exact calculations for permutation and rank tests: an introduction to some recently published algorithms. Statistical Software Newsletter, 12:10-17, 1986.

SAS, SAS/IML, and SAS/STAT are registered trademarks of SAS Institute Inc., Cary, NC, USA. StatXact is a registered trademark of CYTEL Software Corporation, Cambridge, MA 02139, USA.