NONPARAMETRIC TESTS OF INDEPENDENCE AND DISPERSION
EXTENSIONS TO A COLLECTION OF SAS MACROS FOR
EXACT NONPARAMETRICS
Lars Pralle¹, Thomas Bregenzer², Olaf Gefeller¹, Edgar Brunner¹
¹ Abteilung Medizinische Statistik, Universität Göttingen
² Institut für Biometrie und Epidemiologie, Tierärztliche Hochschule Hannover
1
Introduction
Nonparametric methods have become a tool frequently used in many fields of statistical analysis (e. g.
biostatistics). In the NPAR1WAY and UNIVARIATE procedures, SAS provides a selection of useful nonparametric tests with decision rules based on approximations for large sample sizes (mostly the normal
approximation). In practice, however, small sample sizes are often encountered and the validity of the
approximations may be doubtful, especially when ties occur in small samples. In these situations the
knowledge of the exact distribution of the nonparametric test statistic is essential to ensure the validity of
the statistical analysis.
At the 1992 SEUGI conference in Vienna, Bregenzer, Gefeller and Brunner [2] presented an easy-to-handle
SAS macro package which covered many standard testing problems in one- and two-sample location designs.
The exact distributions of the test statistics are computed by shift algorithms (see e. g. [9], [14]). These
algorithms allow the computation for growing sample sizes with a reasonable amount of time and computer
resources.
The design of this macro collection (now called NPAR) allows it to be extended to other statistical settings.
The existing collection provided tests for one and two sample location designs, namely the Sign test,
the Wilcoxon Signed Rank test for one sample problems, the Wilcoxon-Mann-Whitney test (extended to
repeated measurements per case) for independent two sample location problems, the Wilcoxon Signed Rank
test (extended to repeated measurements per case), and a two sample test with repeated measurements
in an unbalanced design.
In the extended version of NPAR, the following new situations are covered:
• correlation of two variables
- Kendall's τ and a test of association based on it
• two sample scale problems
- Ansari-Bradley test
- Siegel-Tukey test
The new macros are embedded within the existing architecture of NPAR and use, as far as possible, the
same symbolic names. The user interface was designed to make the macros easy to use for anyone
familiar with SAS procedures.
In the following, we recall briefly the features of the macro collection and give a description of the syntax
of the new macros.
2
Notation
• $X_i$, $Y_i$ are random variables (rv); $F$, $G$ denote the corresponding cumulative distribution functions
(cdf).
• limiting processes, for example $X_n \to X$, are always meant as $\lim_{n \to \infty}$.
• $T_n \sim N(0, 1)$ means convergence in distribution under $H_0$ to $N(0, 1)$.
• iid means independent identically distributed.
• w.l.o.g. means without loss of generality.
• #{ ...} denotes the number of elements of a set.
3
Old Macros
The spectrum of nonparametric methods covered by NPAR in the former version is delineated below.
• One Sample Problems
- Sign test (module %SIGN1)
- Wilcoxon Signed Rank test (module %WILC1)
• Independent Two Sample Problems
- Conventional Wilcoxon-Mann-Whitney test (module %WILC21)
- Extension of the Wilcoxon-Mann-Whitney test to deal with repeated observations per case
[3, 5] (module %WILC21)
• Paired Two Sample Problems
- Extended Wilcoxon ranksum test with repeated measurements per case (module %WILC20)
- Two sample test with repeated measurements in an unbalanced design [4] (module %BROE20)
4
New Macros
4.1
Two-Sample Dispersion Problems
4.1.1
Ansari-Bradley Test
• Model
\[ X_{1i} = \sigma_1 \varepsilon_i + \mu, \quad i = 1, \ldots, m \]
\[ X_{2j} = \sigma_2 \varepsilon_{j+m} + \mu, \quad j = 1, \ldots, n \]
$\varepsilon_j$ iid rv with median 0.
Note that the two groups of rv are assumed to have the same location parameter $\mu$.
• Effect parameter
\[ \gamma := \frac{\sigma_2}{\sigma_1} \]
which is to be interpreted as follows:
$\gamma < 1$: the distribution of the $X_{1i}$ is broader,
$\gamma = 1$: both distributions have the same dispersion,
$\gamma > 1$: the distribution of the $X_{2j}$ is broader.
• Hypothesis
\[ H_0: \gamma = 1 \]
• Statistic and Distribution
To compute the Ansari-Bradley statistic, special scores are attributed to the observations:
Let $Z = (X_{11}, \ldots, X_{1m}, X_{21}, \ldots, X_{2n})$ denote the combined samples.

Sample Z_(i):   Z_(1)   Z_(2)   ...   Z_(N/2)   Z_(N/2+1)   ...   Z_(N-1)   Z_(N)
Scores A_i:     1       2       ...   N/2       N/2         ...   2         1

if $N = m + n$ is even and

Sample Z_(i):   Z_(1)   Z_(2)   ...   Z_((N-1)/2)   Z_((N+1)/2)   Z_((N+3)/2)   ...   Z_(N-1)   Z_(N)
Scores A_i:     1       2       ...   (N-1)/2       (N+1)/2       (N-1)/2       ...   2         1

if $N = m + n$ is odd.
The test statistic of the Ansari-Bradley test is defined as the sum of scores in one (w.l.o.g. the first)
sample [1]:
\[ W = \sum_{i=1}^{m} A_{1i} \]
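The scoring scheme and the statistic $W$ can be illustrated with a minimal Python sketch. This is not the NPAR implementation (which is written in SAS); the function names are hypothetical, and averaging the scores within groups of tied values is an assumption based on the average scores mentioned for the tied case.

```python
from collections import defaultdict


def ansari_bradley_scores(n_total):
    """Scores for the ordered combined sample: 1, 2, ..., 2, 1."""
    return [min(i, n_total + 1 - i) for i in range(1, n_total + 1)]


def ansari_bradley_statistic(x1, x2):
    """Sum of the scores of the first sample (hypothetical helper).

    Tied values receive the average of the scores they span (assumption).
    """
    combined = [(v, 0) for v in x1] + [(v, 1) for v in x2]
    combined.sort(key=lambda pair: pair[0])
    scores = ansari_bradley_scores(len(combined))

    # Collect the scores that fall on each distinct value, then average them.
    by_value = defaultdict(list)
    for (value, _), score in zip(combined, scores):
        by_value[value].append(score)
    mean_score = {v: sum(s) / len(s) for v, s in by_value.items()}

    return sum(mean_score[v] for v, sample in combined if sample == 0)
```

For example, with a first sample `[1, 6]` and a second sample `[2, 3, 4, 5]` the first sample occupies both extremes of the combined sample and receives the two smallest scores.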
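The exact distribution of $W$ can be obtained from the following recurrence relation:
\[ r_{m,n}(w) = r_{m-1,n}(w - k) + r_{m,n-1}(w) \]
with
\[ k = \left[ \frac{N + 1}{2} \right], \]
$[\,\cdot\,]$ denoting the integer function. Here $r_{m,n}(w)$ is the number of assignments of the $N = m + n$ scores to the two samples for which $W = w$.
The probability of $W = w$ given $m$ observations in the first sample and $n$ observations in the second
sample is then
\[ P_{m,n}(W = w) = \frac{r_{m,n}(w)}{\binom{m+n}{m}} \]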
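The recurrence can be sketched in a few lines of Python, assuming no ties. This is an illustration rather than the shift algorithm used in NPAR. The boundary conditions (an empty first sample gives $W = 0$; an empty second sample gives the sum of all scores) and the division by the binomial coefficient $\binom{m+n}{m}$ are assumptions not spelled out in the text, and the function names are hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def ab_counts(m, n):
    """Return {w: r_{m,n}(w)}; the result is cached and must not be mutated."""
    total = m + n
    if m == 0:
        return {0: 1}                      # empty first sample: W = 0
    if n == 0:
        all_scores = sum(min(i, total + 1 - i) for i in range(1, total + 1))
        return {all_scores: 1}             # first sample holds every score
    k = (total + 1) // 2                   # score of the middle observation
    counts = {}
    # Middle observation belongs to the first sample: W increases by k.
    for w, c in ab_counts(m - 1, n).items():
        counts[w + k] = counts.get(w + k, 0) + c
    # Middle observation belongs to the second sample: W is unchanged.
    for w, c in ab_counts(m, n - 1).items():
        counts[w] = counts.get(w, 0) + c
    return counts


def ab_exact_pmf(m, n):
    """Exact null distribution of W as {w: probability}."""
    denom = comb(m + n, m)
    return {w: c / denom for w, c in sorted(ab_counts(m, n).items())}
```

The recursion removes the middle observation of the ordered combined sample; the remaining $N - 1$ observations then carry exactly the Ansari-Bradley scores for a sample of size $N - 1$, which is why the shift is $k = [(N+1)/2]$.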
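For large samples the following normal approximation is used:
\[ \frac{W - E(W)}{\sqrt{\operatorname{Var}(W)}} \sim N(0, 1), \]
where
\[ E(W) = \begin{cases} \dfrac{m(m + n + 2)}{4} & \text{if } m + n \text{ is even} \\[2ex] \dfrac{m(m + n + 1)^2}{4(m + n)} & \text{if } m + n \text{ is odd} \end{cases} \]
and
\[ \operatorname{Var}(W) = \begin{cases} \dfrac{mn(m + n + 2)(m + n - 2)}{48(m + n - 1)} & \text{if } m + n \text{ is even} \\[2ex] \dfrac{mn(m + n + 1)\left[3 + (m + n)^2\right]}{48(m + n)^2} & \text{if } m + n \text{ is odd} \end{cases} \]
if $F$ is continuous and
\[ \operatorname{Var}(W) = \begin{cases} \dfrac{mn\left[16 \sum_{j=1}^{g} t_j r_j^2 - (m + n)(m + n + 2)^2\right]}{16(m + n)(m + n - 1)} & \text{if } m + n \text{ is even} \\[2ex] \dfrac{mn\left[16(m + n) \sum_{j=1}^{g} t_j r_j^2 - (m + n + 1)^4\right]}{16(m + n)^2(m + n - 1)} & \text{if } m + n \text{ is odd} \end{cases} \]
if ties are present.
In the latter case, $g$ denotes the number of tied groups of observations in the combined sample,
the $j$-th group having length $t_j$ and being assigned the average score $r_j$. Untied observations are considered as ties
of length 1.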
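As a rough illustration of the normal approximation, the following Python sketch evaluates $E(W)$ and $\operatorname{Var}(W)$ and the standardised statistic. The function names and the representation of ties as a list of `(t_j, r_j)` pairs are hypothetical choices for this example. The tied-case denominators use the factor $(m + n - 1)$, which is the value for which the tied formula reduces to the untied one when every group has length 1.

```python
from math import sqrt


def ab_null_moments(m, n, tie_groups=None):
    """Return (E(W), Var(W)) under H0.

    tie_groups: optional list of (t_j, r_j) pairs covering the whole
    combined sample (group length, average score); None means no ties.
    """
    big_n = m + n
    even = big_n % 2 == 0
    if even:
        mean = m * (big_n + 2) / 4
    else:
        mean = m * (big_n + 1) ** 2 / (4 * big_n)

    if tie_groups is None:
        if even:
            var = m * n * (big_n + 2) * (big_n - 2) / (48 * (big_n - 1))
        else:
            var = m * n * (big_n + 1) * (3 + big_n ** 2) / (48 * big_n ** 2)
    else:
        s = sum(t * r ** 2 for t, r in tie_groups)
        if even:
            var = (m * n * (16 * s - big_n * (big_n + 2) ** 2)
                   / (16 * big_n * (big_n - 1)))
        else:
            var = (m * n * (16 * big_n * s - (big_n + 1) ** 4)
                   / (16 * big_n ** 2 * (big_n - 1)))
    return mean, var


def ab_z(w, m, n, tie_groups=None):
    """Standardised statistic (W - E(W)) / sqrt(Var(W))."""
    mean, var = ab_null_moments(m, n, tie_groups)
    return (w - mean) / sqrt(var)
```

Passing every observation as a group of length 1 gives the same variance as the continuous case, which is a convenient consistency check on the two formulas.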
• Program Call
%ANSBRA (
    <DATA = dataset ,>
    VAR = variable
    <, BY = by variable>
    <, MIN_ASYM = lower bound for asymptotics>
    <, MAX_EXAC = upper bound for exact calc>
);
where
dataset =                      names the SAS data set containing the data to be analysed,
                               default is _LAST_ (as in SAS)
variable =                     names the response variable to be tested (as in SAS)
by variable =                  leads to separate analyses on observations in groups
                               defined by the specified by variable (as in SAS)
lower bound for asymptotics =  smallest sample size for asymptotic results,
                               default is 10
upper bound for exact calc =   largest sample size for exact results,
                               default is 100
4.1.2
Siegel-Tukey Test
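• Model
\[ X_{1i} = \sigma_1 \varepsilon_i + \mu, \quad i = 1, \ldots, m \]
\[ X_{2j} = \sigma_2 \varepsilon_{j+m} + \mu, \quad j = 1, \ldots, n \]
$\varepsilon_i$ iid rv with zero median.
• Effect parameter
\[ \gamma := \frac{\sigma_2}{\sigma_1} \]
which is to be interpreted as follows:
$\gamma < 1$: the distribution of the $X_{1i}$ is broader,
$\gamma = 1$: both distributions have the same dispersion,
$\gamma > 1$: the distribution of the $X_{2j}$ is broader.
• Hypothesis
\[ H_0: \gamma = 1 \]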
• Statistic and Distribution
The scores for the Siegel-Tukey test are assigned according to the following scheme:
Let $Z = (X_{11}, \ldots, X_{1m}, X_{21}, \ldots, X_{2n})$ denote the combined samples.

Sample Z_(i):   Z_(1)   Z_(2)   Z_(3)   Z_(4)   ...   Z_(N-3)   Z_(N-2)   Z_(N-1)   Z_(N)
Scores S_i:     1       4       5       8       ...   7         6         3         2

if $N = m + n$ is even; if $N$ is odd, the median of the sample is dropped and the same procedure is
applied to the reduced sample.
The test statistic of the Siegel-Tukey test is defined as the sum of scores in one (w.l.o.g. the first)
sample [12]:
\[ S = \sum_{i=1}^{m} S_{1i} \]
As this test is based on a permutation of the Wilcoxon scores, $S$ has the same distribution under $H_0$
as the statistic of the Wilcoxon-Mann-Whitney test, as implemented in the module WILC21 (cf.
[6]).
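The scoring scheme can be made concrete with a short Python sketch. It is only an illustration under the assumption of untied data, not the NPAR code, and the function names are hypothetical. Scores are handed out alternately from the lower and the upper end of the ordered sample: one from below, then two from above, two from below, and so on.

```python
def siegel_tukey_scores(n_total):
    """Scores for ordered positions 1..N (N even): 1, 4, 5, 8, ..., 7, 6, 3, 2."""
    if n_total % 2:
        raise ValueError("drop the median first so that N is even")
    scores = [0] * n_total
    lo, hi = 0, n_total - 1
    from_low, run = True, 1            # first run has length 1, later runs 2
    for score in range(1, n_total + 1):
        if from_low:
            scores[lo] = score
            lo += 1
        else:
            scores[hi] = score
            hi -= 1
        run -= 1
        if run == 0:                   # switch to the other end of the sample
            from_low = not from_low
            run = 2
    return scores


def siegel_tukey_statistic(x1, x2):
    """Sum of the scores of the first sample (assumes no ties)."""
    combined = sorted([(v, 0) for v in x1] + [(v, 1) for v in x2])
    if len(combined) % 2:
        del combined[len(combined) // 2]   # drop the median if N is odd
    scores = siegel_tukey_scores(len(combined))
    return sum(s for (_, sample), s in zip(combined, scores) if sample == 0)
```

Because the scores are a rearrangement of the integers 1 to N, every subset of positions is equally likely under the null hypothesis, which is why the Wilcoxon-Mann-Whitney null distribution applies.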
• Program Call
%SIETUK (
    <DATA = dataset ,>
    VAR = variable
    <, BY = by variable>
    <, MIN_ASYM = lower bound for asymptotics>
    <, MAX_EXAC = upper bound for exact calc>
);
where
dataset =                      names the SAS data set containing the data to be analysed,
                               default is _LAST_ (as in SAS)
variable =                     names the response variable to be tested (as in SAS)
by variable =                  leads to separate analyses on observations in groups
                               defined by the specified by variable (as in SAS)
lower bound for asymptotics =  smallest sample size for asymptotic results,
                               default is 10
upper bound for exact calc =   largest sample size for exact results,
                               default is 100
4.2
Two-Sample Correlation Problems
4.2.1
Kendall's τ
• Model
$(X_i, Y_i)$ iid random vectors
• Effect parameter
$\tau$, which is to be interpreted as follows:
$\tau > 0$: the random variables $X$ and $Y$ are positively correlated,
$\tau = 0$: the random variables $X$ and $Y$ are uncorrelated,
$\tau < 0$: the random variables $X$ and $Y$ are negatively correlated.
• Hypothesis
\[ H_0: \tau = 0 \]
• Statistic and Distribution
$\tau$ is a measure of association between two rv. It can be visualised by the number of increasing
and decreasing lines between two pairs of observations $(X_i, Y_i)$, $(X_j, Y_j)$. Thus it is related to the
number
\[ S = \#\{(i,j): i < j,\ (X_i - X_j)(Y_i - Y_j) > 0\} - \#\{(i,j): i < j,\ (X_i - X_j)(Y_i - Y_j) < 0\}, \]
i. e. the difference between the number of "concordant pairs" and "discordant pairs". Then Kendall's
$\tau$ can be written as
\[ \tau = \frac{2S}{n(n - 1)}. \]
$\tau$ and $S$ are equivalent statistics (see e. g. [7]).
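The definition of $S$ translates directly into a pairwise count. The Python sketch below is a plain quadratic-time illustration with hypothetical function names; it is not the NPAR macro code.

```python
from itertools import combinations


def kendall_s(x, y):
    """Number of concordant pairs minus number of discordant pairs."""
    s = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        product = (xi - xj) * (yi - yj)
        s += (product > 0) - (product < 0)   # +1, -1, or 0 for a tied pair
    return s


def kendall_tau(x, y):
    """Kendall's tau written as 2S / (n(n - 1))."""
    n = len(x)
    return 2 * kendall_s(x, y) / (n * (n - 1))
```

For `x = [1, 2, 3, 4]` and `y = [1, 3, 2, 4]` five of the six pairs are concordant and one is discordant, so the count is 5 - 1.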
The exact distribution of $S$ (cf. [13]) is given by
\[ P(S = s) = \frac{\pi_n(s)}{n!}, \]
where the $\pi_n$ can be computed successively:
\[ \pi_{n+1}(s) = \pi_n(s - n) + \pi_n(s - n + 2) + \pi_n(s - n + 4) + \ldots + \pi_n(s + n - 2) + \pi_n(s + n) \]
with
\[ \pi_1(0) = 1, \quad \pi_1(k) = 0 \ \text{for } k \ne 0. \]
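A minimal Python sketch of this successive computation follows, assuming untied data. The function names and the two-sided p-value convention (summing the probabilities of all values at least as extreme in absolute value) are assumptions made for the example, not a description of the macro.

```python
from math import factorial


def kendall_counts(n):
    """Return {s: pi_n(s)}, built up from pi_1 by the recurrence."""
    counts = {0: 1}                              # pi_1(0) = 1
    for size in range(1, n):                     # pi_{size+1} from pi_size
        new_counts = {}
        for s, c in counts.items():
            # The added observation changes S by -size, -size+2, ..., +size.
            for shift in range(-size, size + 1, 2):
                new_counts[s + shift] = new_counts.get(s + shift, 0) + c
        counts = new_counts
    return counts


def kendall_exact_pmf(n):
    """Exact null distribution P(S = s) = pi_n(s) / n!."""
    total = factorial(n)
    return {s: c / total for s, c in sorted(kendall_counts(n).items())}


def kendall_two_sided_p(n, s_obs):
    """P(|S| >= |s_obs|) under H0 (assumed two-sided convention)."""
    pmf = kendall_exact_pmf(n)
    return sum(p for s, p in pmf.items() if abs(s) >= abs(s_obs))
```

Each pass of the outer loop performs one step of the recurrence, so the work grows polynomially in $n$ rather than with the $n!$ permutations.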
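For large samples one may use the following normal approximation: under $H_0$, $S$ has mean 0 and
\[ P(S \le s) \approx \Phi\left(\frac{s}{\sigma}\right), \]
where $\Phi$ is the cdf of the standard normal distribution and
\[ \sigma = \sqrt{\operatorname{Var}(S)} = \sqrt{\tfrac{1}{18}\, n(n - 1)(2n + 5)}. \]
When ties of length $t$ in one sample and of length $u$ in the other sample occur, the following modified
expression is used (cf. [8]):
\[ \sigma^2 = \frac{1}{18} \Big\{ n(n - 1)(2n + 5) - \sum t(t - 1)(2t + 5) - \sum u(u - 1)(2u + 5) \Big\} \]
\[ \qquad + \frac{1}{9n(n - 1)(n - 2)} \Big\{ \sum t(t - 1)(t - 2) \Big\} \Big\{ \sum u(u - 1)(u - 2) \Big\} \]
\[ \qquad + \frac{1}{2n(n - 1)} \Big\{ \sum t(t - 1) \Big\} \Big\{ \sum u(u - 1) \Big\}, \]
where the sums run over all groups of ties in the respective sample.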
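The tie-corrected variance can be evaluated as in the following Python sketch. The function name is hypothetical, tie lengths are obtained by counting repeated values in each sample, and the formula requires $n \ge 3$ because of the factor $(n - 2)$ in one denominator.

```python
from collections import Counter


def kendall_s_variance(x, y):
    """Var(S) under H0 with the correction for ties in x and in y (n >= 3)."""
    n = len(x)
    t = list(Counter(x).values())      # tie lengths in the first sample
    u = list(Counter(y).values())      # tie lengths in the second sample

    base = n * (n - 1) * (2 * n + 5)
    base -= sum(c * (c - 1) * (2 * c + 5) for c in t)
    base -= sum(c * (c - 1) * (2 * c + 5) for c in u)

    triple = (sum(c * (c - 1) * (c - 2) for c in t)
              * sum(c * (c - 1) * (c - 2) for c in u)
              / (9 * n * (n - 1) * (n - 2)))
    pair = (sum(c * (c - 1) for c in t)
            * sum(c * (c - 1) for c in u)
            / (2 * n * (n - 1)))
    return base / 18 + triple + pair
```

Untied values contribute groups of length 1, for which every correction term vanishes, so the function reduces to $n(n-1)(2n+5)/18$ when there are no ties.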
• Program Call
%KENTAU (
    <DATA = dataset ,>
    VAR1 = first variable,
    VAR2 = second variable
    <, BY = by variable>
    <, MIN_ASYM = lower bound for asymptotics>
    <, MAX_EXAC = upper bound for exact calc>
);
where
dataset =                      names the SAS data set containing the data to be analysed,
                               default is _LAST_ (as in SAS)
first variable =               names the first response variable to be tested
second variable =              names the second response variable to be tested
by variable =                  leads to separate analyses on observations in groups
                               defined by the specified by variable (as in SAS)
lower bound for asymptotics =  smallest sample size for asymptotic results,
                               default is 10
upper bound for exact calc =   largest sample size for exact results,
                               default is 100
5
Closing Remarks
The increasing popularity of exact nonparametric statistical methods has not yet led to their integration
into SAS. Prior to the development of NPAR, the only way of compensating for this deficiency was to use
other statistical software for exact nonparametrics (e. g. StatXact). Now, NPAR allows the user to perform
the exact versions of the most commonly used nonparametric tests within SAS. The main advantage of
this integrated software solution is that it avoids the problems associated with transferring data
to other statistical software systems. The convenient user interface makes NPAR easy to handle for an
ordinary SAS user and removes the need to learn a new software package.
In addition, the implementation of NPAR in SAS/IML and the SAS macro language allows further extensions
of the collection, which are planned for the future.
References
[1] A. R. Ansari and R. A. Bradley. Rank-sum tests for dispersions. Ann. Math. Statist., 31:1174-1189,
1960.
[2] T. Bregenzer, O. Gefeller, and E. Brunner. SAS macros for exact methods in non-parametrical statistics.
In Proceedings of the SAS European Users Group International Conference, pages 724-735. SAS
Institute Inc., Heidelberg, 1992.
[3] E. Brunner and D. Compagnone. Two sample rank tests for repeated observations for small sample sizes. Statistical Software Newsletter, 14:36-42, 1988.
the distribution
[4] E. Brunner and H. Dette. Rank procedures for the two-factor mixed model. JASA, 87:884-888, 1992.
[5] E. Brunner and N. Neumann. Two-sample rank tests in general models. Biom.J., 28:395-402, 1986.
[6] J. D. Gibbons. Nonparametric Statistical Inference. McGraw-Hill, New York, 1971.
[7] J. Hajek and Z. Sidak. Theory of Rank Tests. Academic Press, New York, 1967.
[8] M. G. Kendall. Rank Correlation Methods. Charles Griffin, London, fourth edition, 1975.
[9] N. Neumann. Some procedures for calculating the distributions of elementary nonparametric test statistics. Statistical Software Newsletter, 14:120-126, 1988.
[10] R. H. Randles and D. A. Wolfe. Introduction to the Theory of Nonparametric Statistics. John Wiley,
New York, 1979.
[11] SAS Institute Inc., Cary, NC, USA. SAS/STAT User's Guide, Version 6, fourth edition.
[12] S. Siegel and J. W. Tukey. A nonparametric sum of ranks procedure for relative spread in unpaired
samples. JASA, 55:429-445, 1960. corrected: JASA 56:1005 (1961).
[13] G. P. Sillitto. The distribution of Kendall's τ coefficient of rank correlation in rankings containing ties.
Biometrika, 34:36-40, 1947.
[14] B. Streitberg and J. Rohmel. Exact calculations for permutation and rank tests: an introduction to
some recently published algorithms. Statistical Software Newsletter, 12:10-17, 1986.
SAS, SAS/IML, and SAS/STAT are registered trademarks of SAS Institute Inc., Cary, NC, USA.
StatXact is a registered trademark of CYTEL Software Corporation, Cambridge, MA 02139, USA.