Power comparison of nonparametric test for
latent root of covariance matrix in two
populations
Shin-ichi Tsukada
Faculty of Physical Sciences and Engineering, Meisei University, 2-1-1 Hodokubo,
Hino, Tokyo 191-8506, Japan. tsukada@ge.meisei-u.ac.jp
1 Introduction.
We compare the actual significance levels and the powers of nonparametric tests of
the hypothesis that the α-th largest latent roots of the covariance matrices of two
populations are equal. In principal component analysis (PCA), the α-th largest latent
root of the covariance matrix represents the contribution of the α-th principal component.
Although there are many books on PCA, we have rarely seen a treatment of the hypothesis
that the latent roots of two populations are equal. Sugiyama and Ushizawa [SU98]
propose a procedure that applies the Ansari-Bradley test, a test for the equality
of variances, and assess its accuracy by simulation.
Kamat's test, the Jackknife test and the Permutation test are also available for
testing the equality of variances. We investigate the suitability of procedures that
apply these three tests. We compare the actual significance levels and the powers of
the procedure of [SU98] and of our procedures, and show by simulation that the
procedure applying the Permutation test is superior.
2 Test Procedure.
Suppose that $\{x_i^{(g)} ; i = 1, \ldots, n_g\}$ $(g = 1, 2)$ are random observations from the p-variate
population $\Lambda_p(\mu_g, \Sigma^{(g)})$ with mean $\mu_g$ and covariance matrix $\Sigma^{(g)}$. Let $\lambda_\alpha^{(g)}$ and
$\gamma_\alpha^{(g)}$ be the α-th largest latent root and the latent vector corresponding to $\lambda_\alpha^{(g)}$,
respectively. We consider the following hypothesis:
\[
H_0 : \lambda_\alpha^{(1)} = \lambda_\alpha^{(2)} \ (= \lambda_\alpha), \qquad
H_1 : \lambda_\alpha^{(1)} \neq \lambda_\alpha^{(2)}.
\]
We calculate the sample mean $\bar{x}_g$, the sample latent root $l_\alpha^{(g)}$ and the sample
latent vector $h_\alpha^{(g)}$. Let
\[
Y_i = h_\alpha^{(1)\prime} (x_i^{(1)} - \bar{x}_1) \equiv (y_{\alpha i} - \bar{y}_\alpha), \qquad (i = 1, \ldots, n_1),
\]
\[
Z_j = h_\alpha^{(2)\prime} (x_j^{(2)} - \bar{x}_2) \equiv (z_{\alpha j} - \bar{z}_\alpha), \qquad (j = 1, \ldots, n_2).
\]
Under the null hypothesis, the variances and the covariance of Y and Z are as follows:
\[
\mathrm{Var}[Y] = \lambda_\alpha + O(n_1^{-1}), \qquad
\mathrm{Var}[Z] = \lambda_\alpha + O(n_2^{-1}), \qquad
\mathrm{Cov}[Y, Z] = 0.
\]
The scores Y and Z asymptotically have the same distribution and are uncorrelated.
Under the alternative hypothesis, the variances and the covariance are as follows:
\[
\mathrm{Var}[Y] = \lambda_\alpha^{(1)} + O(n_1^{-1}), \qquad
\mathrm{Var}[Z] = \lambda_\alpha^{(2)} + O(n_2^{-1}), \qquad
\mathrm{Cov}[Y, Z] = 0.
\]
The scores Y and Z are uncorrelated but do not asymptotically have the same
distribution. Therefore we treat the above hypothesis as a test of
the equality of the variances of Y and Z. Sugiyama and Ushizawa [SU98] adopt
the Ansari-Bradley test for the equality of variances. We propose procedures applying
Kamat's test, the Jackknife test and the Permutation test.
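The construction of the scores can be sketched in a few lines. The following is a minimal illustration, assuming NumPy is available; the function name `pc_scores` and the example data are hypothetical and not part of the paper. It projects the centered observations onto the sample latent vector of the α-th largest sample latent root.

```python
import numpy as np

def pc_scores(x, alpha=1):
    """Return the scores h_alpha'(x_i - xbar) and the alpha-th largest sample latent root."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    cov = np.cov(x, rowvar=False)            # unbiased sample covariance matrix
    roots, vectors = np.linalg.eigh(cov)     # eigh returns roots in ascending order
    order = np.argsort(roots)[::-1]          # indices sorted from largest to smallest root
    index = order[alpha - 1]
    return centered @ vectors[:, index], roots[index]

# Illustrative data only: two 3-variate normal samples with equal covariance matrices.
rng = np.random.default_rng(0)
cov_diag = np.diag([7.5, 2.0, 1.0])
x1 = rng.multivariate_normal(np.zeros(3), cov_diag, size=50)
x2 = rng.multivariate_normal(np.zeros(3), cov_diag, size=50)

y, l1 = pc_scores(x1, alpha=1)   # scores Y_i and sample latent root of population 1
z, l2 = pc_scores(x2, alpha=1)   # scores Z_j and sample latent root of population 2
```

The sign of a latent vector is arbitrary, which does not matter here because only the spread of the scores is tested. By construction the unbiased sample variance of the scores equals the corresponding sample latent root, so any two-sample test for variances can be applied to `y` and `z`.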
Nonparametric tests require the observations within each sample to be independent. However, $y_{\alpha i}$ and $y_{\alpha j}$
are no longer independent, and the same holds for $z_{\alpha i}$ and $z_{\alpha j}$. We now evaluate the degree of
dependence. We omit the suffix representing the population and let $E(x_i) = 0$ without loss of generality. When $\Sigma = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ and $\lambda_k$ is simple, the covariance
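of $y_{\alpha i}$ and $y_{\alpha j}$ is as follows:
\[
\begin{aligned}
E[y_{\alpha i} y_{\alpha j}]
&= E\Biggl[\sum_{u=1}^{p}\sum_{v=1}^{p} h_{u\alpha} h_{v\alpha} x_{ui} x_{vj}\Biggr] \\
&= -\frac{2}{n_1^2}\sum_{l\neq\alpha}\lambda_{l\alpha}^{2}\,\kappa^{21}_{\alpha l}\kappa^{21}_{\alpha l}
 -\frac{1}{n_1^2}\sum_{u\neq\alpha}\lambda_{u\alpha}^{2}\bigl(\kappa^{21}_{\alpha u}\kappa^{21}_{\alpha u}+\kappa^{3}_{\alpha}\kappa^{12}_{\alpha u}\bigr) \\
&\quad +\frac{1}{n_1^2}\sum_{\substack{l,u\neq\alpha\\ u\neq l}}\lambda_{u\alpha}\lambda_{l\alpha}\bigl(\kappa^{21}_{ul}\kappa^{21}_{\alpha l}+\kappa^{111}_{ul\alpha}\kappa^{111}_{ul\alpha}\bigr)
 -\frac{1}{n_1^2}\sum_{v\neq\alpha}\lambda_{v\alpha}^{2}\bigl(\kappa^{3}_{\alpha}\kappa^{12}_{\alpha v}+\kappa^{21}_{\alpha v}\kappa^{21}_{\alpha v}\bigr) \\
&\quad +\frac{1}{n_1^2}\sum_{\substack{v,l\neq\alpha\\ v\neq l}}\lambda_{v\alpha}\lambda_{l\alpha}\bigl(\kappa^{111}_{vl\alpha}\kappa^{111}_{vl\alpha}+\kappa^{21}_{vl}\kappa^{21}_{\alpha l}\bigr) \\
&\quad +\frac{1}{n_1^2}\sum_{u,v\neq\alpha}\lambda_{u\alpha}\lambda_{v\alpha}\bigl(\kappa^{12}_{\alpha u}\kappa^{12}_{\alpha v}+\kappa^{111}_{\alpha uv}\kappa^{111}_{\alpha uv}\bigr)+O(n^{-3}),
\end{aligned}
\tag{1}
\]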
where $\lambda_{\alpha\beta} = (\lambda_\alpha - \lambda_\beta)^{-1}$. The third moments are denoted by $\kappa^{3}_{i} = E(x_i x_i x_i)$, $\kappa^{21}_{ij} = E(x_i x_i x_j)$,
$\kappa^{12}_{ij} = E(x_i x_j x_j)$ and $\kappa^{111}_{ijk} = E(x_i x_j x_k)$. Although we do not write out
the higher-order terms of the above expansion, the expansion consists of odd-order
moments. For a symmetric population these moments vanish, so
\[
E[y_{\alpha i} y_{\alpha j}] = 0,
\]
and the degree of dependence is very weak. The expansion also shows that the degree of
dependence is weak when the sample size is sufficiently large. Therefore, for large
samples we may ignore the influence of dependence. For an asymmetric population, however,
the degree of dependence is influenced by the third moments. This appears in the
simulation results.
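As a small numerical illustration of the role of the third moments, the sketch below compares the sample third central moment of a symmetric marginal with that of a log-normal marginal. It assumes NumPy; the helper `third_central_moment`, the sample size and the seed are illustrative choices, and the parameters 7.5 and 1.1890 are borrowed from the simulation settings of Section 3.

```python
import numpy as np

def third_central_moment(values):
    """Sample third central moment: mean of (v - mean)^3."""
    centered = values - values.mean()
    return np.mean(centered ** 3)

rng = np.random.default_rng(1)
n = 200_000

# Symmetric case: a normal marginal with variance 7.5.
normal_draws = rng.normal(0.0, np.sqrt(7.5), size=n)

# Asymmetric case: exp of a normal with variance 1.1890 (assumed reading of LN(0, .)).
lognormal_draws = np.exp(rng.normal(0.0, np.sqrt(1.1890), size=n))

m3_normal = third_central_moment(normal_draws)        # close to zero
m3_lognormal = third_central_moment(lognormal_draws)  # clearly positive
```

For the symmetric marginal the estimate fluctuates around zero, whereas for the log-normal marginal it is far from zero, which is the situation in which the leading terms of expansion (1) do not vanish.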
3 Simulation.
First we simulate the actual significance levels for the multivariate normal,
contaminated normal and log-normal populations. We set α = 1, the
sample sizes to n1 = n2 = 20, 50 and 100, and the number of simulation runs to one hundred
thousand. As populations we select the multivariate normal distributions
\[
N(0, \Sigma_3^{(g)}), \quad N(0, \Sigma_5^{(g)}) \quad \text{and} \quad N(0, \Sigma_7^{(g)}),
\]
the contaminated normal distributions
\[
0.4N(0, 2.0\Sigma_3^{(g)}) + 0.6N(0, \Sigma_3^{(g)}), \quad
0.4N(0, 2.0\Sigma_5^{(g)}) + 0.6N(0, \Sigma_5^{(g)}) \quad \text{and} \quad
0.4N(0, 2.0\Sigma_7^{(g)}) + 0.6N(0, \Sigma_7^{(g)}),
\]
and the log-normal distributions
\[
LN(0, \mathrm{diag}(1.1890, 0.6931, 0.4812)), \quad
LN(0, \mathrm{diag}(2.0449, 1.5110, 0.9406, 0.6932, 0.4812))
\]
\[
\text{and} \quad LN(0, \mathrm{diag}(2.1840, 1.9234, 1.5110, 1.2156, 0.9406, 0.6931, 0.4812)),
\]
where $\Sigma_3^{(g)} = \mathrm{diag}(7.5, 2.0, 1.0)$, $\Sigma_5^{(g)} = \mathrm{diag}(52.0, 16.0, 4.0, 2.0, 1.0)$,
$\Sigma_7^{(g)} = \mathrm{diag}(70.0, 40.0, 16.0, 8.0, 4.0, 2.0, 1.0)$ and g = 1, 2.
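The three population types can be sampled as in the following sketch for the 3-dimensional case. It assumes NumPy, and the function names are hypothetical. It also assumes that LN(0, D) denotes the componentwise exponential of an N(0, D) vector; under that reading the log-normal parameters reproduce the variances 7.5, 2.0 and 1.0 of Σ3, which the variable `implied_var` checks.

```python
import numpy as np

SIGMA3 = np.array([7.5, 2.0, 1.0])             # diagonal of Sigma_3 from the text
LOG_VAR3 = np.array([1.1890, 0.6931, 0.4812])  # diagonal used in LN(0, .)

def sample_normal(rng, n, diag_cov):
    """n draws from N(0, diag(diag_cov))."""
    return rng.normal(size=(n, len(diag_cov))) * np.sqrt(diag_cov)

def sample_contaminated(rng, n, diag_cov, weight=0.4, scale=2.0):
    """n draws from weight*N(0, scale*Sigma) + (1 - weight)*N(0, Sigma)."""
    draws = sample_normal(rng, n, diag_cov)
    inflated = rng.random(n) < weight          # rows drawn from the inflated component
    draws[inflated] *= np.sqrt(scale)
    return draws

def sample_lognormal(rng, n, diag_log_var):
    """Assumed reading of LN(0, D): componentwise exp of N(0, D)."""
    return np.exp(sample_normal(rng, n, diag_log_var))

# Variance of exp(N(0, s)) is (e^s - 1) e^s; compare with the diagonal of Sigma_3.
implied_var = (np.exp(LOG_VAR3) - 1.0) * np.exp(LOG_VAR3)

rng = np.random.default_rng(0)
contaminated = sample_contaminated(rng, 200_000, SIGMA3)
lognormal = sample_lognormal(rng, 1_000, LOG_VAR3)
```

Under the mixture weights above, the covariance matrix of the contaminated population is (0.4 × 2.0 + 0.6) Σ = 1.4 Σ, so the ratios of the latent roots are unchanged.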
Tables 1, 2 and 3 give the actual significance levels for the normal
population, the contaminated normal population and the log-normal population,
respectively. In each table, A denotes the Ansari-Bradley test, K Kamat's test, J the Jackknife
test, P1 the Permutation test by Good [GO00] and P2 the Permutation test by Aly
[AL90]. The numbers in parentheses are standard errors, and the nominal
significance level is 0.05.
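The statistics used by Good [GO00] and Aly [AL90] are not specified in this paper. As a generic illustration of how a permutation test for the equality of two variances operates on the scores, the sketch below permutes absolute deviations from the sample means; this statistic, the function name and the example data are assumptions made for illustration and should not be read as either of the two procedures compared in the tables.

```python
import random

def permutation_variance_test(y, z, n_perm=2000, seed=0):
    """Two-sided permutation p-value for equal spread, based on mean absolute deviations."""
    rng = random.Random(seed)

    def abs_deviations(values):
        mean = sum(values) / len(values)
        return [abs(v - mean) for v in values]

    pooled = abs_deviations(y) + abs_deviations(z)
    n1 = len(y)

    def statistic(values):
        first, second = values[:n1], values[n1:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = statistic(pooled)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                  # random relabelling of the deviations
        if statistic(pooled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)       # add-one correction keeps p > 0

# Illustrative data: the second sample has three times the standard deviation.
data_rng = random.Random(42)
narrow = [data_rng.gauss(0.0, 1.0) for _ in range(50)]
wide = [data_rng.gauss(0.0, 3.0) for _ in range(50)]
other_narrow = [data_rng.gauss(0.0, 1.0) for _ in range(50)]

p_different = permutation_variance_test(narrow, wide)
p_same = permutation_variance_test(narrow, other_narrow)
```

The hypothesis of equal variances would be rejected at the 5% level when the returned p-value is at most 0.05.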
Table 1. Actual Significance Levels (Normal Population, Significance Level 5%)

Dimension  Test  n = 20          n = 50          n = 100
3          A     .0444 (.0015)   .0484 (.0019)   .0501 (.0022)
3          K     .0203 (.0012)   .0369 (.0014)   .0237 (.0015)
3          J     .0424 (.0019)   .0482 (.0026)   .0490 (.0019)
3          P1    .0428 (.0018)   .0474 (.0015)   .0489 (.0031)
3          P2    .0697 (.0018)   .0662 (.0018)   .0620 (.0022)
5          A     .0447 (.0020)   .0492 (.0027)   .0488 (.0016)
5          K     .0187 (.0011)   .0369 (.0015)   .0239 (.0011)
5          J     .0417 (.0014)   .0477 (.0018)   .0487 (.0022)
5          P1    .0426 (.0016)   .0462 (.0018)   .0494 (.0025)
5          P2    .0695 (.0018)   .0648 (.0025)   .0622 (.0024)
7          A     .0298 (.0011)   .0412 (.0020)   .0449 (.0031)
7          K     .0108 (.0009)   .0314 (.0011)   .0220 (.0014)
7          J     .0215 (.0015)   .0354 (.0012)   .0436 (.0018)
7          P1    .0257 (.0017)   .0370 (.0015)   .0448 (.0014)
7          P2    .0488 (.0022)   .0550 (.0019)   .0580 (.0025)
Table 2. Actual Significance Levels (Contaminated Normal Population, Significance Level 5%)

Dimension  Test  n = 20          n = 50          n = 100
3          A     .0488 (.0016)   .0501 (.0023)   .0502 (.0030)
3          K     .0180 (.0012)   .0366 (.0021)   .0221 (.0018)
3          J     .0513 (.0026)   .0554 (.0022)   .0517 (.0021)
3          P1    .0407 (.0019)   .0469 (.0022)   .0488 (.0026)
3          P2    .0648 (.0030)   .0641 (.0023)   .0609 (.0027)
5          A     .0469 (.0019)   .0501 (.0015)   .0504 (.0019)
5          K     .0182 (.0013)   .0343 (.0010)   .0241 (.0017)
5          J     .0498 (.0026)   .0539 (.0022)   .0533 (.0013)
5          P1    .0402 (.0022)   .0460 (.0021)   .0474 (.0022)
5          P2    .0643 (.0018)   .0633 (.0024)   .0609 (.0019)
7          A     .0367 (.0018)   .0468 (.0022)   .0482 (.0019)
7          K     .0081 (.0006)   .0240 (.0016)   .0185 (.0013)
7          J     .0255 (.0016)   .0344 (.0023)   .0421 (.0017)
7          P1    .0227 (.0013)   .0333 (.0014)   .0415 (.0016)
7          P2    .0458 (.0026)   .0519 (.0024)   .0548 (.0025)

Table 3. Actual Significance Levels (Log Normal Population, Significance Level 5%)

Dimension  Test  n = 20          n = 50          n = 100
3          A     .2976 (.0037)   .4531 (.0052)   .5462 (.0057)
3          K     .2652 (.0042)   .3997 (.0063)   .4637 (.0032)
3          J     .0629 (.0016)   .0777 (.0020)   .0863 (.0031)
3          P1    .0167 (.0013)   .0285 (.0018)   .0387 (.0020)
3          P2    .0680 (.0025)   .0533 (.0024)   .0543 (.0015)
5          A     .3506 (.0065)   .5309 (.0042)   .6356 (.0050)
5          K     .3121 (.0045)   .4082 (.0033)   .4893 (.0080)
5          J     .0551 (.0018)   .0669 (.0028)   .0726 (.0027)
5          P1    .0104 (.0009)   .0184 (.0009)   .0242 (.0015)
5          P2    .0404 (.0016)   .0306 (.0016)   .0301 (.0010)
7          A     .2703 (.0045)   .4274 (.0059)   .5214 (.0044)
7          K     .2221 (.0041)   .2619 (.0057)   .3358 (.0048)
7          J     .0324 (.0009)   .0387 (.0017)   .0434 (.0023)
7          P1    .0048 (.0007)   .0083 (.0011)   .0121 (.0010)
7          P2    .0280 (.0015)   .0227 (.0009)   .0225 (.0019)

Overall, the actual significance levels of Kamat's test are unsatisfactory. The
actual significance levels of the Ansari-Bradley test, the Jackknife test and the Permutation test
by Good are satisfactory for the normal population. For the contaminated normal population,
those of the Ansari-Bradley test converge to 0.05 faster than those of the Jackknife
test and the Permutation test by Good. For the log-normal population, those of the
Ansari-Bradley test are unsatisfactory, and those of the Permutation test by Aly are only
marginally satisfactory.
These results are to be expected from the covariance (1) of $y_{\alpha i}$ and $y_{\alpha j}$. The
expansion of this covariance consists of odd-order moments, and odd-order
moments are zero in a symmetric population. Therefore, when the population
is symmetric, the dependence among the scores is weak. When the population is asymmetric,
a large sample size is necessary to reduce the dependence.
The convergence of the actual significance levels to 0.05 depends on the proportion
of the latent root $l_\alpha$.
Next, we investigate the power of the tests by simulation when the sample size is 20,
50 and 100. We set the alternative hypotheses as follows:
\[
H_{3i} : \Sigma_3^{(1)} = \mathrm{diag}(7.5, 2.0, 1.0), \quad
\Sigma_3^{(2)} = \mathrm{diag}(7.5 - \delta_{3i}/\sqrt{n_2}, 2.0, 1.0),
\]
\[
H_{5i} : \Sigma_5^{(1)} = \mathrm{diag}(52.0, 16.0, 4.0, 2.0, 1.0), \quad
\Sigma_5^{(2)} = \mathrm{diag}(52.0 - \delta_{5i}/\sqrt{n_2}, 16.0, 4.0, 2.0, 1.0),
\]
\[
H_{7i} : \Sigma_7^{(1)} = \mathrm{diag}(70.0, 40.0, 16.0, 8.0, 4.0, 2.0, 1.0), \quad
\Sigma_7^{(2)} = \mathrm{diag}(70.0 - \delta_{7i}/\sqrt{n_2}, 40.0, 16.0, 8.0, 4.0, 2.0, 1.0), \quad (i = 1, 2).
\]
As in the simulation of the actual significance levels, we set α = 1, the sample sizes to
n1 = n2 = 20, 50 and 100, and the number of simulation runs to one hundred thousand. We
set $(\delta_{31}, \delta_{32}, \delta_{51}, \delta_{52}, \delta_{71}, \delta_{72})$ to (15, 20, 60, 80, 100, 130), (20, 30, 150, 200,
150, 200) and (30, 40, 250, 300, 200, 270) for n2 = 20, 50 and 100, respectively. The
upper parts of Table 4 and Table 5 give the power of the tests for the alternative
hypotheses H31, H51 and H71, and the lower parts give the power for the hypotheses
H32, H52 and H72. Table 4 and Table 5 give the power of the tests for the normal
population and the contaminated normal population, respectively.
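The arithmetic behind these alternatives can be checked with a short sketch. The dictionary layout and the function name `shifted_top_root` are illustrative; the numerical values are those given in the text.

```python
import math

# Largest and second largest latent roots of Sigma_3, Sigma_5 and Sigma_7.
BASE_TOP_ROOT = {3: 7.5, 5: 52.0, 7: 70.0}
SECOND_ROOT = {3: 2.0, 5: 16.0, 7: 40.0}

# (delta_d1, delta_d2) for each sample size n2 and dimension d.
DELTAS = {
    20: {3: (15, 20), 5: (60, 80), 7: (100, 130)},
    50: {3: (20, 30), 5: (150, 200), 7: (150, 200)},
    100: {3: (30, 40), 5: (250, 300), 7: (200, 270)},
}

def shifted_top_root(dim, n2, i):
    """Largest latent root of Sigma_dim^(2) under H_{dim,i}: base - delta / sqrt(n2)."""
    delta = DELTAS[n2][dim][i - 1]
    return BASE_TOP_ROOT[dim] - delta / math.sqrt(n2)

# Example: under H_31 with n2 = 20 the root is 7.5 - 15 / sqrt(20), about 4.146.
example_root = shifted_top_root(3, 20, 1)

# In every setting the shifted root stays above the second root, so it remains the largest.
remains_largest = all(
    shifted_top_root(dim, n2, i) > SECOND_ROOT[dim]
    for n2 in DELTAS
    for dim in (3, 5, 7)
    for i in (1, 2)
)
```

Because the shifted root remains the largest in all eighteen settings, the comparison with α = 1 always concerns the first principal component in both populations.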
Table 4. Powers of Test (Normal Population, Significance Level 5%)

Upper part: alternative hypotheses H31, H51 and H71

Dimension  Test  n = 20          n = 50          n = 100
3          A     .1367 (.0029)   .2340 (.0032)   .4924 (.0083)
3          K     .0835 (.0021)   .2047 (.0044)   .3061 (.0051)
3          J     .1782 (.0025)   .3449 (.0054)   .6917 (.0057)
3          P1    .2753 (.0063)   .4674 (.0058)   .7965 (.0029)
3          P2    .3412 (.0065)   .5152 (.0049)   .8165 (.0036)
5          A     .0659 (.0022)   .2762 (.0042)   .6986 (.0048)
5          K     .0337 (.0024)   .2374 (.0027)   .4895 (.0048)
5          J     .0718 (.0026)   .4027 (.0041)   .8843 (.0029)
5          P1    .1233 (.0038)   .5275 (.0042)   .9364 (.0019)
5          P2    .1728 (.0044)   .5752 (.0050)   .9434 (.0022)
7          A     .0463 (.0018)   .1060 (.0035)   .2026 (.0030)
7          K     .0195 (.0014)   .0871 (.0020)   .1073 (.0042)
7          J     .0422 (.0024)   .1375 (.0037)   .3024 (.0051)
7          P1    .0851 (.0037)   .2302 (.0052)   .4312 (.0062)
7          P2    .1361 (.0024)   .2818 (.0051)   .4698 (.0072)

Lower part: alternative hypotheses H32, H52 and H72

Dimension  Test  n = 20          n = 50          n = 100
3          A     .2472 (.0054)   .5724 (.0037)   .8185 (.0047)
3          K     .1755 (.0027)   .5294 (.0042)   .6259 (.0051)
3          J     .3462 (.0048)   .7804 (.0025)   .9560 (.0018)
3          P1    .4796 (.0093)   .8628 (.0033)   .9796 (.0013)
3          P2    .5419 (.0083)   .8802 (.0021)   .9816 (.0009)
5          A     .0877 (.0026)   .7633 (.0049)   .8942 (.0028)
5          K     .0481 (.0025)   .7302 (.0056)   .7285 (.0042)
5          J     .1038 (.0037)   .9348 (.0023)   .9845 (.0012)
5          P1    .1752 (.0043)   .8179 (.0033)   .9936 (.0009)
5          P2    .2338 (.0050)   .8423 (.0035)   .9943 (.0009)
7          A     .0609 (.0021)   .1631 (.0042)   .3413 (.0051)
7          K     .0274 (.0017)   .1371 (.0023)   .1902 (.0063)
7          J     .0619 (.0030)   .2320 (.0044)   .5172 (.0069)
7          P1    .1196 (.0036)   .3544 (.0071)   .6521 (.0059)
7          P2    .1791 (.0026)   .4125 (.0066)   .6851 (.0061)
Table 5. Powers of Test (Contaminated Normal Population, Significance Level 5%)

Upper part: alternative hypotheses H31, H51 and H71

Dimension  Test  n = 20          n = 50          n = 100
3          A     .1245 (.0032)   .2070 (.0052)   .4220 (.0068)
3          K     .0485 (.0019)   .1134 (.0028)   .1577 (.0028)
3          J     .1198 (.0038)   .2265 (.0047)   .4771 (.0059)
3          P1    .1864 (.0044)   .3217 (.0061)   .6043 (.0056)
3          P2    .2572 (.0038)   .3915 (.0055)   .6707 (.0064)
5          A     .0641 (.0023)   .2372 (.0042)   .6151 (.0068)
5          K     .0241 (.0014)   .1235 (.0034)   .2467 (.0034)
5          J     .0655 (.0024)   .2538 (.0040)   .6725 (.0037)
5          P1    .0929 (.0029)   .3662 (.0046)   .7880 (.0029)
5          P2    .1387 (.0029)   .4428 (.0059)   .8378 (.0030)
7          A     .0508 (.0031)   .1032 (.0027)   .1856 (.0038)
7          K     .0112 (.0009)   .0404 (.0029)   .0456 (.0027)
7          J     .0340 (.0013)   .0783 (.0039)   .1704 (.0037)
7          P1    .0563 (.0022)   .1380 (.0041)   .2687 (.0041)
7          P2    .1043 (.0031)   .1981 (.0048)   .3372 (.0057)

Lower part: alternative hypotheses H32, H52 and H72

Dimension  Test  n = 20          n = 50          n = 100
3          A     .2148 (.0042)   .4996 (.0063)   .7450 (.0038)
3          K     .0894 (.0028)   .2760 (.0044)   .3399 (.0046)
3          J     .2000 (.0048)   .5320 (.0055)   .8005 (.0036)
3          P1    .3107 (.0045)   .6696 (.0033)   .8863 (.0035)
3          P2    .4003 (.0037)   .7431 (.0029)   .9213 (.0031)
5          A     .0834 (.0026)   .6893 (.0053)   .8325 (.0040)
5          K     .0311 (.0019)   .3920 (.0049)   .4020 (.0062)
5          J     .0818 (.0027)   .7080 (.0048)   .8737 (.0034)
5          P1    .1240 (.0035)   .6098 (.0049)   .9378 (.0017)
5          P2    .1798 (.0043)   .6912 (.0056)   .9609 (.0019)
7          A     .0623 (.0036)   .1508 (.0039)   .3028 (.0022)
7          K     .0147 (.0009)   .0574 (.0033)   .0749 (.0041)
7          J     .0424 (.0013)   .1208 (.0042)   .2845 (.0039)
7          P1    .0742 (.0027)   .2048 (.0046)   .4194 (.0063)
7          P2    .1313 (.0033)   .2808 (.0052)   .5030 (.0051)

Overall, the power of Kamat's test is smaller than the powers of the other tests because
the actual significance levels of Kamat's test are unsatisfactory. Kamat's test is not
useful.
Because the actual significance levels of the Ansari-Bradley test, the Jackknife test and
the Permutation tests are comparatively well preserved, we compare their powers.
The tendency is similar in the normal population and the contaminated normal
population. The powers of the two Permutation tests are close to each other under the
alternative hypotheses for which the powers are large, and they are larger than the power of the
Ansari-Bradley test. The power of the Jackknife test is larger than the powers of the Permutation
tests under H52 with n1 = n2 = 50. Under the other alternative hypotheses, however, the powers
of the Permutation tests are larger than the powers of the Ansari-Bradley test and the
Jackknife test. By simulation, we find that the Permutation test by Good is superior to
the Ansari-Bradley test in a symmetric population.
4 Conclusions.
In this paper, we show the actual significance levels and powers of procedures that apply
nonparametric tests for the equality of variances of two populations.
The simulation suggests that the procedures applying the Ansari-Bradley test, the Jackknife
test and the Permutation test may be useful for a symmetric population. One of the reasons may be
that the dependence among the scores is weak under a symmetric population.
Based on the results of the power comparison, we may recommend the procedure applying the
Permutation test. However, further simulation is needed in various situations, for example
when α = 2, when the sample sizes differ, and when the latent roots other than
λα differ between the populations.
For an asymmetric population, none of the tests is useful. It is necessary to develop
a new method that is applicable to an asymmetric population.
5 Appendix.
Kamat’s test
Let $\{x_1, \ldots, x_m\}$ and $\{y_1, \ldots, y_n\}$ be random samples from two populations. We
arrange the pooled sample in order and let
a  : the number of y that are larger than $x_{\max}$,
a′ : the number of x that are larger than $y_{\max}$,
b  : the number of y that are smaller than $x_{\min}$,
b′ : the number of x that are smaller than $y_{\min}$.
We test the equality of variances using S = a + b − (a′ + b′).
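The statistic S can be computed directly from the four counts. The following is a minimal sketch; the function name `kamat_statistic` and the example values are illustrative, and the sketch computes only the statistic, not its null distribution or critical values.

```python
def kamat_statistic(x, y):
    """S = a + b - (a' + b') from the counts of observations beyond the other sample's range."""
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    a = sum(1 for v in y if v > x_max)        # y values above the largest x
    a_prime = sum(1 for v in x if v > y_max)  # x values above the largest y
    b = sum(1 for v in y if v < x_min)        # y values below the smallest x
    b_prime = sum(1 for v in x if v < y_min)  # x values below the smallest y
    return a + b - (a_prime + b_prime)

# Illustrative data: y is more spread out than x, so S is positive.
x_sample = [-1.0, 0.0, 1.0]
y_sample = [-3.0, -2.0, 0.5, 2.0, 4.0]
s_xy = kamat_statistic(x_sample, y_sample)   # a = 2, b = 2, a' = b' = 0, so S = 4
s_yx = kamat_statistic(y_sample, x_sample)   # roles swapped, so S = -4
```

A large positive S indicates that the y sample is more dispersed than the x sample, and a large negative S indicates the opposite.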
Jackknife test
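Let
\[
\bar{x}_i = \sum_{k \neq i} \frac{x_k}{m-1}
\quad \text{and} \quad
D_i^2 = \sum_{k \neq i} \frac{(x_k - \bar{x}_i)^2}{m-2}
\]
be the sample average and the sample variance for the sample
$\{x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_m\}$. In the same fashion, let
\[
\bar{y}_i = \sum_{k \neq i} \frac{y_k}{n-1}
\quad \text{and} \quad
E_i^2 = \sum_{k \neq i} \frac{(y_k - \bar{y}_i)^2}{n-2}
\]
be the sample average and the sample variance for the sample
$\{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n\}$. Let
\[
S_0 = \log\left[\sum_{k=1}^{m} \frac{(x_k - \bar{x}_0)^2}{m-1}\right], \quad
S_i = \log D_i^2 \ (i = 1, \ldots, m), \quad
\bar{x}_0 = \sum_{k=1}^{m} \frac{x_k}{m},
\]
\[
T_0 = \log\left[\sum_{k=1}^{n} \frac{(y_k - \bar{y}_0)^2}{n-1}\right], \quad
T_j = \log E_j^2 \ (j = 1, \ldots, n), \quad
\bar{y}_0 = \sum_{k=1}^{n} \frac{y_k}{n}.
\]
Compute
\[
A_i = mS_0 - (m-1)S_i \ (i = 1, \ldots, m), \qquad
B_j = nT_0 - (n-1)T_j \ (j = 1, \ldots, n).
\tag{2}
\]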
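We test the equality of variances using
\[
Q = \frac{\bar{A} - \bar{B}}{\sqrt{V_1 + V_2}},
\]
where
\[
\bar{A} = \sum_{i=1}^{m} \frac{A_i}{m}, \quad
V_1 = \sum_{i=1}^{m} \frac{(A_i - \bar{A})^2}{m(m-1)}, \quad
\bar{B} = \sum_{j=1}^{n} \frac{B_j}{n}, \quad
V_2 = \sum_{j=1}^{n} \frac{(B_j - \bar{B})^2}{n(n-1)}.
\]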
The criterion Q is asymptotically distributed as the standard normal distribution.
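The criterion Q can be computed as in the sketch below, which uses the leave-one-out pseudo-values $A_i = mS_0 - (m-1)S_i$ and $B_j = nT_0 - (n-1)T_j$ of the log sample variance. The function names and the example data are illustrative, and only the Python standard library is used.

```python
import math

def log_variance(values):
    """Log of the unbiased sample variance."""
    size = len(values)
    mean = sum(values) / size
    return math.log(sum((v - mean) ** 2 for v in values) / (size - 1))

def pseudo_values(values):
    """Jackknife pseudo-values size*S_0 - (size - 1)*S_i of the log sample variance."""
    values = list(values)
    size = len(values)
    full = log_variance(values)
    # Dropping observation i leaves size - 1 values, so the divisor becomes size - 2.
    return [
        size * full - (size - 1) * log_variance(values[:i] + values[i + 1:])
        for i in range(size)
    ]

def jackknife_q(x, y):
    """Criterion Q = (A_bar - B_bar) / sqrt(V1 + V2)."""
    a, b = pseudo_values(x), pseudo_values(y)
    a_bar, b_bar = sum(a) / len(a), sum(b) / len(b)
    v1 = sum((t - a_bar) ** 2 for t in a) / (len(a) * (len(a) - 1))
    v2 = sum((t - b_bar) ** 2 for t in b) / (len(b) * (len(b) - 1))
    return (a_bar - b_bar) / math.sqrt(v1 + v2)

# Illustrative data: the second sample is the first one shrunk by a factor of 10.
x_sample = [-4.0, -2.0, 0.0, 1.0, 3.0, 5.0]
y_sample = [v / 10.0 for v in x_sample]
q_value = jackknife_q(x_sample, y_sample)   # positive: x is more dispersed than y
```

For a two-sided test at the 5% level, Q would be compared with the standard normal critical value 1.96.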
References
[AL90] Aly, EE. AA.: Simple test for dispersive ordering, Stat. Prob. Lett., 9, 323–325 (1990)
[MI68] Miller, R.G., Jr.: Jackknifing variances, Ann. Math. Stat., 38, 567–582 (1968)
[SU98] Sugiyama, T. and Ushizawa, K.: A non-parametric method to test equality of intermediate latent roots of two populations in a principal component analysis, J. Japan Statist. Soc., 28, 227–235 (1998)
[GO00] Good, PI.: Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, New York (2000)
[HW99] Hollander, M. and Wolfe, D.: Nonparametric Statistical Methods. Wiley, New York (1999)
[PE01] Pesarin, F.: Multivariate Permutation Tests: With Applications in Biostatistics. Wiley, New York (2001)