Test of mean difference in longitudinal data based on block resampling Hirohito Sakurai1 and Masaaki Taguri2 1 2 Division of Computer Science, Hokkaido University, Sapporo 060–0814, JAPAN sakurai@main.eng.hokudai.ac.jp Research Division, National Center for University Entrance Examinations, Tokyo 153–8501, JAPAN taguri@rd.dnc.ac.jp Summary. This paper proposes a testing method for detecting the difference of two means or mean curves in longitudinal data using a block resamplingapproach with permutation analogy For the detection of mean difference, we consider four types of test statistics. Monte Carlo simulations are carried out in order to examine the sizes and powers of the proposed test. Key words: block resampling, longitudinal data 1 Introduction 1 2 Suppose that there are two samples given by {Yi (t)}qi=1 and {Xj (t)}qj=1 for t = 1, . . . , n, and assume that they are mutually independent, where q1 and q2 are numbers of subjects, and n is the number of observed points. We also assume that, for fixed t, Yi (t) and Xj (t) are independent over q1 and q2 subjects, respectively. Then we consider the model ( Yi (t) = p1 (t) + εi (t), Xj (t) = p2 (t) + ηj (t), i = 1, . . . , q1 , j = 1, . . . , q2 , (1) where p1 (t) and p2 (t) are unknown regression functions, and εi (t) and ηj (t) are the error terms having means 0 and finite variances. More general formulations for (1) can be found, for example, in [HH90], [FS04], and references therein. Our problem is then to test ( H0 : p1 (t) = p2 (t) for all t, H1 : p1 (t) 6= p2 (t) for some t, (2) where H0 and H1 are the null and alternative hypotheses, respectively. For example, Fig. 1 shows the wind velocity data which are obtained by an artificial satellite (left panel) and a radar (right panel) on the earth, and are measured at altitudes from 80 km to 90 km every 1 km (11 subjects) during (n =)13 days. For these data, we want to know whether the mean behavior of the two devices in 1088 Hirohito Sakurai and Masaaki Taguri 100 120 80 60 0 20 40 Wind Velocity (m/s) 80 60 40 0 20 Wind Velocity (m/s) 100 120 measuring wind velocity is equal or not. Then the problem is formulated as (2) and the significant difference between them is detected by some methods, which is briefly explained in Section 4. 0 2 4 6 8 10 12 14 0 2 4 Day 6 8 10 12 14 Day Fig. 1. Wind velocity data (left: satellite, right: radar) Several methods on the comparison of two means or regression curves have been proposed, and most of them assume that the error terms are independent and identically distributed (i.i.d.), whose distributions are normal. However, when we analyze a real dataset, it may be unrealistic. If we cannot assume the normality, the nonparametric approach, for example by [BY96], is available. Another possible approach would be an application of resampling. In some settings of dependent data, for example, in time series analysis, the naive bootstrap usually fails to capture the dependent structure of data because of ignoring the order of observations. To overcome this problem, two kinds of bootstrap methods are proposed. One is a model-based approach which resamples from approximately i.i.d. residuals, and the other is a nonparametric, purely model-free bootstrap scheme, which resamples from blocks of observations; see [B02], [HHK03], [L03], and references therein. This paper concerns with the case where longitudinal datafrom two groups are given, and proposes a testing method for detecting the difference between two means or two mean curves in longitudinal data based on block resamplingpproach. As seen in the numerical examinations given below, our approach is superior to [BY96] in power. The rest of the paper is constructed as follows. Section 2 proposes a testing procedure which generates the resamples corresponding to two samples by permutation analogy and calculates a p-value (achieved significance level). In order to investigate the sizes and powers of the proposed testing method, Monte Carlo simulations are carried out in Section 3, and some concluding remarks are summarized in Section 4. 2 Testing Method There are some approaches to detecting the difference between two mean curves, p1 (t) and p2 (t), in (1). In this paper, our interest concentrates on the behavior of four types of test statistics given below. The following statistic is proposed by [HH90]: Test of mean difference in longitudinal data based on block resampling 2n−1 j+g X X Sn = Sn (D1 , . . . , Dn ) = 4 j=0 P t=j+1 !2 3 " n−1 X (Dt+1 − Dt )2 #−1 5 n Dt , 2 t=1 1089 (3) P where Dt = Yt − Xt for t = 1, . . . , n or Dt = Yt−n − Xt−n for t = n + 1, . . . , n + g, 1 2 Yt = qi=1 Yi (t)/q1 , Xt = qj=1 Xj (t)/q2 , g = [np] is the integer part of np, and p is a tuning constant satisfying 0 < p < 1 which is determined by the fully data-driven approach; the second approach described in [HH90, pp.1043–1044]. The statistic (3) is essentially based on kernel estimators of p1 (t) and p2 (t). As another type of test statistics, we can consider T1n = T1n (D1 , . . . , Dn ) = n X |Dt |, (4) Dt2 . (5) t=1 n T2n = T2n (D1 , . . . , Dn ) = X t=1 R In addition to (3), (4) and (5), we here also focus attention on area-difference |p1 (t) − p2 (t)| dt, and then consider the following test statistic T3n = A = T3n (D1 , . . . , Dn ) as an estimator of A constructed by the trapezoidal rule with linear interpolations of adjacent observation values: T3n = T3n (D1 , . . . , Dn ) = 1 2 X n−1 (|Dt | + |Dt+1 |)I{Dt Dt+1 ≥ 0} t=1 + 1 2 X |Dt |2 + |Dt+1|2 n−1 t=1 |Dt | + |Dt+1 | I{Dt Dt+1 < 0}, (6) where I{·} is the indicator function. The values of Sn and Trn (r = 1, 2, 3) may be small when H0 is true, and large when H0 is false. Therefore, the above four statistics enable us to measure the discrepancy between p1 (t) and p2 (t). In this section, we propose a nonparametric testing method for the problem (2) using (3), (4), (5) and (6). We call our testing procedure stated below “Mixed Moving Block Resampling (MMBR) Test.” The main ideas of our testing method are that we generate blocks of observationsn each sample similar to the moving block bootstrap( [K89]), and draw resamples without replacement from the mixed blockshich are constructed by two samples. This is motivated from the technique that can reflect the null hypothesis by resampling from a combined sample. The idea of combining observations of two samples and drawing resamples with replacement from the combined sample is previously considered by [BJV89] and [WT96]. The former is the test of homogeneity of scale, and the latter is that of equality of two means. For the mean difference in longitudinal data, [ST05] proposes the test which generates the resamples with replacement from mixed blocks. For simplicity, let T be a generic notation for Sn , T1n , T2n or T3n . For a given significance level α, the proposed testing algorithm together with Monte Carlo method is described as follows. 1. Calculate tobs = T (Y, X) = T (D1 , . . . , Dn ). 2. Put Cy,t = Yt − Ȳ and Cx,t = Xt − X̄, where Ȳ = n t=1 Xt /n. P Pn t=1 Yt /n and X̄ = 1090 Hirohito Sakurai and Masaaki Taguri 3. Divide {Cy,1 , . . . , Cy,n } and {Cx,1 , . . . , Cx,n } into k(= n − l + 1) successive overlapping blocks with each length l, and put the collection of blocks ξy = {ξy,1 , . . . , ξy,k } and ξx = {ξx,1 , . . . , ξx,k }, where ξy,t = {Cy,t , . . . , Cy,t+l−1 } and ξx,t = {Cx,t , . . . , Cx,t+l−1 } (t = 1, . . . , k), respectively. 4. Combine ξy and ξx , and put ξpooled = {ξy,1 , . . . , ξy,k , ξx,1 , . . . , ξx,k }. ∗b ∗b ∗b ∗b 5. Draw 2m blocks, ξy∗b = {ξy,1 , . . . , ξy,m } and ξx∗b = {ξx,1 , . . . , ξx,m }, without ∗b ∗b ∗b replacement from ξpooled to obtain resamples Y = {Y1 , . . . , Yn } and X ∗b = {X1∗b , . . . , Xn∗b } (b = 1, . . . , B), where m = n/l (if [n/l] is an integer) or m = [n/l] + 1 (otherwise), and [n/l] is the integer part of a real n/l. 6. Calculate t∗b = T (Y ∗b , X ∗b ) = T (D1∗b , . . . , Dn∗b ). 7. Repeating steps 5 and 6 an appropriate number of times B, calculate t∗1 , . . . , t∗B . 8. From steps 1 and 7, approximate the achieved significance levelby ASL = B ∗b ≥ tobs }/B, and reject H0 when ASL ≤ α. b=1 I{t P d d Note that if l = 1, this reduces to permutation test. 3 Numerical Examination We conduct Monte Carlo simulations to investigate the size and power properties of our testing procedure proposed in Section 2. The step 5 of the testing algorithm generates resamples without replacement, however we also make the comparisons of resampling algorithms between “with” and “without” replacement in order to examine the difference of the results. Further, we conduct Bowman and Young’s test for unpaired data ( [BY96]; hereafter termed “BY” for short) in order to clarify the properties of our test. In the level and power studies, the nominal level is α = 0.05 and 0.10. All our results are based on independent 2000 pairs of two samples, {Yi (t)} and {Xj (t)}, where B = 2000 replications of resampling are applied to every two samples in our test. Naturally, the same initial samples are used for the comparisons. We generate initial samples according to (1) whose means are specified by p1 (t) = 0 and p2 (t) = c, where c = 0, 0.2, 0.4, 0.6, 0.8, 1.0. The case of c = 0 or c 6= 0 corresponds to the null hypothesis or the alternative hypothesis being true. The values, q1 , q2 and n, are (q1 , q2 ) = (10, 10), (10, 20), (10, 30), (20, 20), (20, 30), (30, 30) and n = 10. The size of n may be small, however it is useful in practice to examine such a behavior of similar settings for the real data described in Section 1. As for the error terms εi (t) and ηj (t), if p1 (t) and p2 (t) explain most of the correlation structure contained in the data, then we may consider that the errors are nearly i.i.d., or very weak dependency exists in εi (t) and ηj (t). Thus, we choose the following Gaussian AR(1) errors, εi (t) = φεi (t − 1) + zi(t) and ηj (t) = φηj (t − 1) + zj (t), where i.i.d. i.i.d. zi (t) ∼ N (0, τ12 ), zj (t) ∼ N (0, τ22 ), φ = 0, ±0.1, ±0.2, τ12 = τ22 = (1−φ2 )V(εi (t)), and V(εi (t)) = 1, 2, 3, 4, 5. The computation has been carried out for all combinations of these parameters, however, to save space, we restrict ourselves to discussing the case of α = 0.05 and V(εi (t)) = 3. In the MMBR test, it contains a parameter l, namely block length, to be estimated. Since it is preferable that the empirical level is nearly equal to the nominal level α, our choice of l is then the length where the empirical level is close to α. If there are some candidates which have the same level errors, we make the conservative choice, viz., we choose the block length such that the empirical level is less than Test of mean difference in longitudinal data based on block resampling 1091 the nominal level. Further if there are some candidates whose empirical levels are equal, we select the length where the empirical power is the highest among them. The resulting block lengths are given in Table 1. Table 1. Optimum block length for α = 0.05 and V(εi (t)) = 3 (q1 , q2 ) = (10, 10) (q1 , q2 ) = (10, 20) (q1 , q2 ) = (10, 30) φ −0.2 −0.1 0 0.1 0.2 −0.2 −0.1 0 0.1 0.2 −0.2 −0.1 0 0.1 0.2 (a) T1 5 4 2 1 1 5 5 2 1 1 5 5 3 2 1 T2 4 3 2 1 1 5 4 3 1 1 5 4 3 1 1 T3 4 2 1 3 3 4 2 1 3 3 5 3 1 1 2 Sn 2 2 1 1 2 2 2 1 1 2 2 2 1 1 1 (b) T1 5 4 4 4 2 5 5 3 3 3 5 5 2 3 3 T2 4 2 3 3 3 5 2 3 3 3 5 4 3 2 3 T3 4 2 4 4 3 2 1 3 3 4 5 2 3 3 3 Sn 2 1 1 1 2 2 2 1 1 2 2 1 1 1 1 (q1 , q2 ) = (20, 20) (q1 , q2 ) = (20, 30) (q1 , q2 ) = (30, 30) φ −0.2 −0.1 0 0.1 0.2 −0.2 −0.1 0 0.1 0.2 −0.2 −0.1 0 0.1 0.2 (a) T1 5 4 3 1 1 5 5 3 2 2 5 5 3 2 1 T2 5 3 2 1 1 5 4 3 1 1 5 3 2 1 1 T3 4 3 1 2 2 4 3 1 2 2 4 3 1 1 2 Sn 3 2 1 1 3 3 1 1 1 3 2 2 1 2 2 (b) T1 5 4 2 2 2 5 5 3 3 3 4 4 2 2 2 T2 5 2 3 3 3 5 2 2 3 3 4 3 2 2 2 T3 4 3 3 3 3 4 2 3 3 3 4 2 2 2 3 Sn 2 2 1 1 2 3 1 1 1 1 2 1 1 1 2 (a): Resampling with replacement, (b): Resampling without replacement Now, let us first summarize the results of the level studies. The empirical level of BY and our tests are given in Table 2. BY test seems to have a tendency to keep the nominal level, however it underestimates the nominal level except for the case of (q1 , q2 ) = (10, 10). We have also observed that the empirical levels of Trn (r = 1, 2, 3) and Sn keep the nominal level α, and that Trn (r = 1, 2, 3) needs longer block length than Sn to keep the nominal level as is shown in Table 1. Most of the resulting block length for Sn is 1 or 2, while those for T1n , T2n and T3n are greater than or equal to 3. On the other hand, the comparison of resampling methods between “with” and “without” replacement shows the similar behavior. For φ < 0, both methods have a tendency to keep the nominal level, while the level error for φ ≥ 0 seems to be slightly larger when resampling is done without replacement. Next, we discuss the power studies. Since we found similar tendencies among the six cases of (q1 , q2 ), we show the results for (q1 , q2 ) = (20, 20) with φ = 0, 0.1, −0.2. The empirical power seems to be affected by the numbers of q1 and q2 rather than the variance of noise. The increase of noise variance causes the decrease of empirical 1092 Hirohito Sakurai and Masaaki Taguri Table 2. Empirical level (α = 0.05 and V(εi (t)) = 3) (q1 , q2 ) φ (10, 10) −0.2 −0.1 0 0.1 0.2 (10, 20) −0.2 −0.1 0 0.1 0.2 (10, 30) −0.2 −0.1 0 0.1 0.2 (20, 20) −0.2 −0.1 0 0.1 0.2 (20, 30) −0.2 −0.1 0 0.1 0.2 (30, 30) −0.2 −0.1 0 0.1 0.2 T1n 0.049 0.048 0.050 0.058 0.077 0.040 0.049 0.051 0.051 0.071 0.036 0.045 0.050 0.055 0.066 0.041 0.050 0.051 0.052 0.070 0.037 0.048 0.049 0.056 0.076 0.036 0.047 0.053 0.053 0.071 with replacement T2n T3n Sn 0.050 0.047 0.059 0.046 0.048 0.064 0.053 0.062 0.054 0.060 0.099 0.075 0.084 0.129 0.092 0.050 0.046 0.054 0.047 0.057 0.062 0.050 0.053 0.053 0.054 0.089 0.078 0.070 0.126 0.097 0.045 0.052 0.047 0.049 0.044 0.056 0.048 0.055 0.048 0.050 0.084 0.068 0.072 0.116 0.089 0.050 0.046 0.054 0.043 0.050 0.051 0.051 0.058 0.051 0.062 0.086 0.075 0.082 0.115 0.096 0.046 0.045 0.050 0.051 0.047 0.038 0.051 0.059 0.058 0.057 0.084 0.077 0.078 0.114 0.100 0.045 0.048 0.045 0.046 0.056 0.053 0.045 0.057 0.059 0.061 0.084 0.077 0.075 0.114 0.092 without replacement T1n T2n T3n Sn 0.046 0.051 0.044 0.063 0.046 0.050 0.052 0.034 0.054 0.064 0.074 0.057 0.069 0.083 0.095 0.076 0.092 0.112 0.129 0.095 0.038 0.046 0.046 0.059 0.049 0.046 0.056 0.068 0.045 0.055 0.068 0.056 0.059 0.069 0.090 0.080 0.092 0.092 0.126 0.103 0.035 0.045 0.049 0.053 0.046 0.048 0.049 0.035 0.049 0.053 0.064 0.048 0.067 0.073 0.095 0.068 0.089 0.092 0.122 0.094 0.038 0.049 0.044 0.048 0.048 0.051 0.050 0.054 0.057 0.060 0.071 0.053 0.070 0.081 0.093 0.076 0.094 0.104 0.120 0.100 0.035 0.047 0.045 0.054 0.046 0.052 0.048 0.038 0.051 0.064 0.063 0.059 0.062 0.072 0.087 0.078 0.083 0.092 0.117 0.103 0.036 0.043 0.045 0.051 0.044 0.049 0.049 0.042 0.052 0.060 0.069 0.060 0.065 0.076 0.095 0.083 0.087 0.104 0.126 0.095 BY 0.067 0.056 0.054 0.054 0.056 0.037 0.038 0.037 0.038 0.037 0.041 0.043 0.042 0.042 0.042 0.021 0.021 0.025 0.025 0.025 0.029 0.029 0.025 0.025 0.026 0.022 0.022 0.019 0.019 0.020 power as a whole, however the power properties among the five variances given above are nearly the same each other. Thus, we choose and discuss the case where V(εi (t)) = 3. Fig. 2 shows the empirical powers corresponding to the four test statistics, (3) – (6), and BY test. The figure shows that the empirical power of T3n is most powerful among them, and that the relationship among powers corresponding to our and BY tests is given by T3n ≥ T2n ≥ T1n ≥ Sn ≥ BY. This indicates the numerical superiority of our test using the four statistics in power. Especially, the numerical superiority of T3n in power can be confirmed from Fig. 2. As the number of subjects increases, the empirical power is improved. Within the values of 0 ≤ c ≤ 1, the empirical power of T1n is nearly equal to that of T2n , however T2n is slightly Test of mean difference in longitudinal data based on block resampling 1093 0.6 0.8 1.0 0.2 1.0 0.8 0.4 1.0 1.0 1.0 0.4 0.6 0.8 1.0 0.6 0.8 1.0 T_{1n} T_{2n} T_{3n} Sn BY 0.8 power 0.6 0.0 0.2 power 0.8 T_{1n} T_{2n} T_{3n} Sn BY 0.0 0.4 0.2 (iii) φ = −0.2 0.2 0.4 0.0 0.2 0.0 c 0.4 1.0 0.8 T_{1n} T_{2n} T_{3n} Sn BY 0.6 0.8 (ii) φ = 0.1 0.2 power 0.6 c (i) φ = 0 0.0 0.6 power 0.2 0.0 c 0.6 0.4 0.4 0.2 0.0 0.2 0.0 0.0 T_{1n} T_{2n} T_{3n} Sn BY 0.4 1.0 0.8 T_{1n} T_{2n} T_{3n} Sn BY 0.6 power 0.6 0.4 0.0 0.2 power 0.8 T_{1n} T_{2n} T_{3n} Sn BY 0.4 1.0 higher than T1n . The results of “with” and “without” resampling were nearly equal to each other. 0.0 c (iv) φ = 0 0.2 0.4 0.6 c (v) φ = 0.1 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 c (vi) φ = −0.2 Fig. 2. Empirical power for (q1 , q2 ) = (20, 20) For each combination of q1 and q2 , the panels (i)–(iii) are obtained by resampling with replacement and the ones (iv)–(vi) are by resampling without replacement. 4 Concluding Remarks In this paper we have proposed the testing method for two means in longitudinal data based on the block resampling technique with permutation analogy. Our numerical studies indicate the applicability of MMBR test for weakly dependent data even when the sample size is very small. In some cases the effectiveness of application of block resampling could be confirmed. Applying our test with every possible block length to the data given in Fig. 1, we have observed that there is a possibility of the significant difference between the satellite and radar in measuring wind velocity, which is also supported by the result of [BY96]. However, the problem on block length selection in the block resampling is very important, and the development of a fully data-driven approach to selecting block length in MMBR test will be needed for practical data analyses. References [BJV89] Boos, D., Janssen, P. and Veraverbeke, N.: Resampling from centered data in the two sample problem. J. Statist. Plan. Inf., 21, 327–345 (1989) 1094 [BY96] Hirohito Sakurai and Masaaki Taguri Bowman, A. and Young, S.: Graphical comparison of nonparametric curves. Appl. Statist., 45, 83–98 (1996) [B02] Bühlmann, P.: Bootstraps for time series. Statist. Sci., 17, 52–72 (2002) [FS04] Ferreira, E. and Stute, W.: Testing for differences between conditional means in a time series context. J. Amer. Statist. Assoc., 99, 169–174 (2004) [HH90] Hall, P. and Hart, J. D.: Bootstrap test for difference between means in nonparametric regression. J. Amer. Statist. Assoc., 85, 1039–1049 (1990) [HHK03] Härdle, W., Horowitz, J. and Kreiss, J.-P.: Bootstrap methods for time series. Internat. Statist. Rev., 71, 435–459 (2003) [K89] Künsch, H. R.: The jackknife and the bootstrap for general stationary observations. Ann. Statist., 17, 1217–1241 (1989) [L03] Lahiri, S. N.: Resampling Methods for Dependent Data. Springer, New York (2003) [ST05] Sakurai, H. and Taguri, M.: Test for difference of two means in longitudinal data using moving block bootstrap. Proc. 5th IASC Asian Conference on Statistical Computing, 139–142 (2005) [WT96] Wang, J. and Taguri, M.: Bootstrap method — an introduction from a two sample problem (in Japanese). Proc. Inst. Statist. Math., 44, 3–18 (1996)