Cover letter:

Dear Editor,

We thank you and the reviewers for the constructive and helpful comments and suggestions, and we have revised our manuscript accordingly. The abstract has been structured into Background, Methods, Results, and Conclusion, and we have added an Authors' Contributions section. The following is a point-by-point response to the issues raised by the reviewers.

Sincerely,
Guimin Gao
University of Alabama at Birmingham

Response to Reviewer Dr. Hon Keung Tony Ng

Issue 1: The method of Wang et al. (2004) can be problematic because the numbers X1 and X2 are not independent when the threshold (tau) is set to be the 90th percentile. The authors may need to justify the use of the test procedure for two independent binomial proportions. Moreover, the tests considered in the manuscript (as well as in Wang et al.) are only a few of the existing test procedures for this purpose. One may refer to Newcombe (1998) and some subsequent work for reference.

To address this issue we have added text, highlighted in yellow, in the Methods section, subsection Development of the Tests (page 5). Our response to the independence issue raised by Dr. Ng is threefold.

First, we discuss and evaluate the method of Wang et al. (2004) because we compare our new method against it for the purpose of evaluating the new method. Moreover, the method of Wang et al. has been used by several investigators for lifespan studies in the field of aging, making it especially pertinent.

Second, the method of Wang et al. uses the test procedures for two independent binomial proportions described by Mehrotra et al. (2003), and these procedures require that X1 and X2 be independent, where Xj (j = 1, 2) is the number of observations exceeding the threshold tau in the j-th group. We acknowledge that in the method of Wang et al., X1 and X2 are not independent when the threshold tau is set to be the 90th percentile, and that this creates a theoretical problem.
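To make the two-independent-binomial setting concrete, here is a minimal sketch. It uses a simple pooled (Wald) z-test rather than the procedures of Mehrotra et al. (2003) that the manuscript actually follows, and the counts and sample sizes are made up for illustration:

```python
import math

# Sketch only: a pooled two-proportion z-test for H0: p1 = p2.
# X1, X2 = numbers of lifespans exceeding a threshold tau fixed in advance;
# with tau fixed a priori, X1 and X2 are independent binomials.
def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for equality of two binomial proportions."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical counts: 9 of 50 treatment and 3 of 50 control lifespans
# exceed the pre-specified tau.
z, p = two_proportion_z_test(9, 50, 3, 50)
```

When tau is instead taken as a sample quantile of the pooled data, the counts are constrained by construction, which is exactly the dependence Dr. Ng points out.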
Nevertheless, on an empirical level, our simulations show that for the sample sizes we considered this is not an apparent problem: the method of Wang et al. has high power and controls the type I error quite well in simulation studies using sample sizes that are practical for lifespan studies. We have mentioned this in the manuscript. When an assumption is not strictly met, simulation studies (including estimation of power and type I error) are a reasonable way to evaluate the methods, as we have done in this manuscript. Further, that this theoretical concern seems not to be of practical import is bolstered by the fact that this procedure, if one replaces the 90th percentile with the 50th, is just the 'old-fashioned' median test (Siegel and Castellan Jr. 1988), advocated and used in this manner without concern (Siegel S, Castellan Jr. NJ: Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill College, 1988, p. 200).

Third, although X1 and X2 are not independent if we pick a sample quantile as the threshold, this is only one way to proceed; we have also offered results when the threshold is set in advance according to prior knowledge (in which case X1 and X2 are independent). We have mentioned this in the manuscript.

With respect to the work of Newcombe (1998) and the fact that more test procedures for two independent binomial proportions can be found there, we acknowledge the point and believe that studying the performance of the tests offered by Newcombe will be a valuable endeavor for future research. Yet no one paper can study everything, and we believe that introducing the tests of Newcombe is beyond the scope of the current manuscript.

Issue 2: In the simulation study, the underlying value of the threshold is assumed to be known and set to be 130. In real-life situations, this threshold is usually unknown, and it plays an important role in the performance of the test procedures studied in the manuscript.
One may consider a simulation study to investigate the effect of different choices of the threshold value.

Dr. Ng raises an important opportunity for us to clarify this point, though we must respectfully disagree on one aspect. Specifically, as people who participate extensively in model-organism longevity research, we can say with confidence that in real-life situations one usually does know the threshold of interest a priori. If one is studying mice, rats, dogs, fruit flies, or many other species, enough is known about their lifespans from past research that one could easily choose a threshold of interest a priori. By definition, we most often study the most commonly studied organisms, and for these organisms we have a very good idea of the threshold we would find of interest to define 'old'. However, we recognize that we will not have such knowledge in all cases. It is for this reason that, when analyzing the simulated data, we also consider a threshold of the 90th percentile of the data, allowing for an ad hoc, data-based determination of the threshold. We now clarify this on page 9 of the revised manuscript (section Delineation of Tests to Be Evaluated).

Issue 3.

3.A. When defining the new variable Zi = I(Yi > tau)Yi, if a large number of lifespans are < tau, it may raise problems for the rank-based test (Wilcoxon-Mann-Whitney) and the permutation test, since a large number of the Zi are equal to zero. On the other hand, I suspect that the power of the test may not be good when the number of Yi > tau is small in either the treatment group or the control group.

We acknowledge that if the observed data satisfy H0,A (i.e., the proportions of subjects with Yi > tau in the two groups are equal), the new methods (tests for H0,C) may have lower power than the tests for H0,B when a large number of the Zi are equal to zero in either group. We have shown this in our simulation studies (see Table 3) in the manuscript.
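The construction of Z discussed above can be sketched in a few lines (the lifespan values below are made up for illustration):

```python
# Sketch: Z_i = I(Y_i > tau) * Y_i, so every lifespan at or below the
# threshold tau maps to zero. The made-up values show how zeros
# accumulate when few lifespans exceed tau.
tau = 130.0

def make_z(lifespans, tau):
    return [y if y > tau else 0.0 for y in lifespans]

z_values = make_z([100.0, 140.0, 129.0, 200.0, 90.0], tau)
# Only the two lifespans above tau keep their values; the rest become zero.
```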
If the observed data do not satisfy H0,A, then the group with the longer lifespan (group 1) usually has a higher proportion of subjects with Yi > tau than the other group (group 2). Therefore, group 1 has a lower proportion of zeros among the Z values than group 2. Usually this increases the power of the new methods (tests for H0,C) relative to the tests for H0,B, as confirmed by our simulation studies (see Table 4). In brief, no test can have excellent power in all situations, especially those in which departures from the null hypothesis are very small.

3.B. Here is an example. Let both the treatment group and the control group have 10 subjects, with 6 (9) subjects in the treatment (control) group having lifespan < 130, and the following Z values:

Treatment: 0, 0, 0, 0, 0, 0, 190, 190, 190, 190
Control: 0, 0, 0, 0, 0, 0, 0, 0, 0, 200

It is clear that the treatment group has a longer lifespan, but the (new) test for H0,C does not correctly reject the null hypothesis of the two groups having equal maximum lifespan.

Let us offer an alternative view and reasoning here. Dr. Ng states, "It is clear that treatment group has a longer lifespan." This statement is assuredly true of the sample data provided if one defines 'longer lifespan' on the basis of a mean, a median, or the proportion above a lifespan of, say, 180. On the other hand, the statement is false if one defines 'longer lifespan' on the basis of the proportion above a lifespan of, say, 195. Moreover, one does not perform inferential tests to draw conclusions about samples. One conducts inferential tests to draw conclusions about populations from samples, and it is not at all clear from the data provided that the treatment population has a longer lifespan than the control population. Hence the basic premise of the argument that this example is intended to support does not hold up, and the conclusion that "the (new) test for H0,C does not correctly reject the null hypothesis" is not supported.
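As a numerical check on this example, a simple Monte Carlo permutation test on the Z values above (a sketch using the difference in group means as the statistic, not the manuscript's exact T-based procedure) fails to reject at any conventional level:

```python
import random

# Sketch: Monte Carlo permutation test on the example Z values,
# using |mean(treatment) - mean(control)| as the test statistic.
random.seed(0)
treatment = [0, 0, 0, 0, 0, 0, 190, 190, 190, 190]
control = [0, 0, 0, 0, 0, 0, 0, 0, 0, 200]

pooled = treatment + control
n1 = len(treatment)
obs = abs(sum(treatment) / n1 - sum(control) / len(control))  # = 56

n_perm = 10000
count = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    g1, g2 = pooled[:n1], pooled[n1:]
    if abs(sum(g1) / n1 - sum(g2) / len(g2)) >= obs:
        count += 1
p_value = count / n_perm
# p_value is well above 0.05, so these data do not reject the null.
```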
For the example above, neither the traditional method (Wilcoxon-Mann-Whitney) for testing H0,B nor the method for testing H0,C can reject the null hypothesis. We cannot conclude that the new method for testing H0,C has lower power for this example; that is, we do not know that the null hypothesis should have been rejected.

Issue 4. Please explain clearly how the permutation test is done.

We thank Dr. Ng for pointing out the need for greater clarity here. To address this issue we have added a paragraph (on page 10), highlighted in yellow, in the Methods section, subsection Delineation of Tests to Be Evaluated. Suppose that in the observed dataset the treatment (control) group has n1 (n2) subjects. In the permutation test, we pool all n1 + n2 subjects and then generate 1000 replicated datasets. For each replicated dataset, we randomly sample n1 subjects from the n1 + n2 subjects and assign them to the treatment group, assigning the remaining n2 subjects to the control group. We run a T-test on the observed data and on each of the 1000 replicated datasets. Let T0 be the T value for the observed dataset; the p-value is then calculated as the proportion of replicated datasets with absolute T values greater than or equal to the absolute value of T0.

Issue 5. In Table 2, the authors mentioned, "The bolded values are those simulated type I error rate which are significantly higher than the nominal alpha level." Please give an explicit definition of "significantly higher." It is confusing that 0.014 is bolded but 0.017 and 0.016 are not bolded in the same column (for N = 50, alpha = 0.01).

We have added a definition under Table 2: the bolded values are those simulated type I error rates that are significantly higher than the nominal alpha level at the two-tailed 95% confidence level (i.e., the lower bound of the confidence interval is higher than alpha). Note that for the permutation tests we used 1000 replicated datasets, and for the other tests we used 5000 replicated datasets.
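The effect of the number of replicated datasets on the bolding decision can be illustrated with a short sketch. It assumes a normal-approximation (Wald) interval, and it assumes for illustration that the bolded 0.014 came from a test run with 5000 replicates while the unbolded 0.017 came from a permutation test run with 1000 replicates:

```python
import math

# Sketch: lower bound of a two-sided 95% Wald confidence interval for a
# simulated type I error rate. A value is bolded when this lower bound
# exceeds the nominal alpha. The pairing of rates with replicate counts
# below is our assumption for illustration.
def ci_lower(rate, n_reps, z=1.96):
    return rate - z * math.sqrt(rate * (1 - rate) / n_reps)

alpha = 0.01
lb_5000 = ci_lower(0.014, 5000)  # assumed non-permutation test: 5000 reps
lb_1000 = ci_lower(0.017, 1000)  # assumed permutation test: 1000 reps
# With 5000 replicates, the bound for 0.014 lies above 0.01 (bolded);
# with only 1000 replicates, the bound for 0.017 does not (not bolded).
```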
The reason 0.014 is bolded while 0.017 and 0.016 are not is that we used 1000 replicated datasets for the permutation tests and 5000 for the other tests; different numbers of replicated datasets affect the widths of the estimated confidence intervals.

Response to Reviewer Dr. Huixia Wang

Issue 1: Note that E(Z) = P(Y > tau)E(Y | Y > tau). Therefore, the new test for H0,C is really testing whether P(Y > tau | X = 1)E(Y | Y > tau, X = 1) = P(Y > tau | X = 0)E(Y | Y > tau, X = 0), while the method for H0,B is testing whether E(Y | Y > tau, X = 1) = E(Y | Y > tau, X = 0), and the method for H0,A is testing whether P(Y > tau | X = 1) = P(Y > tau | X = 0). The difference in E(Z) between groups consists of two parts: the probability of Y exceeding tau, and the expectation of Y for the subpopulation exceeding tau. The difference in the first part is the focus of H0,A, while the difference in the second part is the interest of H0,B. The defined Z is like zero-inflated data. Dominici and Zeger (2005) studied a similar problem. The authors may consider discussing such connections in the paper.

These comments are very useful for explaining our methods theoretically. Following them, we have added text (highlighted in yellow on pages 7-8, in the subsection Development of the Tests) describing the property of the test for H0,C and discussing the connections among the tests for H0,A, H0,B, and H0,C. We have also discussed the connections between the method of Dominici and Zeger (2005) and our methods: their method estimates the mean difference of nonnegative random variables (Y) between two groups, whereas our methods test the mean difference of random variables (Y) that are greater than a threshold (more details in the manuscript).

Issue 2. Page 8, line 15. It seems surprising to me that the method for testing H0,B has a larger type I error than the new method for testing H0,C in simulation 1.
We checked the results for simulation 1 (in Table 2) and found that the type I error of the new method for testing H0,C is comparable to that of the method for testing H0,B. We have stated this in the manuscript (highlighted in yellow on page 11, in the Results section) and deleted the original statement that "the new methods for testing H0,C have lower type I error rates than the corresponding methods for tests of H0,B."

Issue 3. The discussion of the real data analysis is too brief. The authors may consider providing some graphics, e.g., histograms, to help readers understand the sources of differences between the two groups. Quantities such as the proportion of observations exceeding tau and some estimate of E(Y | Y > tau) in each group may also be useful for demonstration.

We have added histograms (Figures 3 and 4) describing the two real datasets, with descriptions (highlighted in yellow on page 13 in the section ILLUSTRATION WITH REAL DATA): from Figure 3, we can see that the upper tails of the histograms of the two groups are different, and similar results can be found in Figure 4. We also added descriptions of the proportions of the observations greater than tau, the sample means of the observations greater than tau, and the sample means of the Z values in the two corresponding groups (see text highlighted in yellow on pages 13-14 in the section ILLUSTRATION WITH REAL DATA).

Issue 4. Table 2. The confidence intervals for type I errors should be included.

We have added 95% confidence intervals for the type I errors in Table 2.

Issue 5. Table 1. The parameters in simulation 1 are exactly the same as in simulation 3. So are the plots.

We have added a proportion parameter rj (j = 0, 1). In simulation 1, r1 = 0.9, and in simulation 3, r1 = 0.8. The corresponding plots are now different (see Table 1).

Issue 6. Page 5, last line. "Sample mean" of Z should be population mean of Z.

We have changed the words "sample mean" to "population mean."

Issue 7. Page 9, line 18.
"90th percentile" should be "the 90th percentile."

We have added "the" before "90th percentile."