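Methods

Estimation of unpaired SSMD

SSMD has recently been proposed for measuring the magnitude of difference between two populations [1]. Let random variables $P_1$ and $P_2$ denote two populations of interest and $D$ denote the difference between $P_1$ and $P_2$. Suppose $P_1$ has mean $\mu_1$ and variance $\sigma_1^2$, and $P_2$ has mean $\mu_2$ and variance $\sigma_2^2$. The covariance between these two populations is $\sigma_{12}$. Then SSMD (denoted as $\beta$) is defined as the ratio of the mean $\mu_D$ to the standard deviation $\sigma_D$ of the difference $D$, namely

$$\beta = \frac{\mu_D}{\sigma_D} = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}}}.$$

When the two populations are independent, we are interested in the unpaired difference between the two populations. The SSMD corresponding to the unpaired difference is called "unpaired SSMD", which is

$$\beta = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}}.$$

If the two independent populations have equal variances (namely $\sigma_1^2 = \sigma_2^2 = \sigma^2$), then

$$\beta = \frac{\mu_1 - \mu_2}{\sqrt{2}\,\sigma}.$$

SSMD as defined above is a population parameter, which needs to be estimated from observed samples. Suppose we have one sample (with sample size $n_1$, sample mean $\bar{X}_1$ and sample standard deviation $s_1$) from Population $P_1$ and another independent sample (with $n_2$, $\bar{X}_2$ and $s_2$) from Population $P_2$. Let $N = n_1 + n_2$. Zhang [1] derived the maximum-likelihood estimate (MLE) and the method-of-moment (MM) estimate of unpaired SSMD when the two compared groups have normal distributions with unequal variances. When the two compared groups have normal distributions with equal variance, the uniformly minimal variance unbiased estimate (UMVUE) of unpaired SSMD [2] is

$$\hat{\beta}_{UMVUE} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{2}{K}\left((n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\right)}}, \qquad K = 2\left(\frac{\Gamma\left(\frac{N-2}{2}\right)}{\Gamma\left(\frac{N-3}{2}\right)}\right)^2 \approx N - 3.5,$$

when $n_1 \ge 2$, $n_2 \ge 2$.

It is well known that if two random variables $X$ and $V$ are independently distributed with $X \sim N(\delta, 1)$ and $V \sim \chi^2(p)$, then the ratio $Y = X / \sqrt{V/p}$ has a noncentral t-distribution with $p$ degrees of freedom and noncentrality parameter $\delta$. We know that

$$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2,\ \left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)\sigma^2\right), \quad\text{namely}\quad \frac{\bar{X}_1 - \bar{X}_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim N\left(\frac{\mu_1 - \mu_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},\ 1\right),$$

and

$$\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{\sigma^2} \sim \chi^2(N - 2).$$

Therefore,

$$T = \frac{\left(\bar{X}_1 - \bar{X}_2\right)\Big/\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{N - 2}}} \sim \text{noncentral } t\left(N - 2,\ \frac{\mu_1 - \mu_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\right),$$

namely

$$T \sim \text{noncentral } t\left(N - 2,\ \sqrt{\frac{2}{\frac{1}{n_1} + \frac{1}{n_2}}}\ \beta\right).$$

In primary HTS experiments, $n_1 = 1$ for most investigated siRNAs, and $s_1^2$ does not exist when $n_1 = 1$. In this case, writing $X_{11}$ for the single observation from $P_1$, the UMVUE of unpaired SSMD is

$$\hat{\beta}_{UMVUE} = \frac{X_{11} - \bar{X}_2}{\sqrt{\frac{2}{K}(n_2 - 1)s_2^2}}, \qquad K = 2\left(\frac{\Gamma\left(\frac{n_2-1}{2}\right)}{\Gamma\left(\frac{n_2-2}{2}\right)}\right)^2 \approx n_2 - 2.5.$$

We know that

$$X_{11} - \bar{X}_2 \sim N\left(\mu_1 - \mu_2,\ \left(1 + \tfrac{1}{n_2}\right)\sigma^2\right), \quad\text{namely}\quad \frac{X_{11} - \bar{X}_2}{\sigma\sqrt{1 + \frac{1}{n_2}}} \sim N\left(\frac{\mu_1 - \mu_2}{\sigma\sqrt{1 + \frac{1}{n_2}}},\ 1\right),$$

and

$$\frac{(n_2 - 1)s_2^2}{\sigma^2} \sim \chi^2(n_2 - 1).$$

Therefore,

$$T = \frac{X_{11} - \bar{X}_2}{\sqrt{1 + \frac{1}{n_2}}\ s_2} \sim \text{noncentral } t\left(n_2 - 1,\ \sqrt{\frac{2}{1 + \frac{1}{n_2}}}\ \beta\right).$$

If we set $(n_1 - 1)s_1^2 = 0$ when $n_1 = 1$, then for both $n_1 = 1$ and $n_1 \ge 2$ (i.e., $n_1 \ge 1$),

$$T = \frac{\left(\bar{X}_1 - \bar{X}_2\right)\Big/\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{N - 2}}} \sim \text{noncentral } t(\nu, b\beta)$$

and

$$\hat{\beta}_{UMVUE} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{2}{K}\left((n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\right)}} = kT,$$

where

$$K = 2\left(\frac{\Gamma\left(\frac{N-2}{2}\right)}{\Gamma\left(\frac{N-3}{2}\right)}\right)^2 \approx N - 3.5, \quad \nu = N - 2, \quad b = \sqrt{\frac{2}{\frac{1}{n_1} + \frac{1}{n_2}}}, \quad k = \sqrt{\frac{K\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}{2(N - 2)}}.$$

Estimation of paired SSMD

When two populations are correlated, we are usually interested in the paired difference. The SSMD corresponding to the paired difference is called "paired SSMD". Suppose we observe $n$ pairs of samples, $(X_{11}, X_{21}), (X_{12}, X_{22}), \ldots, (X_{1n}, X_{2n})$, from populations $P_1$ and $P_2$ respectively. Let $D_j$ be the difference between the $j$th pair of samples, namely $D_j = X_{1j} - X_{2j}$. Let $\bar{D}$ and $s_D$ be the sample mean and sample standard deviation of $D$ respectively, namely

$$\bar{D} = \frac{1}{n}\sum_{j=1}^{n} D_j \quad\text{and}\quad s_D^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(D_j - \bar{D}\right)^2.$$

Assume that $D$ is normally distributed, namely $D \sim N(\mu_D, \sigma_D^2)$. Then the MM, MLE and UMVUE of the paired SSMD are

$$\hat{\beta}_{MM} = \frac{\bar{D}}{s_D}, \qquad \hat{\beta}_{MLE} = \sqrt{\frac{n}{n-1}}\,\frac{\bar{D}}{s_D} \qquad\text{and}\qquad \hat{\beta}_{UMVUE} = \sqrt{\frac{2}{n-1}}\,\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\,\frac{\bar{D}}{s_D}$$

respectively. The proof for the MM and MLE estimates is trivial. The proof for the UMVUE is as follows.

When $D \sim N(\mu_D, \sigma_D^2)$, the following properties hold: $(\bar{D}, s_D^2)$ is a complete sufficient statistic of $(\mu_D, \sigma_D^2)$; $\bar{D}$ and $s_D^2$ are independent of each other; and $(n-1)s_D^2/\sigma_D^2$ is distributed as $\chi^2(n-1)$. Based on these properties, we have

$$E\left[\frac{\sigma_D}{\sqrt{(n-1)s_D^2}}\right] = \int_0^\infty x^{-\frac{1}{2}}\,\frac{1}{2^{\frac{n-1}{2}}\,\Gamma\left(\frac{n-1}{2}\right)}\,x^{\frac{n-1}{2}-1}e^{-\frac{x}{2}}\,dx = \frac{\Gamma\left(\frac{n-2}{2}\right)}{\sqrt{2}\,\Gamma\left(\frac{n-1}{2}\right)}$$

and

$$E\left[\frac{\sqrt{2}\,\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\,\frac{\bar{D}}{\sqrt{(n-1)s_D^2}}\right] = \frac{\sqrt{2}\,\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\,E\left[\bar{D}\right]\,E\left[\frac{1}{\sqrt{(n-1)s_D^2}}\right] = \frac{\mu_D}{\sigma_D} = \beta.$$

Set

$$\hat{\beta} = \frac{\sqrt{2}\,\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\,\frac{\bar{D}}{\sqrt{(n-1)s_D^2}} = \sqrt{\frac{2}{n-1}}\,\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\,\frac{\bar{D}}{s_D}.$$

Then $\hat{\beta}$ is a function of the complete sufficient statistic $(\bar{D}, s_D^2)$ and $\hat{\beta}$ is an unbiased estimate of $\beta$. Thus, $\hat{\beta}$ is a UMVUE of $\beta$.

The $D_j$'s are independently distributed as $N(\mu_D, \sigma_D^2)$, so $\bar{D} \sim N\left(\mu_D, \tfrac{1}{n}\sigma_D^2\right)$, namely $\sqrt{n}\,\bar{D}/\sigma_D \sim N(\sqrt{n}\,\beta, 1)$, and $(n-1)s_D^2/\sigma_D^2 \sim \chi^2(n-1)$. Therefore,

$$\frac{\sqrt{n}\,\bar{D}/\sigma_D}{\sqrt{\dfrac{(n-1)s_D^2/\sigma_D^2}{n-1}}} \sim \text{noncentral } t\left(n-1, \sqrt{n}\,\beta\right), \quad\text{namely}\quad \frac{\sqrt{n}\,\bar{D}}{s_D} \sim \text{noncentral } t\left(n-1, \sqrt{n}\,\beta\right).$$

Let $T = \sqrt{n}\,\bar{D}/s_D$. Then $T \sim \text{noncentral } t(\nu, b\beta)$ and $\hat{\beta}_{UMVUE} = kT$, where

$$\nu = n - 1, \quad b = \sqrt{n} \quad\text{and}\quad k = \sqrt{\frac{2}{n(n-1)}}\,\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}.$$

Confidence interval of SSMD estimates

Based on the estimates of SSMD and their distributions derived above, we have $T \sim \text{noncentral } t(\nu, b\beta)$ and $\hat{\beta}_{UMVUE} = kT$ for both unpaired and paired SSMDs, although

$$\nu = n_1 + n_2 - 2, \quad b = \sqrt{\frac{2}{\frac{1}{n_1} + \frac{1}{n_2}}} \quad\text{and}\quad k = \frac{\Gamma\left(\frac{n_1+n_2-2}{2}\right)}{\Gamma\left(\frac{n_1+n_2-3}{2}\right)}\sqrt{\frac{\frac{1}{n_1} + \frac{1}{n_2}}{n_1 + n_2 - 2}}$$

in unpaired SSMD and

$$\nu = n - 1, \quad b = \sqrt{n} \quad\text{and}\quad k = \sqrt{\frac{2}{n(n-1)}}\,\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}$$

in paired SSMD. Let $F_{t(\nu, b\beta)}(\cdot)$ be the cumulative distribution function of noncentral $t(\nu, b\beta)$ and $T_{obs}$ be the observed value of $T$. Because $T \sim \text{noncentral } t(\nu, b\beta)$, we can find $\beta_L$ and $\beta_U$ such that

$$F_{t(\nu, b\beta_L)}(T_{obs}) = 1 - \frac{\alpha}{2} \quad\text{and}\quad F_{t(\nu, b\beta_U)}(T_{obs}) = \frac{\alpha}{2}.$$

Then $(\beta_L, \beta_U)$ is a $1 - \alpha$ confidence interval of SSMD.

The variance of a noncentral $t(\nu, \delta)$ with $\nu > 2$ is

$$\frac{\nu\left(1 + \delta^2\right)}{\nu - 2} - \frac{\nu\,\delta^2}{2}\left(\frac{\Gamma\left(\frac{\nu-1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\right)^2.$$

Thus, the variance of $T$ is

$$Var(T) = \frac{\nu\left(1 + b^2\beta^2\right)}{\nu - 2} - \frac{\nu\,b^2\beta^2}{2}\left(\frac{\Gamma\left(\frac{\nu-1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\right)^2.$$

Using $\hat{\beta}_{UMVUE} = kT$, the variance of $\hat{\beta}_{UMVUE}$ is

$$Var\left(\hat{\beta}_{UMVUE}\right) = k^2\,Var(T) = k^2\left[\frac{\nu\left(1 + b^2\beta^2\right)}{\nu - 2} - \frac{\nu\,b^2\beta^2}{2}\left(\frac{\Gamma\left(\frac{\nu-1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\right)^2\right].$$

False negative rate and restricted false positive rate

Let us focus on the situation where we want to select siRNAs with large positive effects, namely the siRNAs with $\beta \ge c_1$, where $\beta$ denotes SSMD and $c_1$ is the preset lowest value for large effects.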
In this situation, the FNR is the probability that we conclude $\beta < c_1$ whereas actually $\beta \ge c_1$. The maximum FNR in a decision rule is called the false negative level (FNL). Traditionally, the false positive rate is the probability that we conclude $\beta \ge c_1$ whereas actually $\beta < c_1$. However, in RNAi HTS experiments, scientists are usually interested in controlling the probability of concluding $\beta \ge c_1$ given $\beta \le c_2$, where $0 \le c_2 < c_1$. This probability is called the restricted false positive rate (RFPR) [2, 3]. The maximum RFPR in a decision rule is called the restricted false positive level (RFPL).

For example, for an observed SSMD value $\beta_{obs}$ ($\beta_{obs} > 0$), if we select all the siRNAs with $\hat{\beta} \ge \beta_{obs}$ as hits, the FNR in this process (for $c_1 > c_2 \ge 0$) is

$$FNR = \Pr\left(\hat{\beta} < \beta_{obs} \mid \beta \ge c_1\right) = \Pr\left(T < \tfrac{\beta_{obs}}{k} \mid \beta \ge c_1\right) = \Pr\left(t(\nu, b\beta) < \tfrac{\beta_{obs}}{k} \mid \beta \ge c_1\right) \le F_{t(\nu, bc_1)}\left(\tfrac{\beta_{obs}}{k}\right)$$

and the RFPR in this process is

$$RFPR = \Pr\left(\hat{\beta} \ge \beta_{obs} \mid \beta \le c_2\right) = \Pr\left(T \ge \tfrac{\beta_{obs}}{k} \mid \beta \le c_2\right) = \Pr\left(t(\nu, b\beta) \ge \tfrac{\beta_{obs}}{k} \mid \beta \le c_2\right) \le 1 - F_{t(\nu, bc_2)}\left(\tfrac{\beta_{obs}}{k}\right);$$

thus, the FNL and RFPL in this process are $F_{t(\nu, bc_1)}\left(\beta_{obs}/k\right)$ and $1 - F_{t(\nu, bc_2)}\left(\beta_{obs}/k\right)$ respectively.

Similarly, for an observed SSMD value $\beta_{obs}$ ($\beta_{obs} < 0$), if we select all the siRNAs with $\hat{\beta} \le \beta_{obs}$ as hits, the FNL and RFPL in this process (for $c_1 < c_2 \le 0$) are $1 - F_{t(\nu, bc_1)}\left(\beta_{obs}/k\right)$ and $F_{t(\nu, bc_2)}\left(\beta_{obs}/k\right)$ respectively.

Consequently, when we use the SSMD-based ranking method for selecting siRNAs with a large positive value, in the process that we select all the $m$ siRNAs with $\hat{\beta} \ge \beta^*$ ($\beta^* > 0$) as hits, the FNL and RFPL are $F_{t(\nu, bc_1)}\left(\beta^*/k\right)$ and $1 - F_{t(\nu, bc_2)}\left(\beta^*/k\right)$ respectively; when we use the SSMD-based ranking method for selecting siRNAs with a large negative value, in the process that we select all the $m$ siRNAs with $\hat{\beta} \le \beta^*$ ($\beta^* < 0$) as hits, the FNL and RFPL are $1 - F_{t(\nu, bc_1)}\left(\beta^*/k\right)$ and $F_{t(\nu, bc_2)}\left(\beta^*/k\right)$ respectively (Selection Criteria Ia and IIa in Table 1).

Hit selection using SSMD-based testing methods

Based on $T \sim \text{noncentral } t(\nu, b\beta)$ and $\hat{\beta} = kT$, we can determine a selection criterion so that a specific FNL or RFPL can be achieved.
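Every FNL and RFPL given so far is an evaluation of a noncentral t distribution function at the cutoff divided by $k$. As a minimal sketch, assuming SciPy's `scipy.stats.nct` is an acceptable numerical stand-in for $F_{t(\nu, b\beta)}$, the levels for the positive-direction ranking rule might be computed as follows. The function names and the example values ($n = 4$ pairs, $c_1 = 3$, $c_2 = 0.25$, cutoff $\beta^* = 2$) are illustrative assumptions and do not come from the text.

```python
import math

from scipy.special import gammaln
from scipy.stats import nct


def unpaired_constants(n1, n2):
    """Return (nu, b, k) for unpaired SSMD; needs n1 >= 1 and n1 + n2 >= 4."""
    N = n1 + n2
    h = 1.0 / n1 + 1.0 / n2
    nu = N - 2
    b = math.sqrt(2.0 / h)
    # Gamma((N-2)/2) / Gamma((N-3)/2), evaluated on the log scale to avoid overflow
    gamma_ratio = math.exp(gammaln((N - 2) / 2.0) - gammaln((N - 3) / 2.0))
    k = gamma_ratio * math.sqrt(h / (N - 2))
    return nu, b, k


def paired_constants(n):
    """Return (nu, b, k) for paired SSMD with n pairs; needs n >= 3."""
    nu = n - 1
    b = math.sqrt(n)
    # Gamma((n-1)/2) / Gamma((n-2)/2), evaluated on the log scale
    gamma_ratio = math.exp(gammaln((n - 1) / 2.0) - gammaln((n - 2) / 2.0))
    k = math.sqrt(2.0 / (n * (n - 1))) * gamma_ratio
    return nu, b, k


def fnl_rfpl_positive(beta_star, c1, c2, nu, b, k):
    """FNL and RFPL when every siRNA with beta_hat >= beta_star is called a hit.

    FNL  = F_{t(nu, b*c1)}(beta_star / k)
    RFPL = 1 - F_{t(nu, b*c2)}(beta_star / k)
    """
    x = beta_star / k
    fnl = float(nct.cdf(x, nu, b * c1))
    rfpl = float(nct.sf(x, nu, b * c2))  # sf is the survival function, 1 - cdf
    return fnl, rfpl


# Illustrative (assumed) inputs: 4 pairs, c1 = 3, c2 = 0.25, cutoff beta* = 2.
nu, b, k = paired_constants(4)
fnl, rfpl = fnl_rfpl_positive(beta_star=2.0, c1=3.0, c2=0.25, nu=nu, b=b, k=k)
print(f"nu={nu}, b={b:.3f}, k={k:.4f}, FNL={fnl:.4f}, RFPL={rfpl:.4f}")
```

Raising the cutoff $\beta^*$ in this sketch increases the FNL and decreases the RFPL, which is the trade-off that the testing-based criteria are designed to control.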
To select siRNAs with large positive effects, namely the siRNAs with $\beta \ge c_1$ ($c_1 > 0$), the following decision rule (namely Selection Criterion Ib in Table 1) achieves an FNL of $\alpha_1$.

Decision Rule Ib:
declare a hit, if $\hat{\beta} \ge k\,Q_{t(\nu, bc_1)}(\alpha_1)$;
not declare a hit, if $\hat{\beta} < k\,Q_{t(\nu, bc_1)}(\alpha_1)$,

where $Q_{t(\nu, bc_1)}(\alpha_1)$ is the $\alpha_1$ quantile of $t(\nu, bc_1)$. The reason is as follows. The FNR for Decision Rule Ib is the probability that $\hat{\beta} < k\,Q_{t(\nu, bc_1)}(\alpha_1)$ (i.e., not declaring a hit) given $\beta \ge c_1$. Hence,

$$FNR = \Pr\left(\hat{\beta} < k\,Q_{t(\nu, bc_1)}(\alpha_1) \mid \beta \ge c_1\right) = \Pr\left(T < Q_{t(\nu, bc_1)}(\alpha_1) \mid \beta \ge c_1\right) = \Pr\left(t(\nu, b\beta) < Q_{t(\nu, bc_1)}(\alpha_1) \mid \beta \ge c_1\right) \le \Pr\left(t(\nu, bc_1) < Q_{t(\nu, bc_1)}(\alpha_1)\right) = \alpha_1.$$

Therefore, FNL $= \alpha_1$ when using Selection Criterion Ib.

Using Decision Rule Ib, the RFPR with respect to (w.r.t.) $c_2$ and $c_1$ ($0 \le c_2 < c_1$) is

$$RFPR = \Pr\left(\hat{\beta} \ge k\,Q_{t(\nu, bc_1)}(\alpha_1) \mid \beta \le c_2\right) = \Pr\left(T \ge Q_{t(\nu, bc_1)}(\alpha_1) \mid \beta \le c_2\right) = \Pr\left(t(\nu, b\beta) \ge Q_{t(\nu, bc_1)}(\alpha_1) \mid \beta \le c_2\right) \le \Pr\left(t(\nu, bc_2) \ge Q_{t(\nu, bc_1)}(\alpha_1)\right) = 1 - F_{t(\nu, bc_2)}\left(Q_{t(\nu, bc_1)}(\alpha_1)\right),$$

where $F_{t(\nu, bc_2)}$ is the cumulative distribution function of $t(\nu, bc_2)$. Therefore, RFPL $= 1 - F_{t(\nu, bc_2)}\left(Q_{t(\nu, bc_1)}(\alpha_1)\right)$ when using Selection Criterion Ib. Similarly, we obtain Selection Criteria Ic, IIb and IIc and their FNLs and RFPLs listed in Table 1.

References

1. Zhang XD: A pair of new statistical parameters for quality control in RNA interference high-throughput screening assays. Genomics 2007, 89:552-561.
2. Zhang XD: A new method with flexible and balanced control of false negatives and false positives for hit selection in RNA interference high-throughput screening assays. Journal of Biomolecular Screening 2007, 12:645-655.
3. Zhang XD, Ferrer M, Espeseth AS, Marine SD, Stec EM, Crackower MA, Holder DJ, Heyse JF, Strulovici B: The use of strictly standardized mean difference for hit selection in primary RNA interference high-throughput screening experiments. Journal of Biomolecular Screening 2007, 12:497-509.