file - BioMed Central

Methods
Estimation of unpaired SSMD
SSMD has recently been proposed for measuring the magnitude of difference between two populations [1]. Let random variables $P_1$ and $P_2$ denote two populations of interest and $D$ denote the difference between $P_1$ and $P_2$. Suppose $P_1$ has mean $\mu_1$ and variance $\sigma_1^2$, and $P_2$ has mean $\mu_2$ and variance $\sigma_2^2$. The covariance between these two populations is $\sigma_{12}$. Then SSMD (denoted as $\beta$) is defined as the ratio of mean $\mu_D$ to standard deviation $\sigma_D$ of the difference $D$, namely
$$\beta = \frac{\mu_D}{\sigma_D} = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}}}.$$
When two populations are independent, we are interested in unpaired difference between the two populations. The SSMD corresponding to unpaired difference is called “unpaired SSMD”, which is $\beta = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}}$. If the two independent populations have equal variances (namely $\sigma_1^2 = \sigma_2^2 = \sigma^2$), then $\beta = \frac{\mu_1 - \mu_2}{\sqrt{2\sigma^2}}$.
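As a small illustration of the definition above, the following Python sketch evaluates the population SSMD from given means, variances and covariance. The function name `ssmd` and the numeric inputs are hypothetical and chosen only for illustration.

```python
import math

def ssmd(mu1, mu2, var1, var2, cov12=0.0):
    """Population SSMD: mean of D = P1 - P2 divided by its standard deviation."""
    var_d = var1 + var2 - 2.0 * cov12  # variance of the difference D
    if var_d <= 0:
        raise ValueError("variance of the difference must be positive")
    return (mu1 - mu2) / math.sqrt(var_d)

# Unpaired case: independent populations (cov12 = 0) with equal variances.
beta_unpaired = ssmd(mu1=3.0, mu2=1.0, var1=1.0, var2=1.0)
# Correlated populations: a positive covariance shrinks Var(D) and enlarges SSMD.
beta_correlated = ssmd(mu1=3.0, mu2=1.0, var1=1.0, var2=1.0, cov12=0.5)
print(beta_unpaired, beta_correlated)
```

With equal variances and zero covariance the first call reduces to $(\mu_1 - \mu_2)/\sqrt{2\sigma^2}$.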
SSMD defined above is a population parameter which needs to be estimated from observed samples. Suppose we have one sample (with sample size $n_1$, sample mean $\bar{X}_1$ and sample standard deviation $s_1$) from Population $P_1$ and another independent sample (with $n_2$, $\bar{X}_2$ and $s_2$) from Population $P_2$. Let $N = n_1 + n_2$. Zhang [1] derived maximum-likelihood estimate (MLE) and method-of-moment (MM) estimate of unpaired SSMD when the two compared groups have normal distributions with unequal variances. When the two compared groups have normal distributions with equal variance, the uniformly minimal variance unbiased estimate (UMVUE) of unpaired SSMD [2] is
$$\hat{\beta}_{UMVUE} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{2}{K}\left((n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\right)}}, \quad K = 2\left(\frac{\Gamma\left(\frac{N-2}{2}\right)}{\Gamma\left(\frac{N-3}{2}\right)}\right)^2 \approx N - 3.5 \text{ when } n_1 \ge 2,\ n_2 \ge 2.$$
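It is well known that if two random variables $X$ and $V$ are independently distributed with $X \sim N(\mu, 1)$ and $V \sim \chi^2(p)$, then the ratio $Y = \frac{X}{\sqrt{V/p}}$ has a noncentral t-distribution with $p$ degrees of freedom and noncentrality parameter $\mu$. We know that
$$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2,\ \left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)\sigma^2\right), \text{ namely } \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\sigma^2}} \sim N\left(\sqrt{\frac{2}{\frac{1}{n_1} + \frac{1}{n_2}}}\,\beta,\ 1\right),$$
and
$$\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{\sigma^2} \sim \chi^2(N - 2).$$
Therefore,
$$\frac{\left(\bar{X}_1 - \bar{X}_2\right) \Big/ \sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\sigma^2}}{\sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{\sigma^2 (N - 2)}}} \sim \text{noncentral } t\left(N - 2,\ \sqrt{\frac{2}{\frac{1}{n_1} + \frac{1}{n_2}}}\,\beta\right),$$
namely
$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\ \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{N - 2}}} \sim \text{noncentral } t\left(N - 2,\ \sqrt{\frac{2}{\frac{1}{n_1} + \frac{1}{n_2}}}\,\beta\right).$$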
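In primary HTS experiments, $n_1 = 1$ for most investigated siRNAs. $s_1^2$ does not exist when $n_1 = 1$. In this case, the UMVUE of unpaired SSMD is then
$$\hat{\beta}_{UMVUE} = \frac{X_{11} - \bar{X}_2}{\sqrt{\frac{2}{K}(n_2 - 1)s_2^2}}, \text{ where } K = 2\left(\frac{\Gamma\left(\frac{n_2 - 1}{2}\right)}{\Gamma\left(\frac{n_2 - 2}{2}\right)}\right)^2 \approx n_2 - 2.5.$$
We know that
$$X_{11} - \bar{X}_2 \sim N\left(\mu_1 - \mu_2,\ \left(1 + \tfrac{1}{n_2}\right)\sigma^2\right), \text{ namely } \frac{X_{11} - \bar{X}_2}{\sqrt{\left(1 + \frac{1}{n_2}\right)\sigma^2}} \sim N\left(\sqrt{\frac{2}{1 + \frac{1}{n_2}}}\,\beta,\ 1\right),$$
and
$$\frac{(n_2 - 1)s_2^2}{\sigma^2} \sim \chi^2(n_2 - 1).$$
Therefore,
$$T = \frac{X_{11} - \bar{X}_2}{\sqrt{1 + \frac{1}{n_2}}\ s_2} \sim \text{noncentral } t\left(n_2 - 1,\ \sqrt{\frac{2}{1 + \frac{1}{n_2}}}\,\beta\right).$$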
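If we set $(n_1 - 1)s_1^2 = 0$ when $n_1 = 1$, then for both $n_1 = 1$ and $n_1 \ge 2$ (i.e., $n_1 \ge 1$),
$$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\ \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{N - 2}}} \sim \text{noncentral } t(\nu, b\beta)$$
and
$$\hat{\beta}_{UMVUE} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{2}{K}\left((n_1 - 1)s_1^2 + (n_2 - 1)s_2^2\right)}} = kT,$$
where $K = 2\left(\frac{\Gamma\left(\frac{N-2}{2}\right)}{\Gamma\left(\frac{N-3}{2}\right)}\right)^2 \approx N - 3.5$, $\nu = N - 2$, $b = \sqrt{\frac{2}{\frac{1}{n_1} + \frac{1}{n_2}}}$ and $k = \sqrt{\frac{K}{2(N - 2)}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$.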
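The relation $\hat{\beta}_{UMVUE} = kT$ can be checked numerically. The following sketch computes $T$, $\nu$, $b$ and $k$ for the unpaired case, including $n_1 = 1$ with the convention $(n_1 - 1)s_1^2 = 0$. The function name `unpaired_t_and_k` and the toy data are assumptions for illustration; $k$ is written as the gamma ratio times $\sqrt{(1/n_1 + 1/n_2)/\nu}$, which is algebraically the same as $\sqrt{K(1/n_1 + 1/n_2)/(2(N-2))}$.

```python
import math
from statistics import mean, variance

def unpaired_t_and_k(x1, x2):
    """Return (T, nu, b, k) with beta_hat_UMVUE = k * T; x1 may hold one value."""
    n1, n2 = len(x1), len(x2)
    if n1 < 1 or n2 < 2 or n1 + n2 < 4:
        raise ValueError("need n1 >= 1, n2 >= 2 and N >= 4")
    ss1 = (n1 - 1) * variance(x1) if n1 > 1 else 0.0  # convention for n1 = 1
    ss2 = (n2 - 1) * variance(x2)
    nu = n1 + n2 - 2
    inv_n = 1.0 / n1 + 1.0 / n2
    t_stat = (mean(x1) - mean(x2)) / (math.sqrt(inv_n) * math.sqrt((ss1 + ss2) / nu))
    b = math.sqrt(2.0 / inv_n)
    # Gamma((N-2)/2) / Gamma((N-3)/2) on the log scale
    gamma_ratio = math.exp(math.lgamma(nu / 2) - math.lgamma((nu - 1) / 2))
    k = gamma_ratio * math.sqrt(inv_n / nu)
    return t_stat, nu, b, k

# Replicated siRNA (n1 = 3) and a single-replicate siRNA (n1 = 1), toy data.
t3, nu3, b3, k3 = unpaired_t_and_k([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
t1, nu1, b1, k1 = unpaired_t_and_k([3.0], [0.0, 1.0, 2.0])
print(k3 * t3, k1 * t1)
```

In both cases the product `k * T` agrees with the direct UMVUE formula.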
Estimation of paired SSMD
When two populations are correlated, we are usually interested in paired difference. The SSMD corresponding to paired difference is called “paired SSMD”. Suppose we observe $n$ pairs of samples, $(X_{11}, X_{21}), (X_{12}, X_{22}), \ldots, (X_{1n}, X_{2n})$ from populations $P_1$ and $P_2$ respectively. Let $D_j$ be the difference between the $j$th pair of samples, namely $D_j = X_{1j} - X_{2j}$. Let $\bar{D}$ and $s_D$ be the sample mean and sample standard deviation of $D$ respectively, namely $\bar{D} = \frac{1}{n}\sum_{j=1}^{n} D_j$ and $s_D^2 = \frac{1}{n-1}\sum_{j=1}^{n}(D_j - \bar{D})^2$. Assume that $D$ is normally distributed, namely $D \sim N(\mu_D, \sigma_D^2)$. Then the MM, MLE and UMVUE of the paired SSMD are
$$\hat{\beta}_{MM} = \frac{\bar{D}}{s_D}, \quad \hat{\beta}_{MLE} = \sqrt{\frac{n}{n-1}}\,\frac{\bar{D}}{s_D} \quad \text{and} \quad \hat{\beta}_{UMVUE} = \sqrt{\frac{2}{n-1}}\,\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\,\frac{\bar{D}}{s_D}$$
respectively. The proof of MM and MLE is trivial. The proof of UMVUE is as follows.
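For concreteness, the three paired estimates can be computed as in the sketch below. The function name `paired_ssmd_estimates` and the toy measurements are assumptions for illustration only.

```python
import math
from statistics import mean, stdev

def paired_ssmd_estimates(x1, x2):
    """Return the MM, MLE and UMVUE estimates of paired SSMD from n pairs."""
    if len(x1) != len(x2):
        raise ValueError("paired samples must have equal length")
    n = len(x1)
    if n < 3:
        raise ValueError("the UMVUE needs n >= 3 so that Gamma((n-2)/2) is finite")
    d = [a - b for a, b in zip(x1, x2)]  # paired differences D_j
    ratio = mean(d) / stdev(d)           # D-bar / s_D
    gamma_ratio = math.exp(math.lgamma((n - 1) / 2) - math.lgamma((n - 2) / 2))
    return {
        "MM": ratio,
        "MLE": math.sqrt(n / (n - 1)) * ratio,
        "UMVUE": math.sqrt(2.0 / (n - 1)) * gamma_ratio * ratio,
    }

# Toy data (hypothetical): five pairs with differences 2, 3, 1, 2, 2.
est = paired_ssmd_estimates([3.0, 5.0, 4.0, 6.0, 7.0], [1.0, 2.0, 3.0, 4.0, 5.0])
print(est)
```

The UMVUE is the smallest of the three in absolute value because its correction factor is below one.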
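When $D \sim N(\mu_D, \sigma_D^2)$, there are the following properties: $(\bar{D}, s_D^2)$ is a complete sufficient statistic of $(\mu_D, \sigma_D^2)$; $\bar{D}$ and $s_D^2$ are independent of each other; and $\frac{(n-1)s_D^2}{\sigma_D^2}$ is distributed as $\chi^2(n-1)$. Based on these properties, we have
$$E\left(\sqrt{\frac{\sigma_D^2}{(n-1)s_D^2}}\right) = \int_0^\infty x^{-\frac{1}{2}}\,\frac{1}{\Gamma\left(\frac{n-1}{2}\right)2^{\frac{n-1}{2}}}\,x^{\frac{n-1}{2}-1}e^{-\frac{x}{2}}\,dx = \frac{\Gamma\left(\frac{n-2}{2}\right)}{\sqrt{2}\,\Gamma\left(\frac{n-1}{2}\right)}$$
and
$$E\left(\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\sqrt{\frac{2}{(n-1)s_D^2}}\,\bar{D}\right) = \frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\,\frac{\sqrt{2}}{\sigma_D}\,E(\bar{D})\,E\left(\sqrt{\frac{\sigma_D^2}{(n-1)s_D^2}}\right) = \frac{\mu_D}{\sigma_D} = \beta.$$
Set $\hat{\beta} = \frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\sqrt{\frac{2}{(n-1)s_D^2}}\,\bar{D} = \sqrt{\frac{2}{n-1}}\,\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\,\frac{\bar{D}}{s_D}$. Then $\hat{\beta}$ is a function of the complete sufficient statistic $(\bar{D}, s_D^2)$ and $\hat{\beta}$ is an unbiased estimate of $\beta$. Thus, $\hat{\beta}$ is a UMVUE of $\beta$.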
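The key step of the proof is the closed form of $E\left(\sqrt{\sigma_D^2 / ((n-1)s_D^2)}\right)$, that is, the expectation of $X^{-1/2}$ for $X \sim \chi^2(n-1)$. The sketch below checks that identity numerically with a simple midpoint rule. The function name, the integration range and the step count are assumptions chosen for illustration, and the sketch is restricted to at least 3 degrees of freedom so that the integrand is bounded at zero.

```python
import math

def expected_inv_sqrt_chi2(df, upper=200.0, steps=100_000):
    """Midpoint-rule approximation of E[X**-0.5] for X ~ chi-square(df), df >= 3."""
    if df < 3:
        raise ValueError("this sketch assumes df >= 3")
    log_norm = math.lgamma(df / 2) + (df / 2) * math.log(2.0)
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        # x**-0.5 times the chi-square(df) density, evaluated on the log scale
        total += math.exp((df / 2 - 1.5) * math.log(x) - x / 2 - log_norm)
    return total * h

n = 10  # hypothetical number of pairs
numeric = expected_inv_sqrt_chi2(n - 1)
gamma_ratio = math.exp(math.lgamma((n - 1) / 2) - math.lgamma((n - 2) / 2))
closed_form = 1.0 / (math.sqrt(2.0) * gamma_ratio)
print(numeric, closed_form)
```

Multiplying the expectation by $\sqrt{2}\,\Gamma\left(\frac{n-1}{2}\right)/\Gamma\left(\frac{n-2}{2}\right)$ gives one, which is the unbiasedness used in the proof.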
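The $D_j$’s are independently distributed with $N(\mu_D, \sigma_D^2)$, so $\bar{D} \sim N\left(\mu_D, \frac{1}{n}\sigma_D^2\right)$, namely $\frac{\sqrt{n}\,\bar{D}}{\sigma_D} \sim N(\sqrt{n}\,\beta, 1)$. We also have $\frac{(n-1)s_D^2}{\sigma_D^2} \sim \chi^2(n-1)$. Therefore,
$$\frac{\sqrt{n}\,\bar{D}/\sigma_D}{\sqrt{\frac{(n-1)s_D^2}{\sigma_D^2 (n-1)}}} \sim \text{noncentral } t(n-1, \sqrt{n}\,\beta), \text{ namely } \frac{\sqrt{n}\,\bar{D}}{s_D} \sim \text{noncentral } t(n-1, \sqrt{n}\,\beta).$$
Let $T = \frac{\sqrt{n}\,\bar{D}}{s_D}$. Then $T \sim \text{noncentral } t(\nu, b\beta)$ and $\hat{\beta}_{UMVUE} = kT$ where $\nu = n - 1$, $b = \sqrt{n}$ and $k = \frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\sqrt{\frac{2}{n(n-1)}}$.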
Confidence interval of SSMD estimates
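Based on the estimates of SSMD and their distributions derived above, we have $T \sim \text{noncentral } t(\nu, b\beta)$ and $\hat{\beta}_{UMVUE} = kT$ for both unpaired and paired SSMDs, although $\nu = n_1 + n_2 - 2$, $b = \sqrt{\frac{2}{\frac{1}{n_1} + \frac{1}{n_2}}}$ and $k = \frac{\Gamma\left(\frac{n_1 + n_2 - 2}{2}\right)}{\Gamma\left(\frac{n_1 + n_2 - 3}{2}\right)}\sqrt{\frac{1}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$ in unpaired SSMD and $\nu = n - 1$, $b = \sqrt{n}$ and $k = \frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n-2}{2}\right)}\sqrt{\frac{2}{n(n-1)}}$ in paired SSMD. Let $F_{t(\nu, b\beta)}(\cdot)$ be the cumulative distribution function of noncentral $t(\nu, b\beta)$ and $T_{obs}$ be the observed value of $T$. Because $T \sim \text{noncentral } t(\nu, b\beta)$, we can find $\beta_L$ and $\beta_U$ such that $F_{t(\nu, b\beta_L)}(T_{obs}) = 1 - \frac{\alpha}{2}$ and $F_{t(\nu, b\beta_U)}(T_{obs}) = \frac{\alpha}{2}$. Then $(\beta_L, \beta_U)$ is a $1 - \alpha$ confidence interval of SSMD. The variance of a noncentral $t(\nu, \delta)$ is
$$\frac{\nu}{\nu - 2} + \left[\frac{\nu}{\nu - 2} - \frac{\nu}{2}\left(\frac{\Gamma\left(\frac{\nu - 1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\right)^2\right]\delta^2.$$
Thus, the variance of $T$ is
$$\frac{\nu}{\nu - 2} + \left[\frac{\nu}{\nu - 2} - \frac{\nu}{2}\left(\frac{\Gamma\left(\frac{\nu - 1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\right)^2\right](b\beta)^2.$$
Using $\hat{\beta}_{UMVUE} = kT$, the variance of $\hat{\beta}_{UMVUE}$ is
$$\mathrm{Var}(\hat{\beta}_{UMVUE}) = k^2\,\mathrm{Var}(T) = k^2\left\{\frac{\nu}{\nu - 2} + \left[\frac{\nu}{\nu - 2} - \frac{\nu}{2}\left(\frac{\Gamma\left(\frac{\nu - 1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\right)^2\right]b^2\beta^2\right\}.$$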



False negative rate and restricted false positive rate
Let us focus on the situation where we want to select siRNAs with large positive effects, namely the siRNAs with $\beta \ge c_1$ where $\beta$ denotes SSMD and $c_1$ is the preset lowest value for large effects. In this situation, the FNR is the probability that we conclude $\beta < c_1$ whereas actually $\beta \ge c_1$. The maximum FNR in a decision rule is called false negative level (FNL). Traditionally, the false positive rate is the probability that we conclude $\beta \ge c_1$ whereas actually $\beta < c_1$. However, in RNAi HTS experiments, scientists are usually interested in controlling the probability of concluding $\beta \ge c_1$ given $\beta \le c_2$ where $0 \le c_2 < c_1$. This probability is called restricted false positive rate (RFPR) [2, 3]. The maximum RFPR in a decision rule is called restricted false positive level (RFPL).
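For example, for an observed SSMD value $\beta_{obs}$ ($\beta_{obs} > 0$), if we select all the siRNAs with $\hat{\beta} \ge \beta_{obs}$ as hits, the FNR in this process (for $c_1 > c_2 \ge 0$) is
$$FNR = \Pr\left(\hat{\beta} < \beta_{obs} \mid \beta \ge c_1\right) = \Pr\left(T < \tfrac{\beta_{obs}}{k} \mid \beta \ge c_1\right) = \Pr\left(t(\nu, b\beta) < \tfrac{\beta_{obs}}{k} \mid \beta \ge c_1\right) \le F_{t(\nu, bc_1)}\left(\tfrac{\beta_{obs}}{k}\right)$$
and the RFPR in this process is
$$RFPR = \Pr\left(\hat{\beta} \ge \beta_{obs} \mid \beta \le c_2\right) = \Pr\left(T \ge \tfrac{\beta_{obs}}{k} \mid \beta \le c_2\right) = \Pr\left(t(\nu, b\beta) \ge \tfrac{\beta_{obs}}{k} \mid \beta \le c_2\right) \le 1 - F_{t(\nu, bc_2)}\left(\tfrac{\beta_{obs}}{k}\right);$$
thus, FNL and RFPL in this process are $F_{t(\nu, bc_1)}\left(\frac{\beta_{obs}}{k}\right)$ and $1 - F_{t(\nu, bc_2)}\left(\frac{\beta_{obs}}{k}\right)$ respectively.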
Similarly, for an observed SSMD value $\beta_{obs}$ ($\beta_{obs} < 0$), if we select all the siRNAs with $\hat{\beta} \le \beta_{obs}$ as hits, FNL and RFPL in this process (for $c_1 < c_2 \le 0$) are $1 - F_{t(\nu, bc_1)}\left(\frac{\beta_{obs}}{k}\right)$ and $F_{t(\nu, bc_2)}\left(\frac{\beta_{obs}}{k}\right)$ respectively. Consequently, when we use SSMD-based ranking method for selecting siRNAs with a large positive value, in the process that we select all the $m$ siRNAs with $\hat{\beta} \ge \beta^*$ ($\beta^* > 0$) as hits, the FNL and RFPL are $F_{t(\nu, bc_1)}\left(\frac{\beta^*}{k}\right)$ and $1 - F_{t(\nu, bc_2)}\left(\frac{\beta^*}{k}\right)$ respectively; when we use SSMD-based ranking method for selecting siRNAs with a large negative value, in the process that we select all the $m$ siRNAs with $\hat{\beta} \le \beta^*$ ($\beta^* < 0$) as hits, FNL and RFPL are $1 - F_{t(\nu, bc_1)}\left(\frac{\beta^*}{k}\right)$ and $F_{t(\nu, bc_2)}\left(\frac{\beta^*}{k}\right)$ respectively (Selection Criteria Ia and IIa in Table 1).
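The FNL and RFPL of a ranking cutoff can be evaluated as in the sketch below, which handles both the positive and the negative direction according to the sign of the cutoff. It is self-contained and uses a midpoint-integration approximation of the noncentral t CDF; the function names, the thresholds $c_1 = 2$ and $c_2 = 0.25$, and the other inputs are assumptions chosen only for illustration.

```python
import math

def nct_cdf(t, nu, delta, steps=4000):
    """Approximate P(T <= t) for T ~ noncentral t(nu, delta).

    Midpoint integration of E[Phi(t * S - delta)] with S = sqrt(chi2(nu) / nu).
    """
    upper = 1.0 + 12.0 / math.sqrt(nu)
    h = upper / steps
    log_c = math.log(2.0) + (nu / 2) * math.log(nu / 2) - math.lgamma(nu / 2)
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        dens = math.exp(log_c + (nu - 1) * math.log(s) - nu * s * s / 2)
        total += 0.5 * math.erfc(-(t * s - delta) / math.sqrt(2.0)) * dens
    return total * h

def ranking_error_levels(beta_star, nu, b, k, c1, c2):
    """Return (FNL, RFPL) when hits are the siRNAs beyond the cutoff beta_star.

    beta_star > 0: hits have beta_hat >= beta_star (expects c1 > c2 >= 0).
    beta_star < 0: hits have beta_hat <= beta_star (expects c1 < c2 <= 0).
    """
    t_cut = beta_star / k
    f1 = nct_cdf(t_cut, nu, b * c1)
    f2 = nct_cdf(t_cut, nu, b * c2)
    if beta_star > 0:
        return f1, 1.0 - f2
    return 1.0 - f1, f2

# Hypothetical design: n1 = n2 = 3, so nu = 4, b = sqrt(3), k about 0.4607.
nu, b, k = 4, math.sqrt(3.0), 0.4606589
fnl_pos, rfpl_pos = ranking_error_levels(1.0, nu, b, k, c1=2.0, c2=0.25)
fnl_neg, rfpl_neg = ranking_error_levels(-1.0, nu, b, k, c1=-2.0, c2=-0.25)
print(fnl_pos, rfpl_pos, fnl_neg, rfpl_neg)
```

By the symmetry of the noncentral t-distribution, mirroring the cutoff and both thresholds around zero leaves the two levels unchanged.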
 
 
 
 
Hit selection using SSMD-based testing methods
Based on $T \sim \text{noncentral } t(\nu, b\beta)$ and $\hat{\beta} = kT$, we can determine a selection criterion so that a specific FNL or RFPL can be achieved. To select siRNAs with large positive effects, namely the siRNAs with $\beta \ge c_1$ ($c_1 > 0$), the following decision rule (namely Selection Criterion Ib in Table 1) achieves FNL to be $\alpha_1$.
$$\text{Decision Rule Ib}: \begin{cases} \text{declare a hit}, & \text{if } \hat{\beta} \ge kQ_{t(\nu, bc_1)}(\alpha_1) \\ \text{not declare a hit}, & \text{if } \hat{\beta} < kQ_{t(\nu, bc_1)}(\alpha_1) \end{cases}$$
where $Q_{t(\nu, bc_1)}(\alpha_1)$ is the $\alpha_1$ quantile of $t(\nu, bc_1)$. The reason is as follows. The FNR for Decision Rule Ib is the probability that $\hat{\beta} < kQ_{t(\nu, bc_1)}(\alpha_1)$ (i.e., not declaring a hit) given $\beta \ge c_1$. Hence,
$$FNR = \Pr\left(\hat{\beta} < kQ_{t(\nu, bc_1)}(\alpha_1) \mid \beta \ge c_1\right) = \Pr\left(T < Q_{t(\nu, bc_1)}(\alpha_1) \mid \beta \ge c_1\right)$$
$$= \Pr\left(t(\nu, b\beta) < Q_{t(\nu, bc_1)}(\alpha_1) \mid \beta \ge c_1\right) \le \Pr\left(t(\nu, bc_1) < Q_{t(\nu, bc_1)}(\alpha_1)\right) = \alpha_1.$$
Therefore, FNL $= \alpha_1$ when using Selection Criterion Ib.
Using Decision Rule Ib, the RFPR with respect to (w.r.t.) $c_2$ and $c_1$ ($0 \le c_2 < c_1$) is
$$RFPR = \Pr\left(\hat{\beta} \ge kQ_{t(\nu, bc_1)}(\alpha_1) \mid \beta \le c_2\right) = \Pr\left(T \ge Q_{t(\nu, bc_1)}(\alpha_1) \mid \beta \le c_2\right)$$
$$= \Pr\left(t(\nu, b\beta) \ge Q_{t(\nu, bc_1)}(\alpha_1) \mid \beta \le c_2\right) \le \Pr\left(t(\nu, bc_2) \ge Q_{t(\nu, bc_1)}(\alpha_1)\right) = 1 - F_{t(\nu, bc_2)}\left(Q_{t(\nu, bc_1)}(\alpha_1)\right),$$
where $F_{t(\nu, bc_2)}(\cdot)$ is the cumulative distribution function of $t(\nu, bc_2)$. Therefore, RFPL $= 1 - F_{t(\nu, bc_2)}\left(Q_{t(\nu, bc_1)}(\alpha_1)\right)$ when using Selection Criterion Ib. Similarly, we obtain Selection Criteria Ic, IIb and IIc and their FNLs and RFPLs listed in Table 1.
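Decision Rule Ib can be sketched in code as follows. The example is self-contained: it approximates the noncentral t CDF by midpoint integration, inverts it by bisection to obtain the quantile $Q_{t(\nu, bc_1)}(\alpha_1)$, and then reports the hit decision, the threshold and the RFPL. The function names and all numeric inputs are assumptions for illustration, and the numerical routines are one possible choice rather than a prescribed method.

```python
import math

def nct_cdf(t, nu, delta, steps=4000):
    """Approximate P(T <= t) for T ~ noncentral t(nu, delta).

    Midpoint integration of E[Phi(t * S - delta)] with S = sqrt(chi2(nu) / nu).
    """
    upper = 1.0 + 12.0 / math.sqrt(nu)
    h = upper / steps
    log_c = math.log(2.0) + (nu / 2) * math.log(nu / 2) - math.lgamma(nu / 2)
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        dens = math.exp(log_c + (nu - 1) * math.log(s) - nu * s * s / 2)
        total += 0.5 * math.erfc(-(t * s - delta) / math.sqrt(2.0)) * dens
    return total * h

def nct_quantile(p, nu, delta):
    """Quantile of noncentral t(nu, delta): bisection on the increasing CDF."""
    lo = hi = delta
    step = 1.0
    while nct_cdf(lo, nu, delta) > p:
        lo -= step
        step *= 2.0
    step = 1.0
    while nct_cdf(hi, nu, delta) < p:
        hi += step
        step *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if nct_cdf(mid, nu, delta) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def decision_rule_ib(beta_hat, nu, b, k, c1, c2, alpha1):
    """Return (is_hit, threshold, rfpl) for Decision Rule Ib with FNL = alpha1."""
    q = nct_quantile(alpha1, nu, b * c1)   # alpha1 quantile of t(nu, b * c1)
    threshold = k * q                      # declare a hit if beta_hat >= threshold
    rfpl = 1.0 - nct_cdf(q, nu, b * c2)    # restricted false positive level
    return beta_hat >= threshold, threshold, rfpl

# Hypothetical inputs: nu = 4, b = sqrt(3), k about 0.4607, c1 = 2, c2 = 0.25.
nu, b, k = 4, math.sqrt(3.0), 0.4606589
is_hit, threshold, rfpl = decision_rule_ib(1.5, nu, b, k, c1=2.0, c2=0.25, alpha1=0.05)
print(is_hit, threshold, rfpl)
```

Lowering `alpha1` lowers the threshold, which reduces false negatives at the cost of a larger RFPL.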
References
1. Zhang XD: A pair of new statistical parameters for quality control in RNA
interference high-throughput screening assays. Genomics 2007, 89:552-561.
2. Zhang XD: A new method with flexible and balanced control of false negatives
and false positives for hit selection in RNA interference high-throughput
screening assays. Journal of Biomolecular Screening 2007, 12:645-655.
3. Zhang XD, Ferrer M, Espeseth AS, Marine SD, Stec EM, Crackower MA, Holder
DJ, Heyse JF, Strulovici B: The use of strictly standardized mean difference for
hit selection in primary RNA interference high-throughput screening
experiments. Journal of Biomolecular Screening 2007, 12:497-509.