Empirical Bayesian estimation of the disease transmission probability in multiple-vector-transfer designs

Christopher R. Bilder
Department of Statistics, University of Nebraska-Lincoln
chris@chrisbilder.com, http://www.chrisbilder.com

Joshua M. Tebbs
Department of Statistics, Kansas State University
tebbs@ksu.edu

ABSTRACT: Plant disease is responsible for major losses in agriculture throughout the world. Diseases are often spread by insect organisms that transmit a bacterium, virus, or other pathogen. To assess disease epidemics, plant pathologists often use multiple-vector-transfers. In such contexts, s > 1 insect vectors are moved from an infected source to each of n test plants. The purpose here is to present new estimators for p, the probability of pathogen transmission for an individual vector, motivated by an empirical Bayesian approach. In studying point estimate properties, one of our proposed estimators consistently results in a smaller bias and mean squared error than the maximum likelihood estimator (MLE) as proposed by Thompson (1962) and Swallow (1985). This bias reduction is frequently fivefold or more in optimal settings for the MLE. Furthermore, these estimators are easier to compute than the classical Bayes estimators proposed by Chaubey and Li (1995) and Chick (1996). Finally, our newly proposed empirical credible intervals possess the desirable property that the lower bound will never be negative.

Background
Plant disease is responsible for agricultural losses throughout the world
Diseases are often spread by insect vectors (e.g., aphids, leafhoppers, planthoppers, etc.)
[Photos: brown planthopper; whitebacked planthopper]
Vector-transfers are often used by plant pathologists wanting to estimate $p$, the probability of disease transmission for a single vector

Experimental set-up (group testing application)
  o Insect vectors are moved from an infected source to test plants in a greenhouse
  o Each enclosed test plant has $s$ insect vectors (assume a common group size, as recommended by Swallow (1985))
  o $n$ = number of test plants
  o $Y_i = 1$ if the $i$th test plant becomes infected; $Y_i = 0$ otherwise
  o Want to estimate $p$, the probability that an individual vector transmits the pathogen
Notation
  o $\{Y_i\}_{i=1}^{n}$ are i.i.d. Bernoulli($\theta$) random variables
  o $\theta = 1 - (1 - p)^s$ = probability that a plant becomes infected
  o $T = \sum_{i=1}^{n} Y_i$ = # of infected test plants ~ Binomial($n$, $\theta$)
  o MLE for $\theta$ is $\hat{\theta} = T/n$
  o MLE for $p$ is $\hat{p}_{MLE} = 1 - (1 - T/n)^{1/s}$

Past research
Chaubey and Li (1995) and Chick (1996) use a two-parameter beta prior for $p$ where hyperparameters are chosen a priori
  o Poor choices for the hyperparameters could cause the posterior distribution to be concentrated away from the truth
  o Multiple-vector-transfer experiments often use small $n$
Tebbs, Bilder, and Moser (2003) derive parametric empirical Bayes estimators using a one-parameter beta prior for $p$
PURPOSE HERE:
  o Develop new parametric empirical Bayes motivated estimators for $p$ which have smaller bias and mean squared error than those in Tebbs et al. (2003)
  o Form an interpretation for the hyperparameter
  o Examine frequentist coverage properties of credible intervals

Bayes Estimators
Prior distribution
  o One-parameter beta family: $f_P(p \mid \beta) = \beta (1 - p)^{\beta - 1}$ for $0 < p < 1$
  o Example with $\beta = 52.4$
    [Figure: prior density f(p) plotted against p, for p from 0.00 to 0.08]
  o Why one-parameter beta?
    - Values of $p$ are usually close to 0
    - MLE is positively biased
    - Computation and interpretation simplifications
Posterior distribution
  $f_{P|T}(p \mid t, \beta) = \frac{s\,\Gamma(n + \beta/s + 1)}{\Gamma(n - t + \beta/s)\,\Gamma(t + 1)} (1 - p)^{s(n - t) + \beta - 1} [1 - (1 - p)^s]^t$ for $0 < p < 1$
Bayes estimators for $p$
  o The Bayes estimator is the value of $a$ that minimizes the posterior expected loss $E_{P|T}[L(P, a) \mid T = t]$ for a loss function $L(p, a)$
  o With squared error loss, $L(p, a) = (p - a)^2$, this is the posterior mean of $P$:
    $\hat{p}_1 = 1 - \frac{\Gamma(n + \beta/s + 1)\,\Gamma(n - t + \beta/s + 1/s)}{\Gamma(n - t + \beta/s)\,\Gamma(n + \beta/s + 1 + 1/s)}$
  o Derived by Tebbs et al. (2003)
New estimator
  o Let $U = 1 - (1 - P)^s$ and note that $U \mid T = t \sim$ beta($t + 1$, $n - t + \beta/s$)
  o $E_{U|T}[(U - a)^2 \mid T = t]$ is minimized when $a = E(U \mid T = t) = (t + 1)/(n + \beta/s + 1)$
  o Since $P = 1 - (1 - U)^{1/s}$, substituting $E(U \mid T = t)$ for $U$ gives a new estimator
    $\hat{p}_2 = 1 - \left(1 - \frac{t + 1}{n + \beta/s + 1}\right)^{1/s}$
  o This is NOT necessarily a Bayes estimator
  o The estimator can also be derived another way:
    - Choose a beta(1, $\beta/s$) prior for $\theta$ and $L(\theta, a) = (\theta - a)^2$
    - The Bayes estimate for $\theta$ is $(t + 1)/(n + \beta/s + 1)$
    - Substitute the Bayes estimate for $\theta$ into $p = 1 - (1 - \theta)^{1/s}$

Empirical Bayes Estimators
Marginal distribution for $T$
  $f_T(t \mid \beta) = \frac{\beta\,\Gamma(n + 1)\,\Gamma(n - t + \beta/s)}{s\,\Gamma(n - t + 1)\,\Gamma(n + \beta/s + 1)}$ for $t = 0, 1, \ldots, n$
Marginal MLE for $\beta$
  o Maximize $f_T(t \mid \beta)$ with respect to $\beta$
  o Solve $\frac{\partial}{\partial \beta} \log f_T(t \mid \beta) = \beta^{-1} + s^{-1}[\Psi(n - t + \beta/s) - \Psi(n + \beta/s + 1)] = 0$ for $\beta$ to find $\hat{\beta}_{MLE}$, where $\Psi(\cdot)$ is the digamma function
Marginal MOM estimator for $\beta$
  o Set $E_T[T] = t$ to find that $\hat{\beta}_{MOM} = s(n - t)/t = s(1 - \hat{\theta})/\hat{\theta}$
  o Interpretation: $\hat{\beta}_{MOM}$ = (# of vectors per plant) x (non-infected prop.) / (infected prop.) = (group size) x (group failure prop.) / (group success prop.)
  o Choosing $s$ is important in order to prevent poor estimates of $p$; i.e., need to choose $s$ so that $\theta$ is not close to 0 or 1
    - Rule of thumb is to choose $s$ so that approximately 1/2 of the test plants are positive and 1/2 are negative
    - Substituting 1/2 for $\hat{\theta}$ into $\hat{\beta}_{MOM}$ leads to $\hat{\beta}_{MOM} = s$
    - Although one can think of $\theta = 1/2$ as a "target value," optimal group sizes may actually lead to an expected proportion of positive host plants anywhere from 0.2 to 0.8 (Swallow, 1985, 1987)

Estimators and Methods of Comparison
The estimators:
  o $\hat{p}_{EB1} = 1 - \frac{\Gamma(n + \hat{\beta}_{MLE}/s + 1)\,\Gamma(n - t + \hat{\beta}_{MLE}/s + 1/s)}{\Gamma(n - t + \hat{\beta}_{MLE}/s)\,\Gamma(n + \hat{\beta}_{MLE}/s + 1 + 1/s)}$
  o $\hat{p}_{EB2} = 1 - \left[1 - \frac{t + 1}{n + \hat{\beta}_{MLE}/s + 1}\right]^{1/s}$
  o $\hat{p}_{EB3} = 1 - \frac{\Gamma(n + \hat{\beta}_{MOM}/s + 1)\,\Gamma(n - t + \hat{\beta}_{MOM}/s + 1/s)}{\Gamma(n - t + \hat{\beta}_{MOM}/s)\,\Gamma(n + \hat{\beta}_{MOM}/s + 1 + 1/s)}$
  o $\hat{p}_{EB4} = 1 - \left[1 - \frac{t + 1}{n + \hat{\beta}_{MOM}/s + 1}\right]^{1/s}$, which reduces to $1 - \left[1 - \frac{t}{n}\right]^{1/s} = \hat{p}_{MLE}$ using $\hat{\beta}_{MOM} = s(n - t)/t$
Bias and MSE for an estimator $\hat{p}_i$
  o $Bias(\hat{p}_i) = \sum_{t=0}^{n} (\hat{p}_i - p) \binom{n}{t} [1 - (1 - p)^s]^t (1 - p)^{s(n - t)}$
  o $MSE(\hat{p}_i) = \sum_{t=0}^{n} (\hat{p}_i - p)^2 \binom{n}{t} [1 - (1 - p)^s]^t (1 - p)^{s(n - t)}$
  o $t = 0$ and $t = n$ are excluded from the calculations
    - By choosing an appropriate $s$, $t = 0$ and $t = n$ can be avoided
    - $\hat{\beta}_{MLE} = \infty$ for $t = 0$ and $\hat{\beta}_{MLE} = 0$ for $t = n$; if we used $t = 0 + \epsilon$ and $t = n - \epsilon$ for a small constant $\epsilon > 0$ (instead of $t = 0$ and $t = n$), the conclusions presented here do not change
Relative Bias = $Bias(\hat{p}_{MLE}) / Bias(\hat{p}_{EB,i})$
Relative Efficiency = $MSE(\hat{p}_{MLE}) / MSE(\hat{p}_{EB,i})$

Relative Bias and Relative Efficiency Plots
[Figure: four panels with curves for EB1, EB2, and EB3. Top row: relative bias versus p (0.00 to 0.10) for n=30 and s=10, and for n=80 and s=25. Bottom row: relative efficiency versus p (0.00 to 0.10) for the same two settings.]
Relative
bias or relative efficiency > 1 means $\hat{p}_{EB,i}$ is better than $\hat{p}_{MLE}$

Relative Bias for optimal MLE settings (Swallow, 1985)
Rows EB1, EB2, and EB3 give the relative bias of $\hat{p}_{EB1}$, $\hat{p}_{EB2}$, and $\hat{p}_{EB3}$; row s gives the group size used for each n.

p     Row      n=10   n=20   n=30   n=50   n=80  n=100  n=200
0.01  s          35     50     50     50     50     50     50
      EB1      0.96   0.93   0.93   0.94   0.94   0.94   0.94
      EB2      1.95   5.11   6.40   8.19   9.85  10.59  12.52
      EB3      0.63   0.52   0.51   0.51   0.51   0.50   0.50
0.02  s          19     35     45     50     50     50     50
      EB1      0.97   0.90   0.87   0.86   0.87   0.87   0.87
      EB2      2.21   4.84   4.87   5.16   5.64   5.83   6.24
      EB3      0.61   0.51   0.50   0.50   0.50   0.50   0.50
0.03  s          14     25     30     40     45     45     50
      EB1      0.97   0.90   0.88   0.84   0.82   0.83   0.81
      EB2      2.59   4.89   5.07   4.47   4.29   4.40   4.15
      EB3      0.58   0.51   0.50   0.50   0.50   0.50   0.50
0.05  s           9     16     20     25     25     30     30
      EB1      1.00   0.91   0.88   0.84   0.85   0.81   0.82
      EB2      3.06   5.14   4.96   4.49   4.88   4.07   4.29
      EB3      0.57   0.51   0.50   0.50   0.50   0.50   0.50
0.08  s           6     10     13     16     17     17     18
      EB1      1.04   0.94   0.89   0.85   0.84   0.84   0.83
      EB2      3.91   6.08   5.29   4.69   4.72   4.87   4.78
      EB3      0.56   0.51   0.50   0.50   0.50   0.50   0.50
0.10  s           5      8     10     12     13     14     14
      EB1      1.07   0.96   0.91   0.87   0.86   0.84   0.85
      EB2      4.77   7.00   6.11   5.45   5.31   4.89   5.22
      EB3      0.56   0.51   0.50   0.50   0.50   0.50   0.50

Relative Efficiency for optimal MLE settings (Swallow, 1985)
Rows EB1, EB2, and EB3 give the relative efficiency of $\hat{p}_{EB1}$, $\hat{p}_{EB2}$, and $\hat{p}_{EB3}$; row s gives the group size used for each n.

p     Row      n=10   n=20   n=30   n=50   n=80  n=100  n=200
0.01  s          35     50     50     50     50     50     50
      EB1      0.98   0.99   0.99   0.99   1.00   1.00   1.00
      EB2      1.15   1.08   1.05   1.03   1.02   1.02   1.01
      EB3      0.84   0.90   0.93   0.96   0.97   0.98   0.99
0.02  s          19     35     45     50     50     50     50
      EB1      0.98   0.97   0.97   0.98   0.99   0.99   1.00
      EB2      1.15   1.10   1.08   1.05   1.03   1.02   1.01
      EB3      0.84   0.87   0.89   0.92   0.95   0.96   0.98
0.03  s          14     25     30     40     45     45     50
      EB1      0.98   0.97   0.97   0.97   0.98   0.98   0.99
      EB2      1.15   1.10   1.08   1.06   1.04   1.03   1.02
      EB3      0.83   0.86   0.89   0.90   0.93   0.94   0.97
0.05  s           9     16     20     25     25     30     30
      EB1      0.98   0.97   0.97   0.97   0.98   0.98   0.99
      EB2      1.15   1.11   1.08   1.06   1.04   1.04   1.02
      EB3      0.83   0.86   0.87   0.90   0.94   0.93   0.97
0.08  s           6     10     13     16     17     17     18
      EB1      0.99   0.98   0.97   0.97   0.98   0.98   0.99
      EB2      1.15   1.10   1.09   1.06   1.04   1.03   1.02
      EB3      0.84   0.86   0.87   0.90   0.93   0.94   0.97
0.10  s           5      8     10     12     13     14     14
      EB1      1.00   0.98   0.98   0.98   0.98   0.98   0.99
      EB2      1.15   1.10   1.08   1.06   1.04   1.03   1.02
      EB3      0.85   0.87   0.88   0.91   0.93   0.94   0.97

Example
Ornaghi et al. (1999) study the effects of the "Mal Rio Cuarto" (MRC) virus and its spread by the Delphacodes kuscheli planthopper
  o The MRC virus is the most damaging maize virus in Argentina
  o It was desired to estimate $p$, the probability of disease transmission for a single vector
Female planthoppers in the 4th stage
  o $s = 7$ planthoppers per plant
  o $n = 24$ plants
  o $t = 3$ infected plants observed
The estimators:
  o $\hat{p}_{EB1} = 0.018857$ where $\hat{\beta}_{MLE} = 52.4$
  o $\hat{p}_{EB2} = 0.018596$ where $\hat{\beta}_{MLE} = 52.4$
  o $\hat{p}_{EB3} = 0.019165$ where $\hat{\beta}_{MOM} = 49$
  o $\hat{p}_{EB4} = \hat{p}_{MLE} = 0.018895$ where $\hat{\beta}_{MOM} = 49$

Summary
$\hat{p}_{EB2} = 1 - \left[1 - \frac{t + 1}{n + \hat{\beta}_{MLE}/s + 1}\right]^{1/s}$ results in a significant reduction of bias and a moderate reduction in MSE when compared to the MLE
Other estimators
  o The median and mode of $f_{P|T}(p \mid t, \beta)$ result in estimators which at times can be better than $\hat{p}_{EB2}$; however, $\hat{p}_{EB2}$ is much more often better in terms of bias and MSE
  o Burrows (1987) presents a frequentist estimator based on the MLE with a bias correction, which predominantly does better than all estimators examined here with respect to bias reduction; $\hat{p}_{EB2}$ and the Burrows estimator are much closer with regard to MSE reduction
  o There is no uniformly superior estimator!
Interval estimators for $p$
  o See Tebbs and Bilder (JABES, 2004) for frequentist interval comparisons
  o Equal-tail and highest posterior density region credible intervals usually have poorer coverage than a Wald confidence interval for $p$ (of course, the interpretation of the intervals differs)
    - Our examination did not take into account the variability in the estimate of $\beta$
  o The credible intervals possess the desirable property that the lower bound will never be negative (unlike the Wald interval)
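
As a rough check on the Example, the four empirical Bayes estimators can be computed with the Python standard library alone. This is a minimal sketch and not the authors' code: the function names, the hand-written `digamma` approximation, and the bisection bracket `[1e-8, 1e8]` are assumptions made here for illustration. The formulas are the posterior-mean and plug-in forms from the Bayes Estimators slides, combined with the marginal MLE and MOM estimates of the hyperparameter (written `b` in the code).

```python
import math


def digamma(x: float) -> float:
    """Digamma function for x > 0 (upward recurrence, then asymptotic series)."""
    acc = 0.0
    while x < 6.0:  # psi(x) = psi(x + 1) - 1/x
        acc -= 1.0 / x
        x += 1.0
    r = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - r * (1 / 12 - r * (1 / 120 - r / 252))


def beta_mom(n: int, s: int, t: int) -> float:
    """Method-of-moments hyperparameter estimate s(n - t)/t (needs t > 0)."""
    return s * (n - t) / t


def beta_mle(n: int, s: int, t: int) -> float:
    """Marginal MLE of the hyperparameter via bisection on the score (needs 0 < t < n)."""
    def score(b: float) -> float:
        return 1.0 / b + (digamma(n - t + b / s) - digamma(n + b / s + 1)) / s

    lo, hi = 1e-8, 1e8  # assumed bracket: score is positive at lo, negative at hi
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def p_mle(n: int, s: int, t: int) -> float:
    """Maximum likelihood estimate 1 - (1 - t/n)^(1/s)."""
    return 1 - (1 - t / n) ** (1 / s)


def p_posterior_mean(n: int, s: int, t: int, b: float) -> float:
    """Posterior mean of p under squared error loss (gamma-ratio form)."""
    a = n - t + b / s
    c = n + b / s + 1
    log_ratio = (math.lgamma(c) + math.lgamma(a + 1 / s)
                 - math.lgamma(a) - math.lgamma(c + 1 / s))
    return 1 - math.exp(log_ratio)


def p_plug_in(n: int, s: int, t: int, b: float) -> float:
    """Plug-in estimator 1 - [1 - (t + 1)/(n + b/s + 1)]^(1/s)."""
    return 1 - (1 - (t + 1) / (n + b / s + 1)) ** (1 / s)


# Data from the Example: s = 7 planthoppers per plant, n = 24 plants, t = 3 infected.
n, s, t = 24, 7, 3
b_mle, b_mom = beta_mle(n, s, t), beta_mom(n, s, t)
estimates = {
    "EB1": p_posterior_mean(n, s, t, b_mle),
    "EB2": p_plug_in(n, s, t, b_mle),
    "EB3": p_posterior_mean(n, s, t, b_mom),
    "EB4": p_plug_in(n, s, t, b_mom),
    "MLE": p_mle(n, s, t),
}
print(f"beta_MLE = {b_mle:.1f}, beta_MOM = {b_mom:.1f}")
for name, value in estimates.items():
    print(f"{name}: {value:.6f}")
```

With these inputs the printed values should agree with those listed in the Example to roughly the displayed precision. EB4 coincides with the MLE because the MOM hyperparameter makes the plug-in term (t + 1)/(n + b/s + 1) equal to t/n.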