POOLED TESTING FOR HIV SCREENING: CAPTURING THE DILUTION EFFECT

Lawrence M. Wein
Sloan School of Management, M.I.T.
and
Stefanos A. Zenios
Operations Research Center, M.I.T.

#3665-94-MSA March 1994

February 18, 1994

Abstract

We study pooled (or group) testing as a cost-effective alternative for screening donated blood products (sera) for HIV; rather than test each sample individually, this method combines various samples into a pool, and then tests the pool. A group testing policy specifies an initial pool size, and based on the HIV test result, either releases all samples in the pool for transfusion, discards all samples in the pool, or divides the pool into subpools for further testing. We develop a generalized linear model that relates the HIV test output to the antibody concentration in the pool, and hence captures the effect of pooling together different samples. The model is validated and simplified using data from a variety of field studies, and is embedded into a dynamic programming algorithm that derives a group testing policy to minimize the expected cost due to false negatives, false positives and testing. A simulation study shows that significant cost savings can be achieved without compromising the accuracy of the test. However, the efficacy of group testing depends upon the use of a classification rule (that is, discard the samples in the pool, transfuse them or test them further) that is dependent on pool size, a characteristic that is lacking in currently implemented pooled testing procedures.

In the first years of the AIDS epidemic, numerous instances of infection caused by blood transfusion were reported to the Center for Disease Control. The incidence indicated that the blood supply is a virtually frictionless pathway for spreading the epidemic, and the extent of the epidemic dictated that screening for AIDS at the individual level should be adopted. As a consequence, all infected blood donors would be identified and a measurably safer blood supply would be attained. Nevertheless, the cost of such a screening program is substantial, and many developing countries, particularly in Africa where the epidemic is spreading rapidly, are struggling to fight the disease on limited budgets.

Pooled testing is one potential way to reduce the monetary cost without compromising the accuracy of the tests. The rationale behind pooled testing is simple and intuitive: suppose we can pool the sera from ten (for example) individuals and test the pool using a single test. If the seroprevalence of HIV, which is the fraction of the population that is infected, is low enough, then there is a high probability that all ten individuals in the pool are HIV negative; in this case, we would learn from a single test what otherwise would be learned from ten individual tests. If, on the other hand, the test outcome is positive, then additional tests (either pooled or individual) would need to be carried out.

However, pooled testing has a possible shortcoming, the dilution effect: if the pool size is too large, then there is a serious concern that any HIV positive sera will be sufficiently diluted so as to become undetectable by the test. These false negatives can be extremely costly, particularly when pooled testing is employed to protect the blood supply.
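The rationale for pooling can be made concrete with a small calculation. The sketch below is illustrative only: it assumes that the samples in a pool are statistically independent and that the test is error-free, neither of which is claimed by the paper, and the function name and the 1% seroprevalence are hypothetical.

```python
def prob_pool_all_negative(prevalence: float, pool_size: int) -> float:
    """Probability that every sample in a pool is HIV negative.

    Assumes independent samples, each positive with probability `prevalence`
    (an illustrative assumption, not a result from the paper).
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie in [0, 1]")
    if pool_size < 1:
        raise ValueError("pool_size must be a positive integer")
    return (1.0 - prevalence) ** pool_size


# Hypothetical example: pools of ten at a seroprevalence of 1%.
p_clear = prob_pool_all_negative(0.01, 10)
print(f"P(all ten negative) = {p_clear:.3f}")  # about 0.904
```

Under these assumptions, roughly nine pools out of ten would be cleared by a single test, which is the source of the savings; the dilution effect discussed above is what this idealized calculation leaves out.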
Moreover, infected individuals exhibiting an unusually low level of antibody concentration are less likely to be detected when screened in pools. Consequently, the sensitivity of the test can be seriously affected. (Sensitivity is the probability of detecting a diseased individual, whereas specificity is the probability of detecting a healthy individual.)

Pooling methods have been evaluated in blood banking systems in several developing countries, including Zaire, Zimbabwe and Ecuador (see, for example, Cahoon-Young et al., Behets et al., Emmanuel et al., Kline et al. and Ledro-Monroy et al., all dated between 1988 and 1990). These field studies suggest that pooling methods (with group sizes as large as 80) may be as sensitive and specific as individual testing, and can result in cost savings from 5% to 80%, depending on the actual seroprevalence. On the other hand, the World Health Organization (WHO), concerned with the dilution effect and its consequences on the sensitivity of the test, is more conservative in its proposal: pools of size no greater than five are recommended in Tamashiro et al. (1993). The discrepancy between field studies and the recommendations of the WHO is an indication that the dilution effect is not well understood, and the bloodbanking community is in danger of either underestimating or overestimating its importance.

In his seminal paper, Dorfman (1943) showed how pooled testing can be employed to efficiently eliminate all defective items from certain large populations. The method found an immediate application in screening World War II draftees for syphilis, where the use of a group test resulted in considerable savings. The problem was researched aggressively in the 1950's and 60's (see, for example, Sobel and Groll 1959), and a large literature now exists on this topic; in the statistical literature, it is called group testing. Readers are referred to Johnson et al. (1991) for a survey, and to Litvak et al. (1992) for recent work that is motivated by HIV tests. However, nearly all existing studies concentrate on either perfect tests (i.e., with no misclassification errors) or imperfect tests with errors that are independent of the group size; two exceptions are Hwang (1976) and Burns and Mauro (1987), who assume that test sensitivity is a specified function of the group size. In addition, all studies neglect the actual test mechanism and, except for Arnold (1977), assume that the test outcome is binary rather than continuous.

In contrast, we attempt to explicitly model both the dilution effect and the continuous nature of the test outcome. Our task is greatly complicated by the fact that the test outcome, which is a continuous quantity called the optical density level, is only an indirect measurement of the unobservable antibody concentration. Starting from first principles, we derive a generalized linear model that explicitly captures the physical pooling of sera, and relates the optical density level to the antibody concentration of the tested sera. The model is validated using data from two existing dilution studies. A simplified version of the model is validated using data from an existing pooling study.

Traditional group testing problems consider a binary test outcome; if the test outcome is HIV negative, then the pool is released for transfusion, and if the test outcome is HIV positive, then the pool is divided into subpools for further testing. If the pool consists of a single sample, then the sample is discarded if the test is HIV positive. Hence, traditional group testing policies are characterized by the initial pool size and the resulting subpool configuration. Because we explicitly consider the continuous nature of the test outcome, our group testing policy must also develop a classification rule that is based on the test outcome: the pool is either deemed HIV positive (each sample in the pool is discarded) or HIV negative (each sample in the pool is released for transfusion), or the pool is divided into subpools for further testing. Our validated pooling model is embedded into a dynamic programming framework that derives the group testing policy that minimizes the expected cost due to testing, false positives and false negatives. Our proposed policy is tested on a Monte Carlo simulation model, and the results indicate that pooled testing, with a classification rule that explicitly depends on the pool size, can achieve significant cost savings over individual testing.

The paper is organized as follows. A preliminary description of ELISA, the biological assay used for HIV testing, is given in Section 1. The data used in the paper are described in Section 2. A generalized linear model is developed in Section 3, and is validated and simplified in Section 4 using the data described in Section 2. A dynamic programming framework for the group testing problem is developed and several policies are derived in Section 5. A simulation study is undertaken in Section 6, and concluding remarks appear in Section 7.

1. Serological Tests for AIDS

The human body reacts to microbial agents, like viruses, bacteria, parasites, etc., by producing antibodies. The antibodies recognize particular molecules on the surface of the infectious agent and bind to them. Such molecules are called antigens (antibody generators). Various immunological tests are designed to detect antibodies, thereby identifying the serological status of the individual. The Human Immunodeficiency Virus (HIV) is the pathological agent of the Acquired Immune Deficiency Syndrome (AIDS). Enzyme Linked Immunosorbent Assays (ELISA) for the HIV virus detect the anti-HIV antibodies and are frequently used for HIV screening. This section contains a brief nontechnical description of ELISAs.

A common configuration of ELISAs for HIV is the indirect assay pictured in Figure 1 (see George and Schochetman 1985 for more details). Antigens to HIV are attached to a solid phase support (usually wells). The patient's serum (or plasma) is diluted (at a dilution fixed by the manufacturer), added to the solid phase support and incubated for a time period. By the end of the incubation period, any antibodies to HIV that are present in the sample are attached to the antigens on the solid phase support. The well is then washed so that all unattached material is removed. The attached antibodies (immunoglobulins) become detectable when a secondary antibody, labeled by an enzyme, is added. When a substrate is finally added, an enzymatic reaction takes place producing a color change proportional to the amount of human HIV antibodies present. The ELISA test outcome is the optical density (OD) level, which quantifies this color change.

Figure 1: A schematic representation of the indirect ELISA.

Hence, the OD reading is determined by two factors: the concentration of the antibodies and their ability to bind on the antigens on the solid support (antibody affinity). If the OD recorded at the end of the process exceeds the critical value, or cutoff, recommended by the manufacturer, then the patient is declared HIV positive; otherwise the patient is declared HIV negative.

An alternative configuration of ELISAs is the competitive assay. Although the same types of antigens used in indirect ELISAs are attached to the solid phase support, this method differs from the indirect one in the detection mechanism. Enzyme labeled antibodies compete with the patient's antibodies for binding sites. The color change observed is inversely proportional to the concentration of HIV antibodies in the serum. If the recorded OD level exceeds the critical value set by the manufacturer, then the sample is declared HIV negative, otherwise positive.

We concentrate on indirect assays in this paper, since most of the commercially available HIV antibody detection kits are based on the indirect configuration of ELISAs. Nevertheless, the study of competitive assays is not any more difficult and only minor modifications, stated when necessary, are required.

ELISAs are inexpensive, easy to administer and very accurate; however, they have a shortcoming that stems from the test's indirect detection of HIV via the presence of antibodies. The patient's time of infection is followed by a window period during which the antibody concentration in the patient's serum is virtually undetectable. This period usually extends from three to nine months and results in false negatives. Assays for detecting HIV antibodies cannot identify such individuals; therefore, whenever individuals are referred to as HIV positive or negative, we are actually alluding to the presence or absence of HIV antibodies.

2. Description of the Data

We use individual testing data, dilution series data and pooled testing data obtained from three independent sources.

Individual Testing Data. OD readings for 4000 HIV negative and 3000 HIV positive individuals screened using four different assays were provided by the National HIV Reference Laboratory of Australia (Dax 1993). It is convenient to normalize the OD readings according to the equation

$$x = \frac{A - A_0}{A_m - A_0},$$

so that they fall between zero and one; $A_0$ and $A_m$ are the values of the minimum and maximum OD readings, respectively, recorded by the assay. The values of $A_0$ and $A_m$ vary by assay, and were chosen based on an analysis of the data and discussions with the data providers. For this data, we set $A_0 = 0$ and $A_m = 20$. The empirical distributions for the normalized OD readings for assay A (an indirect ELISA) are given in Figure 2(a). We observe that both the mean and variance are smaller for the HIV negative than for the HIV positive individuals. The relatively large spread in the HIV positive distribution is to be expected, since an individual's antibody concentration tends to systematically vary as the disease progresses; see George and Schochetman for details. The two populations are well separated, and therefore a critical value separating the outcomes into HIV positive and HIV negative can be selected.

Figure 2: (a) Empirical densities for the reactivity ratios (normalized OD readings) of 4000 HIV negative and 3000 HIV positive individuals, and (b) densities for the LOD values.

For reasons that will become clear in Section 4, we also consider the logit transformation of the normalized OD readings: $x \to \ln\left(\frac{x}{1-x}\right)$, which will be referred to as the LOD (logit OD) readings. The empirical densities of the LODs for the two populations are given in Figure 2(b). The sample mean and standard deviation are, respectively, $\mu_- = -4.82$ and $\sigma_- = 0.42$ for the HIV negative population, and $\mu_+ = 0.80$ and $\sigma_+ = 1.08$ for the HIV positive population.

Figure 3 displays the normal quantile plot for the empirical distributions in Figure 2(b); that is, the LOD readings are ranked in magnitude and are plotted against the standard normal quantiles. A straight line indicates normality of the data points. The quantile plot of the LOD readings is approximately linear for both populations. Deviations from normality are observed in the tails of the HIV negative population and the right tail of the HIV positive population. Most importantly, the normal approximation captures the left tail of the positive distribution, which contains the low OD readings that might become undetectable under pooled testing. On the other hand, the normal approximation to the HIV negative population underestimates the proportion of negative individuals with a relatively high OD reading, which might lead to an underestimation of false positives. Nevertheless, the false negatives, which are the overriding concern in pooled testing, will not be affected.

Figure 3: Normal quantiles for the LOD readings of (a) HIV negative and (b) HIV positive populations.

In the analytical model developed in Section 5, we assume that the LOD readings for the HIV negative and positive populations are normally distributed with respective means $\mu_- = -4.82$ and $\mu_+ = 0.80$, and respective standard deviations $\sigma_- = 0.42$ and $\sigma_+ = 1.08$.

Dilution Series Data. Dilution series data were obtained from the Caribbean Epidemiology Center (Hull 1991 and de Gourville 1992) and the National HIV Reference Laboratory of Australia (Dax). The purpose of both of these studies was to investigate the effect of dilution on the ability of ELISAs to detect reactive sera.

In the Caribbean Epidemiology Center (CAREC) study, ten positive sera were diluted sequentially in a fixed negative serum to produce a series of thirteen four-fold dilutions in the ratios $1:1, 1:4, 1:16, \ldots, 1:4^{12}$. A $1:n$ ratio means that $\frac{1}{n}$ of the pool consists of the positive sample. Each dilution was tested by two indirect ELISAs according to the manufacturer's instructions. Since the data from both assays yielded similar results, we only report the results from one of them. The raw data consists of 130 OD readings, one for each of the thirteen dilution levels of each of the ten positive samples. We used $A_0 = 0$ and $A_m = 15$ to normalize the OD readings.
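The data preparation described in this section, normalizing a raw OD reading, taking its logit, and interpreting a $1:d^j$ dilution ratio, can be sketched as follows. This is a minimal illustration rather than the authors' code; the function names are hypothetical, and the raw reading of 10 is a made-up value.

```python
import math


def normalize_od(a: float, a_min: float, a_max: float) -> float:
    """Map a raw OD reading to [0, 1] via x = (A - A_0) / (A_m - A_0)."""
    if a_max <= a_min:
        raise ValueError("a_max must exceed a_min")
    return (a - a_min) / (a_max - a_min)


def lod(x: float) -> float:
    """Logit of a normalized OD reading (the LOD reading)."""
    if not 0.0 < x < 1.0:
        raise ValueError("normalized OD must lie strictly between 0 and 1")
    return math.log(x / (1.0 - x))


def positive_fraction(d: int, j: int) -> float:
    """Fraction of a 1:d^j dilution that consists of the positive sample."""
    return 1.0 / d ** j


# Hypothetical raw reading of 10 on an assay with A_0 = 0 and A_m = 20.
x = normalize_od(10.0, 0.0, 20.0)
print(x, lod(x))                 # 0.5 0.0
print(positive_fraction(4, 2))   # 0.0625, i.e. a 1:16 dilution
```

Note that the logit is undefined at exactly 0 or 1, so readings at the assay's recorded minimum or maximum would need separate handling; the text does not say how such readings were treated.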
The National HIV Reference Laboratory of Australia (NRL) study sequentially diluted ten positive sera in a fixed negative serum to produce a series of 11 two-fold dilutions, with ratios $1:1, 1:2, 1:4, \ldots, 1:2^{10}$. These dilutions were tested on ten different assays. We analyzed the data from several of these assays and obtained very similar results, and hence will only report on the results from one assay. OD readings were normalized using $A_0 = 0$ and $A_m = 2$.

Pooled Testing Data. Cahoon-Young et al. (1992), which will be referred to hereafter as Cahoon-Young et al., tested 1280 specimens individually and in a series of nested pools. More specifically, the individual specimens were pooled to generate 128 pools of size 10; the pools of size 10 were then combined to form 64 pools of size 20, then 32 pools of size 40 and finally 16 pools of size 80. Twelve individuals were found to be HIV positive, and no more than one positive sample was found in any of the pools of size 80. The OD readings at every stage of this nested testing procedure were recorded.

Note that the dilution series studies by CAREC and NRL differ from Cahoon-Young et al.'s pooling study in one important respect: positive sera are diluted with varying amounts of the same negative sera in the dilution series studies, whereas individual sera are combined with a varying number of different individuals' sera in Cahoon-Young et al.'s pooling study. Hence, although the two dilution series can be used to assess the effect of dilution, the Cahoon-Young et al. study exactly mimics the pooling that would take place under group testing.

3. A Probabilistic Model for the Dilution Effect

When sera are screened in pools, the OD reading is determined by the concentration and affinity of the antibodies in the pool. The unobservability of the antibody concentration and affinity makes it very difficult to estimate the OD level of a pool. In this section, we develop a stochastic model that predicts the OD level of a sample as a function of the antibody concentration and the affinity. We then specialize the model to the setting of the CAREC and NRL dilution studies, and obtain a generalized linear model (GLM) that predicts the OD level of a pool as a function of the OD level of the HIV positive sample and the corresponding dilution level. The model is adapted in Section 4 to consider pools consisting of individual samples, as in the Cahoon-Young et al. study.

The dilution series data of CAREC and NRL essentially generate dose-response curves: the dose takes the form of a fixed positive sample diluted to various levels, and the response is simply the corresponding OD reading. Empirical dose-response curves typically exhibit sigmoid or hyperbolic behavior, and polynomial, general curvilinear, logit or probit regression models have been proposed to fit these curves. Before our model is introduced, it is worthwhile reviewing the traditional approach, and we focus on the logistic regression for concreteness. Let $Y_j$ denote the LOD reading (that is, the logit of the normalized OD reading, as in Figure 2(b)) of a particular HIV positive sample that is diluted to the ratio $1:d^j$, where $d$ is an integer ($d = 4$ for CAREC and $d = 2$ for NRL) and $j = 0, 1, \ldots, n$. The linear model hypothesizes that

$$Y_j = \alpha + \beta j + \epsilon_j, \tag{1}$$

where $\epsilon_j$ are iid normal random variables with zero mean. Although this model typically generates predicted values that coincide well with observed values, it exhibits considerable heteroscedasticity (state-dependent noise), and hence one of the model's basic assumptions is violated (see Tijssen 1985, Chapter 15).
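For readers who want to see the traditional approach of model (1) in executable form, the following sketch fits $\alpha$ and $\beta$ by ordinary least squares. It is an illustration under stated assumptions: the data are synthetic and noise-free, the values $\alpha = 2.0$ and $\beta = -1.2$ are hypothetical, and `fit_line` is a helper introduced here, not a routine from the paper.

```python
def fit_line(js, ys):
    """Ordinary least squares estimates (alpha, beta) for y_j = alpha + beta * j."""
    n = len(js)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    j_bar = sum(js) / n
    y_bar = sum(ys) / n
    s_jj = sum((j - j_bar) ** 2 for j in js)
    s_jy = sum((j - j_bar) * (y - y_bar) for j, y in zip(js, ys))
    beta = s_jy / s_jj
    alpha = y_bar - beta * j_bar
    return alpha, beta


# Synthetic LOD readings for dilution levels j = 0, ..., 12 (thirteen levels,
# as in the CAREC design); alpha = 2.0 and beta = -1.2 are made-up values.
levels = list(range(13))
lods = [2.0 - 1.2 * j for j in levels]
alpha_hat, beta_hat = fit_line(levels, lods)
print(alpha_hat, beta_hat)
```

Ordinary least squares weights every dilution level equally, which is exactly why the heteroscedasticity noted above is a problem for this model.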
Whereas the existing literature has taken a purely empirical approach to fitting dose-response curves, we develop a probabilistic model that is based upon a set of primitive assumptions regarding the behavior of the ELISA test and the pooled sera. Our analysis leads to a GLM for the dose-response curve: recall that while a linear regression model postulates that the expected value of the dependent variable is a linear function of the independent variable, a GLM assumes that a function of the expected value of the dependent variable is a linear function of the independent variable. Like model (1), our GLM captures the sigmoid nature of the dose-response curve. In addition, it proposes a particular variance function that, as will be seen in the next section, stabilizes the heterogeneous noise present in the CAREC and NRL data sets.

Our model is influenced by Fisher (1922), who developed a probabilistic model to estimate the number of bacteria in a sample of water or soil. The following eight assumptions are used to derive our model. We conferred with several specialists, and none of these assumptions generated any disputation; assumption 5 was the only one that appeared to stimulate any reflection.

A1. The number of HIV antigens, $n$, attached to a well satisfies $n > 10^6$.

A2. No more than one HIV antibody can bind to any antigen.

A3. The probability of an antibody binding to a specific antigen is small, independent for each antigen, and constant for antibodies from the same serum. The expected number of attached antibodies on a large collection of antigens is significant.

A4. A secondary antibody will bind to all attached primary antibodies.

A5. The normalized OD reading is linearly proportional to the number of attached antibodies.

A6. The expected number of attached antibodies is linearly proportional to the HIV antibody concentration. The proportionality constant can vary among different individuals due to differences in their binding properties (affinity).

A7. The antibody concentration of pooled sera is the weighted average of the individual antibody concentrations.

A8. Measurement errors are negligible.

If a competitive assay is employed, then assumption A5 is replaced by

A5c. The normalized OD reading is proportional to the number of attached secondary antibodies,

and the following assumption is introduced:

A9. The affinity of primary antibodies is higher than that of secondary antibodies; therefore, secondary antibodies will bind to antigens on the solid support if and only if there are not enough primary antibodies to saturate the binding sites.

In this case, the subsequent model derivation follows virtually unchanged.

Motivated by Fisher's analysis, we consider a well with $n$ antigens bound on it, and introduce a partition of the well into $k$ subwells indexed $s = 1, \ldots, k$. Assuming that the antigens are uniformly bound on the well, we let $m = \frac{n}{k}$ denote the number of antigens on every subwell. Suppose that serum $i$ is diluted with an HIV negative serum in the ratio $1:d^j$ and then added to the well. Since neither the concentration nor the affinity of antibodies is observable, and since their net effect is multiplicative in nature by A6, we will hereafter refer to the product of the antibody concentration and the affinity as the antibody concentration. Let $\rho_{i0}$ denote the antibody concentration for the undiluted serum $i$, $\rho_{i\infty}$ be the antibody concentration of the HIV negative serum, $\rho_{ij}$ be the antibody concentration for the diluted serum, and $p_{ij}$ be the binding probability for the antibodies in the diluted serum; note that none of these quantities are observable. Also, let $N_{ijs}$ represent the number of antibodies attached to the antigens on subwell $s$.

Our model development consists of three main steps: First, we find the probability distribution for $S_{ijk} = (N_{ij1} + \cdots + N_{ijk})/k$, which is the average number of attached antibodies per subwell, in terms of the unknown binding probability $p_{ij}$. Then we use assumption A6 to relate the unknown binding probability $p_{ij}$ to the unknown antibody concentration in the diluted serum. Finally, we use assumption A5 to relate the average antibodies per subwell $S_{ijk}$ to the normalized OD reading. Combining these three steps yields our basic model.

By our comments above, $N_{ij1}, \ldots, N_{ijk}$ are independent binomial random variables with size parameter $m$ and success probability $p_{ij}$. By A3 and the law of rare events, the binomial random variable can be approximated by a Poisson random variable with mean $p_{ij} m$:

$$P(N_{ijs} = l) \approx e^{-p_{ij} m} \frac{(p_{ij} m)^l}{l!}. \tag{2}$$

We can choose a sufficiently fine partition of the well such that

$$P(N_{ijs} > 1) = o(p_{ij} m), \tag{3}$$

implying

$$P(N_{ijs} > 0) \approx P(N_{ijs} = 1) \approx 1 - e^{-p_{ij} m} \tag{4}$$

and

$$1 - P(N_{ijs} = 1) = P(N_{ijs} = 0) + P(N_{ijs} > 1) \approx e^{-p_{ij} m} + o(p_{ij} m). \tag{5}$$

By the Central Limit Theorem and (3)-(5),

$$S_{ijk} \sim N\left(1 - e^{-p_{ij} m},\ \frac{1}{k} e^{-p_{ij} m}\left(1 - e^{-p_{ij} m}\right)\right) \quad \text{as } k \to \infty, \tag{6}$$

and hence

$$\ln\left(E(1 - S_{ijk})\right) = -p_{ij} m. \tag{7}$$

Since $\sum_{s=1}^{k} N_{ijs}$ is a binomial random variable, the distribution of $S_{ijk}$ is approximated well by the normal distribution even for relatively small values of $k$ (typically $k > 15$). In Section 4, the parameters of our resulting model are estimated from the data, and $k$ approximately equals 20.

Assumption A6 implies that

$$p_{ij} m = \frac{\rho_{ij}}{\kappa}, \tag{8}$$

where $\kappa$ is the proportionality constant of A6; this relates the binding probability to the antibody concentration in the diluted serum. Let $X_{ij}$ denote the normalized OD reading of the diluted sample. Then A5 implies that $X_{ij} = \gamma S_{ijk}$, where $\gamma$ is the constant of proportionality. The maximum normalized OD reading $X_{ij} = 1$ is attained when antibodies are bound on all subwells. In this case, $S_{ijk} = 1$ by (3)-(5), and so $\gamma = 1$ and

$$X_{ij} = S_{ijk}. \tag{9}$$

Combining equations (6) and (8)-(9) and taking logarithms gives the basic stochastic model relating the normalized OD reading to the antibody concentration:

$$X_{ij} \sim N\left(E[X_{ij}],\ \frac{1}{k} E[X_{ij}]\left(1 - E[X_{ij}]\right)\right), \tag{10}$$

where

$$\ln\left(-\ln\left(1 - E[X_{ij}]\right)\right) = \ln\left(\frac{\rho_{ij}}{\kappa}\right). \tag{11}$$

In Section 4, this basic model will be adapted to the Cahoon-Young et al. pooling study. Now we specialize this model to the setting of the CAREC and NRL dilution studies. By A7, the antibody concentration in the diluted serum is

$$\rho_{ij} = \frac{\rho_{i0} + (d^j - 1)\rho_{i\infty}}{d^j} \approx \frac{\rho_{i0}}{d^j}\left(1 + d^j \frac{\rho_{i\infty}}{\rho_{i0}}\right). \tag{12}$$

Combining equations (11) and (12), and using the approximation $\ln(1 + x) \approx 0$ (the adequacy of this first order Taylor series approximation will be investigated during the model validation phase) gives the GLM

$$\ln\left(-\ln\left(1 - E[X_{ij}]\right)\right) = \ln\left(\frac{\rho_{i0}}{\kappa}\right) - j \ln d. \tag{13}$$

The random component of the model is the normalized OD level $X_{ij}$ of the diluted sample, which is normally distributed by (10). The systematic component is the dilution level $j$. The link between the random component and the systematic component is of the form

$$g(\mu_{ij}) = \alpha_i + \beta j,$$

where $\mu_{ij} = E(X_{ij})$, the link function is $g: x \to \ln(-\ln(1 - x))$, which is the complementary log-log (cloglog), and $\alpha_i = \ln\left(\frac{\rho_{i0}}{\kappa}\right)$ and $\beta = -\ln d$ are constants. The second moment of the random component is given by

$$\mathrm{Var}(X_{ij}) = \phi \mu_{ij}(1 - \mu_{ij}), \tag{14}$$

which will be denoted by $V(\mu_{ij})$, where the dispersion parameter $\phi$ equals $\frac{1}{k}$.

We conclude this section with several remarks about the GLM (13). It captures the sigmoid nature of the dose-response curve via the cloglog link function. Other suitable sigmoid link functions are the logit and the probit. The logit link is defined by $g: x \to \ln\left(\frac{x}{1-x}\right)$ and the probit by $g: x \to \Phi^{-1}(x)$, where $\Phi$ is the cumulative standard normal distribution. To obtain the best fit of the data, GLMs for all three link functions will be considered in Section 4. Notice that if we replace the cloglog function by the logit function in (11) and then set the dilution level $j = 0$, this equation implies that individual LOD readings are normally distributed, which is consistent with the assumption that was discussed in Section 2, and that will be used in Section 5.

The y-axis intercept $\alpha_i$ corresponds to the cloglog of the normalized OD level of the original positive sample, and hence provides a measure of the antibody concentration of the positive sample; our model predicts that $\alpha_i = \ln(\rho_{i0}/\kappa)$, which is perfectly consistent with this interpretation. The slope $\beta$ is the marginal change in response due to a change in the dilution level $j$; hence, the predicted slope $\beta = -\ln d$ is also consistent with this interpretation. Notice that $\rho_{i0}$ and $\kappa$ are not observable, and $d$ is known. Hence, the slope of the model is known, but the y-intercept and the dispersion parameter $\phi$ are unknown. In the next section, we fit the GLM to the CAREC and NRL data sets. Unfortunately, it is very tedious to estimate $\alpha$ and $\phi$ for a fixed slope $\beta = -\ln d$. Therefore, we will estimate $\alpha$, $\beta$ and $\phi$ from the model, and see if $\hat{\beta}$ is close in value to the predicted slope $-\ln d$.

4. Model Validation

In this section, we attempt to validate the GLM developed in Section 3. In Subsection 4.1, the parameters of the model are estimated using the dilution series studies by CAREC and NRL. The GLM is adapted to the pooling setting and simplified in Subsection 4.2. The simplified pooling model is validated on Cahoon-Young et al.'s data in Subsection 4.3.

4.1 Model Fitting

The dilution series studies undertaken by CAREC and NRL generate normalized OD values $X_{ij}$ for positive sample $i$ and dilution level $j$. This data will be used to estimate the values of the generalized linear model parameters $\alpha$, $\beta$ and $\phi$. If the random mechanism embodied in the GLM is the true process by which the data are generated, then the maximum likelihood estimators can be obtained by iterative, weighted least squares. However, the normality assumption on $X_{ij}$ is approximate, and we can relax this assumption by employing the theory of quasilikelihood functions (see pp. 323-352 of McCullagh and Nelder 1989). This theory applies under the following four conditions that are satisfied by the GLM: (i) the range of possible normalized OD values $X_{ij}$ is known, (ii) the mean normalized OD level is specified as a function of the dilution level $j$, (iii) the variance of the normalized OD is specified as a function of the mean and (iv) the observations are statistically independent. Let us fix sample $i$, and let $X_i = (X_{i0}, X_{i1}, \ldots, X_{in})$ denote the random vector of normalized OD readings, and assume that the $X_{ij}$'s are independent with mean $\mu_{ij}$ and variance $V(\mu_{ij})$. Then the log-likelihood function for $X_i$ can be replaced by the quasilikelihood function

$$Q(\mu_i, x_i) = \sum_{j=0}^{n} \int_{x_{ij}}^{\mu_{ij}} \frac{x_{ij} - t}{V(t)}\, dt, \tag{15}$$

where $x_{ij}$ is the realization of $X_{ij}$. The maximum likelihood estimators (MLE) for the model parameters are then obtained by maximizing the quasilikelihood function

$$Q(\mu, x) = \sum_{i=1}^{10} Q(\mu_i, x_i) \tag{16}$$

for each study, where $n = 12$ for CAREC and $n = 10$ for NRL. We use the glm routine of S-plus (see Hastie and Pregibon 1992) to obtain MLEs for the dispersion parameter $\phi$, the slope $\beta$ and the y-intercept $\alpha_i$, $i = 1, \ldots, 10$ for each study. The predicted values $\hat{\mu}_{ij}$ are the mean response values predicted by the model, and are given by

$$\ln\left(\frac{\hat{\mu}_{ij}}{1 - \hat{\mu}_{ij}}\right) = \hat{\alpha}_i + \hat{\beta} j. \tag{17}$$
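The relationship between the linear predictor, the three candidate link functions and the variance function can be sketched in a few lines. This is a hedged illustration, not the S-plus glm fit used by the authors: the function names are hypothetical, and the parameter values in the example ($\alpha = 1.0$, $d = 4$, $\phi = 0.05$) are made up for demonstration.

```python
import math
from statistics import NormalDist


def inverse_link(eta: float, link: str) -> float:
    """Mean response mu implied by the linear predictor eta = alpha + beta * j."""
    if link == "cloglog":   # g(mu) = ln(-ln(1 - mu))
        return 1.0 - math.exp(-math.exp(eta))
    if link == "logit":     # g(mu) = ln(mu / (1 - mu))
        return 1.0 / (1.0 + math.exp(-eta))
    if link == "probit":    # g(mu) = Phi^{-1}(mu)
        return NormalDist().cdf(eta)
    raise ValueError(f"unknown link: {link}")


def predicted_mean(alpha: float, beta: float, j: int, link: str = "cloglog") -> float:
    """Predicted normalized OD level at dilution level j."""
    return inverse_link(alpha + beta * j, link)


def variance(mu: float, phi: float) -> float:
    """Variance function V(mu) = phi * mu * (1 - mu)."""
    return phi * mu * (1.0 - mu)


# Hypothetical intercept, with the slope fixed at the predicted value -ln(d), d = 4.
alpha, beta, phi = 1.0, -math.log(4), 0.05
for j in range(4):
    mu = predicted_mean(alpha, beta, j)
    print(j, round(mu, 4), round(variance(mu, phi), 5))
```

All three inverse links are increasing sigmoid functions of the linear predictor, so with a negative slope the predicted OD level falls toward zero as the dilution level grows, which is the dilution effect in its simplest form.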
The Pearson residuals are defined as

$$r_{ij} = \frac{x_{ij} - \hat{\mu}_{ij}}{\sqrt{V(\hat{\mu}_{ij})}}.$$

Since most GLM diagnostics are visual, we begin by analyzing the predicted vs. observed values and the Pearson residual plots. Our analysis is illustrated using the logit link function. The scatter plots and residual plots for CAREC and NRL are given in Figure 4. No significant deviations from the predicted fit (the dotted line) are observed in the scatter plots. The observations are fairly uniformly spread along the fitted line and no outliers are detected. It is worth noting that the traditional linear regression model (1) was also fit to the data, and severe heteroscedasticity was present; hence, the variance function $V(\mu)$ appears to stabilize the residuals, giving an almost uniform spread of the residuals around zero.

Moreover, as illustrated in Figure 5, the three link functions under consideration are [...]

Figure 5: Scatter plots for the response predicted by the logit, cloglog and probit link functions, for (a) CAREC and (b) NRL.

[...] observed values in Figure 6. The predictive power of the model is verified by observing that nearly all the observed points lie within the 95% confidence interval predicted by the model.

The visual diagnostics can be supplemented by quantitative diagnostics based on the residual deviance, which is twice the log-likelihood ratio. In particular, the goodness-of-fit can be assessed quantitatively by comparing the proposed GLM to the full model that has as many parameters as observations. The residual deviance is asymptotically $\chi^2$ with degrees of freedom given by the difference between the number of observations and the number of model parameters. Table 1 shows the 95% confidence intervals for $\beta$, the MLE for the dispersion parameter $\phi$, the degrees of freedom of the $\chi^2$ statistic, and the residual deviance. The quality of the fit is ascertained by the significance level of the $\chi^2$ statistic, which is greater than 0.999 in all cases.

Recall that equation (13) predicts that $\beta = -\ln d$. Since the 95% confidence intervals in Table 1 do not contain $-\ln 4 = -1.386$ for CAREC and $-\ln 2 = -0.693$ for NRL, we deduce that $-\ln d$ is not an accurate prediction of the slope that best fits the GLM. However, we can show that even the fixed slope GLM provides a reasonable description of the data by deriving an upper bound on the deviance of the fixed slope GLM. Let $\tilde{\mu}$ be the set of mean response values predicted using the suboptimal estimates; then the maximized quasilikelihood of the fixed slope GLM is bounded below by $Q(\tilde{\mu}, x)$, which is the quasilikelihood for the suboptimal estimates. Therefore, $Q_0 - Q(\tilde{\mu}, x)$ is an upper bound on the deviance of the fixed slope GLM. The upper bound for the logit model is 46.2412 for CAREC and 13.617 for NRL. Since the corresponding significance levels are 0.9996 and 1.0, respectively, we deduce that the fixed slope model provides a reasonable description of the data.

The quasilikelihood analysis can also provide useful insights into the mechanism of ELISAs and the reliability of the test outcomes. Consider the coefficient of variation of the normalized OD level $X_{ij}$, which is

$$\sqrt{\frac{\phi(1 - \mu_{ij})}{\mu_{ij}}}. \tag{18}$$

Equation (18), the values of $\hat{\phi}$ in Table 1, and Figure 4 suggest that the coefficient of variation is quite small for HIV positive samples that have not been substantially diluted, whereas it is much larger (near one) for HIV negative samples or highly diluted positive samples. This observation will be instrumental in the development of the Monte Carlo simulation model in Subsection 6.2. As a side remark, since $\phi = \frac{1}{k}$, the value of $k$ is at least 15, and hence the Central Limit Theorem employed in (6) is a reasonable approximation.

Two variants of the model were also considered in a failed attempt to obtain a more accurate fit. Recall that (13) was derived under the rather crude approximation $\ln(1 + x) \approx 0$. For high dilution levels, the assumption $\rho_{i0} \gg \rho_{i\infty} d^j$ (see equation (12)) underlying this approximation will be violated. Employing the second order Taylor approximation $\ln(1 + x) \approx x$, we tested the refined model

$$\ln\left(-\ln\left(1 - E[X_{ij}]\right)\right) = \ln\left(\frac{\rho_{i0}}{\kappa}\right) - j \ln d + d^j \frac{\rho_{i\infty}}{\rho_{i0}}. \tag{19}$$
Residual plots and scatter plots that are not displayed here indicate that for all three link functions, this refinement has very little effect on the quality of the model fit. We also tested an alternative variance function $V(\mu)$ on the GLM with all three link functions. The deviance for the original variance function is significantly smaller than the corresponding deviance for the alternative, indicating that the original variance function provides a better description of the data.

4.2. A Simplified Pooling Model

The complexity of the GLM, combined with the complexity of the traditional group testing problem (i.e., where no dilution effect is captured), leads to an analytically intractable model for analyzing group testing policies. Consequently, we propose two simplifications of the GLM that will allow for a tractable analysis in Section 5. This simplified model will be validated on the Cahoon-Young et al. data in Subsection 4.3.

The first simplification is rather bold: Motivated by our earlier observation that the dispersion estimates $\phi$ in Table 1 are small, we propose to ignore the variability in the ELISAs and employ a deterministic model that provides a one-to-one mapping between normalized OD readings and antibody concentrations. Although this assumption compromises the accuracy of the GLM for the sake of tractability, the discussion about the coefficient of variation below equation (18) suggests that the resulting deterministic model should be reasonably reliable for the dilution of HIV positive samples at practical dilution levels. Our assumption leads to the simplified dilution model

$$\ln\left(\frac{X_{ij}}{1 - X_{ij}}\right) = \ln\left(\frac{\rho_{i0}}{k}\right) - j \ln d, \qquad (20)$$

where the logit, rather than the cloglog, function is being employed.

Recall that this dilution model is appropriate for the CAREC and NRL dilution series, where a given positive serum is diluted with a varying amount of a fixed HIV negative serum, but is not appropriate for Cahoon-Young et al.'s data, which mirrors an actual pooled test. We now present a variant of our model that is appropriate for pooled testing. Let a pool consist of $n$ samples that have individual normalized OD levels $X_1, \ldots, X_n$ and antibody concentrations $\rho_1, \ldots, \rho_n$. Let $X$ and $\rho$ denote the normalized OD level and the antibody concentration, respectively, of the pool. The only difference between the dilution GLM and the pooling GLM is that equation (12) is replaced by $\rho = (\rho_1 + \cdots + \rho_n)/n$. Repeating the steps leading from (11) to (13) gives the pooling GLM (21), and ignoring the variability in this GLM leads to the pooling analog to the simplified dilution model (20):

$$\ln\left(\frac{X}{1 - X}\right) = \ln\left(\frac{\rho_1 + \cdots + \rho_n}{n}\right) - \ln k. \qquad (22)$$

Our second simplification is to replace the logarithm of the average antibody concentration in (22) by a linear approximation, which yields

$$\ln\left(\frac{X}{1 - X}\right) = \frac{\ln \rho_1 + \cdots + \ln \rho_n}{n} - \ln k. \qquad (23)$$

Since $\ln \rho_i = \ln\left(\frac{X_i}{1 - X_i}\right) + \ln k$ for $i = 1, \ldots, n$ by (22), we obtain

$$\ln\left(\frac{X}{1 - X}\right) = \frac{1}{n} \sum_{i=1}^{n} \ln\left(\frac{X_i}{1 - X_i}\right). \qquad (24)$$

The concavity of the logarithmic function implies that $\ln\left(\frac{\rho_1 + \cdots + \rho_n}{n}\right) \ge \frac{\ln \rho_1 + \cdots + \ln \rho_n}{n}$; hence, the linearity assumption is conservative in that it underestimates the normalized OD reading for pooled sera, and provides an upper bound on the number of false negatives that result as a consequence of pooled testing. Our simplified pooling model (24) has an interesting and tractable characterization: the LOD (logit of OD) reading for a pool is given by the average of the individual LOD readings.

4.3. Model Validation of the Simplified Pooling Model

The simplified pooling model (24) will be validated on the Cahoon-Young et al. data in this subsection. The randomness in the individual LOD readings, when embedded into the deterministic pooling model, leads to a random walk model for the pooling data. We fit the random walk model to the pooling data,
and use a nonparametric approach to test whether the increments of the random walk are independent and have zero mean.

Recall that Cahoon-Young et al. individually tested 1280 samples, and then generated nested pools of size 10, 20, 40 and 80. The total sample contained 12 HIV positive individuals, and none of the pools contained more than one positive sample; hence, 12 of the 16 pools of size 80 contained exactly one positive sample. We only test the random walk model on the pools that contain one positive sample. In fact, the Cahoon-Young et al. data is incomplete: Two of the 12 positive samples were not tested at all pool sizes. Hence, we will restrict ourselves to the ten samples that were tested at all pool sizes.

Let $i = 1, \ldots, 10$ index the ten positive samples. Let $Y_{i1}$ be the LOD reading for positive sample $i$, let $Y_{ij}$ be the LOD reading of the $j$th negative sample pooled with positive sample $i$, and let $Y_i^s$ denote the LOD reading for the pool of size $10 \times 2^s$ containing positive sample $i$; this pool consists of samples $Y_{i1}, \ldots, Y_{i,10 \times 2^s}$, where $s = 0, 1, 2, 3$. Notice that for fixed $i$, the $Y_{ij}$ are iid random variables for $j = 2, 3, \ldots, 80$, since they all correspond to LOD readings for negative sera; the assumption of normal LOD readings is not required in this subsection. Let $\mu = E(Y_{ij})$ for $j > 1$. For $i = 1, \ldots, 10$ and $s = 1, 2, 3$, the simplified pooling model (24) implies that

$$Y_i^s = \frac{1}{10 \times 2^s} \sum_{j=1}^{10 \times 2^s} Y_{ij} \qquad (25)$$

$$= \frac{1}{2} \cdot \frac{1}{10 \times 2^{s-1}} \sum_{j=1}^{10 \times 2^{s-1}} Y_{ij} + \frac{1}{10 \times 2^s} \sum_{j=10 \times 2^{s-1}+1}^{10 \times 2^s} Y_{ij} \qquad (26)$$

$$= \frac{1}{2} Y_i^{s-1} + \epsilon_{is}, \qquad (27)$$

where $\epsilon_{is} = \frac{1}{10 \times 2^s} \sum_{j=10 \times 2^{s-1}+1}^{10 \times 2^s} Y_{ij}$ is the noise term. Since $E(\epsilon_{is}) = \frac{1}{2}\mu$, equation (27) describes a three-step random walk $(Y_i^0, Y_i^1, Y_i^2, Y_i^3)$ that starts at $Y_i^0$, ends at $Y_i^3$ and has drift $\frac{1}{2}\mu$. The following proposition shows that the autoregressive process (27) can be transformed into an equivalent driftless random walk by defining $V_{is} = 2^s (Y_i^s - \mu)$.

Proposition 1 For $i = 1, \ldots, 10$, $(V_{i0}, V_{i1}, V_{i2}, V_{i3})$ is a three-step driftless random walk.

The proof can be found in the Appendix. The simplified pooling model (24) can be validated by establishing that the driftless random walk model describes the Cahoon-Young et al. pooling data. This validation is accomplished by verifying that the random increments are independent and have zero mean. Notice that the random walk only models the pooling effect as the pool size changes from 10 to 20, 20 to 40, and 40 to 80; it does not capture the pooling effect when the pool size changes from 1 to 10.

Hypothesis I: Independent Random Increments. A point estimate for $\mu$ is required to pursue a statistical analysis. The following proposition suggests that the estimator should be chosen to minimize the variance.

Proposition 2 Define $V_{is}(x) = 2^s (Y_i^s - x)$; then $\mu = \arg\min_x E(V_{is}(x) - V_{i0}(x))^2$.

Since the true variance is not available, we consider the estimator that minimizes the sample variance. An additional degree of freedom can be introduced by considering the weighted sum of squares

$$\sum_{i=1}^{10} \sum_{s=1}^{3} w_s (V_{is}(x) - V_{i0}(x))^2. \qquad (28)$$

The weights $w_1, w_2, w_3$ can be chosen in such a way that the variance of the estimator is minimized. The weighted least squares estimator $\hat{\mu}$ minimizing (28) is given by

$$\hat{\mu} = \frac{\sum_{i=1}^{10} \sum_{s=1}^{3} w_s (2^s - 1)(2^s Y_i^s - Y_i^0)}{10 \sum_{s=1}^{3} w_s (2^s - 1)^2}, \qquad (29)$$

and the following two propositions characterize its statistical properties.

Proposition 3 The estimator $\hat{\mu}$ is an unbiased estimator of $\mu$ for any choice of weights.

Proposition 4 The most efficient estimator for the pooling data is obtained for weights $w_1 = w_2 = 0$, $w_3 = 1$.

The proofs can be found in the Appendix. Let us now consider the process $\hat{V}_{is} = 2^s (Y_i^s - \hat{\mu})$. As the number of positive samples in the data tends to infinity, the Strong Law of Large Numbers implies that $\hat{\mu} \to \mu$ (and hence $\hat{V}_{is} \to V_{is}$) with probability one. Although we have only ten positive samples, we carry out a statistical hypothesis test on $\hat{V}_{is}$ and extrapolate the conclusions to $V_{is}$.

The independence assumption will be tested by studying the increments $\Delta V_{is} = \hat{V}_{is} - \hat{V}_{i,s-1}$ for $s = 1, 2, 3$. If the increments are independent, then with probability $\frac{1}{2}$ they are either above or below the median. We use the runs above or below the median test, which is a nonparametric procedure that is described in detail in Chapter 4 of Madansky (1988). The test is applied as follows. Let $u_{is}$ take the value 1 if $\Delta V_{is}$ is above the sample median and 0 otherwise. A run is a maximal consecutive set of random components having the same value. For example, if the sequence is 1001001110111100110100, then the runs are separated as follows: 1|00|1|00|111|0|1111|00|11|0|1|00, and the run count is 12. A low run count is an indication that observations below or above the median come together, whereas a high count is typical for median reverting behavior.

Note that under the independence hypothesis, the run count for the random vector $(\Delta V_{i1}, \Delta V_{i2}, \Delta V_{i3})$ is $k$ with probability $p_k$, where $p_1 = p_3 = \frac{1}{4}$ and $p_2 = \frac{1}{2}$. Define $T_k$ to be the number of random vectors in our data set with run count $k$, where $\sum_{k=1}^{3} T_k = 10$. Then

$$P(T_1, T_2, T_3) = \frac{10!}{T_1! \, T_2! \, T_3!} \, p_1^{T_1} p_2^{T_2} p_3^{T_3} \qquad (30)$$

is the significance of the observations under the null hypothesis. For our data, we calculated $T_1 = 2$, $T_2 = 5$, $T_3 = 3$; the significance of these observations is $P(T_1 = 2, T_2 = 5, T_3 = 3) = 0.077$. Although a p-value for this test cannot be obtained without ordering the state space, the probability of observing an outcome as extreme as this under the independence assumption is at least 0.077; hence, we cannot reject the independence assumption at the 95% significance level. In fact, the outcome (2,5,3) is, along with outcomes (3,5,2) and (2,6,2), the mode of the distribution of $(T_1, T_2, T_3)$.

Hypothesis II: Zero mean random increments. The 95% confidence intervals for the mean of $\Delta V_{i1}$, $\Delta V_{i2}$ and $\Delta V_{i3}$ are given by $0.2065 \pm 0.2529$, $0.08032 \pm 0.4961$ and $-0.3491 \pm 1.0637$, respectively. The zero mean hypothesis cannot be rejected, since zero is a common point of the three intervals.

In conclusion, the data support the hypothesis that the random walk (27) provides a realistic description, thus establishing their consistency with the simplified pooling model (24).

5. The Derivation of Group Testing Policies

In this section, we embed the simplified pooling model (24) into an optimization framework to find efficient pooled testing policies. Our objective is to minimize the expected weighted cost due to testing, false positives and false negatives. Suppose a pool of a specified size is tested and its LOD reading is determined. The decision maker has three options: stop testing and classify all individuals in the pool as HIV negative (and transfuse these samples), stop testing and classify all individuals in the pool as HIV positive (and discard these samples) or divide the pool into subpools for further testing. There are many ways to subdivide the pool under the third option, and we consider a quite general class of multistage policies employed by Arnold, where each subpool is of identical size. Our procedure can be modified slightly to allow unequal subpool sizes, as in the sequential procedure in Hwang (1984) and the T£(V) procedure in Litvak et al.

For a given initial pool size and subpool configuration, a dynamic programming algorithm is developed in Subsection 5.1 for finding the optimal policy within the class of possible multistage policies under consideration. Exhaustive search among alternative initial pool sizes and subpool configurations is required to find the cost minimizing policy. Structural properties of the optimal policy are investigated in Subsection 5.2.
Since the dynamic programming algorithm is computationally intensive and the resulting policy is difficult to implement, a procedure for deriving near optimal Dorfman policies is derived in Subsection 5.3.

5.1. The Dynamic Programming Formulation

We assume that the blood donor population is composed of two subpopulations: HIV negative (denoted $P_-$) and HIV positive ($P_+$). The LOD readings of $P_-$ ($P_+$, respectively) are assumed to be iid normal random variables with mean $\mu_-$ ($\mu_+$, respectively) and standard deviation $\sigma_-$ ($\sigma_+$, respectively). This assumption is consistent with the GLM and provides a reasonable fit to the data in Figure 2. The seroprevalence $\pi$ is the probability that a random donor is HIV positive.

Arnold's notation will be adopted to describe the multistage testing procedure. Consider a random sample of $n_1 = \prod_{j=1}^{N} a_j$ individuals from the donor population; the $a_j$'s dictate the subpool configuration. Blood sera is collected from all $n_1$ individuals. The individuals are indexed in such a way that the LOD reading for every sample is denoted by $\{Y(i_1, \ldots, i_N) : 1 \le i_1 \le a_1, \ldots, 1 \le i_N \le a_N\}$.

Figure 7: A simple example of a multistage group testing procedure.

The individuals are tested and classified according to the following multistage screening procedure (see Figure 7 for a simple example): Start by obtaining the LOD reading $Y_0$ of the pool composed of all $n_1$ individuals. Based on $Y_0$, decide whether all individuals in the pool can be classified as HIV negative or HIV positive; if so, stop testing. If not, then subdivide the population into $a_1$ subpopulations of size $n_2 = \prod_{j=2}^{N} a_j$, the first subpopulation consisting of all individuals with $i_1 = 1$, the second with $i_1 = 2$ and so on. Obtain the LOD readings $Y_1(1), \ldots, Y_1(a_1)$ for all subpopulations. For each subpopulation, decide whether all individuals should be classified as HIV negative or positive based on the pair $(Y_0, Y_1(j))$; if so, stop testing.
If not, subdivide those subpopulations that require further testing into $a_2$ subpopulations of size $n_3 = \prod_{j=3}^{N} a_j$. Continue in this vein until either all pools are deemed HIV negative or positive, or stage $N$, where individual testing is used, is reached. The testing scheme can be equivalently described by a rooted tree, where the nodes of the tree correspond to the different subgroups formed during the procedure.

According to the simplified pooling model (24), the LOD readings are inductively defined by $Y_N(i_1, \ldots, i_N) = Y(i_1, \ldots, i_N)$ and

$$Y_{j-1}(i_1, \ldots, i_{j-1}) = \frac{1}{a_j} \sum_{i_j=1}^{a_j} Y_j(i_1, \ldots, i_j). \qquad (31)$$

The state of the system at any stage of the screening process can be described by the LOD readings obtained thus far. If the pool that is currently being screened is composed of all individuals with the first $j$ indices given by $i_1, \ldots, i_j$, then the state is given by $(Y_0, Y_1(i_1), \ldots, Y_j(i_1, \ldots, i_j))$. To simplify notation, we shorten $Y_j(i_1, \ldots, i_j)$ to $Y_j$ for $j = 1, \ldots, N$, and denote the state of the system by $S_j = (Y_0, \ldots, Y_j)$ for $j = 0, \ldots, N$.

If the current state of the system is $S_j$ and $j \le N - 2$, then three decisions are possible: Either declare all individuals in the pool as negative and stop testing, declare all individuals in the pool as positive and stop testing, or subdivide the pool into $a_{j+1}$ subpools of size $n_{j+2}$ and continue testing. Under the first decision, a false negative cost $c_{FN}$ is incurred for each HIV positive individual in the pool, and under the second decision, a false positive cost $c_{FP}$ is incurred for each HIV negative individual in the pool. Under the third decision, a testing cost $c(n_{j+2})$ is incurred for each of the $a_{j+1}$ subpools. By defining $n_{N+1} = 1$, we can adopt the same notation for the decision at stage $N - 1$. At stage $N$ (individual testing), individuals are classified as HIV positive or negative, and the false positive cost $c_{FP}$ or the false negative cost $c_{FN}$ is incurred for any individuals that are misclassified.

If we let $J_j(S_j)$ denote the optimal cost for stages $j$ through $N$ when in state $S_j$ at stage $j$, then the dynamic programming algorithm is defined inductively by

$$J_N(S_N) = \min\{c_{FP} P(Y_N \in P_- \mid S_N), \; c_{FN} P(Y_N \in P_+ \mid S_N)\}, \qquad (32)$$

$$J_j(S_j) = \min\Big\{a_{j+1}\big(c(n_{j+2}) + E[J_{j+1}(S_{j+1}) \mid S_j]\big), \; c_{FN} \sum_{i_{j+1}=1}^{a_{j+1}} \cdots \sum_{i_N=1}^{a_N} P(Y_N(i_1, \ldots, i_N) \in P_+ \mid S_j), \; c_{FP} \sum_{i_{j+1}=1}^{a_{j+1}} \cdots \sum_{i_N=1}^{a_N} P(Y_N(i_1, \ldots, i_N) \in P_- \mid S_j)\Big\} \qquad (33)$$

for $j = 0, \ldots, N - 1$. Because the individual LOD readings of each sample in a pool are iid random variables, equation (33) can be simplified to

$$J_j(S_j) = \min\Big\{a_{j+1}\big(c(n_{j+2}) + E[J_{j+1}(S_{j+1}) \mid S_j]\big), \; n_{j+1} c_{FN} P(\tilde{Y}_j \in P_+ \mid S_j), \; n_{j+1} c_{FP} P(\tilde{Y}_j \in P_- \mid S_j)\Big\} \qquad (34)$$

for $j = 0, \ldots, N - 1$, where $\tilde{Y}_j$ is a random variable denoting the individual LOD reading of a generic member of the pool at stage $j$.

Since the state of the system at stage $j$ is given by the LOD readings obtained through stage $j$, the dimensionality of the state space grows as the dynamic programming algorithm proceeds; hence, the algorithm in (32) and (34) cannot be efficiently used for numerical calculations. The following proposition shows that we need only keep track of the most recent LOD reading.

Proposition 5 The state of the system at every stage can be adequately described by the latest LOD reading.

We prove this proposition using Corollary 2 of Arnold, which is stated here for completeness.

Corollary 2 (Arnold, 1977) The conditional distribution of $Y_{j+1}$ given $S_j$ is the same as the conditional distribution of $Y_{j+1}$ given $Y_j$.

The following lemma, whose proof is given in the Appendix, is also needed:

Lemma 1 There exists a version of the conditional probability $P(\tilde{Y}_j \in P_+ \mid S_j)$ that depends on $S_j$ only through $Y_j$; the same is true for $P(\tilde{Y}_j \in P_- \mid S_j)$.

Proof of Proposition 5. The proposition can be proved by induction on the dynamic programming algorithm. The proposition is true for $j = N$; assume that it is true for $j = k$, so that $J_k(S_k) = J_k(Y_k)$. By Lemma 1, the two classification terms in (34) at stage $k - 1$ depend on $S_{k-1}$ only through $Y_{k-1}$, and by Corollary 2 the same holds for $E[J_k(Y_k) \mid S_{k-1}]$; hence, the proposition follows.

Therefore, by (34) we can replace the state $S_j$ by the latest reading $Y_j$, and the optimal decision rule can be described by a set of critical regions $R_j^-$ and $R_j^+$ for $j = 0, \ldots, N$ such that if $Y_j \in R_j^-$ then all samples in the pool are classified as HIV negative and released for transfusion, if $Y_j \in R_j^+$ then all samples are classified as HIV positive and discarded, and otherwise additional tests are carried out. The critical regions are defined by

$$R_N^- = \left\{Y_N : \frac{f_+(Y_N)}{f_-(Y_N)} < \frac{c_{FP}(1 - \pi)}{c_{FN} \pi}\right\}, \qquad (35)$$

$$R_N^+ = \left\{Y_N : \frac{f_+(Y_N)}{f_-(Y_N)} \ge \frac{c_{FP}(1 - \pi)}{c_{FN} \pi}\right\}, \qquad (36)$$

and, for $j = 0, \ldots, N - 1$,

$$R_j^- = \Big\{Y_j : n_{j+1} c_{FN} P(\tilde{Y}_j \in P_+ \mid Y_j) \le \min\big\{n_{j+1} c_{FP} P(\tilde{Y}_j \in P_- \mid Y_j), \; a_{j+1}\big(c(n_{j+2}) + E[J_{j+1}(Y_{j+1}) \mid Y_j]\big)\big\}\Big\}, \qquad (37)$$

$$R_j^+ = \Big\{Y_j : n_{j+1} c_{FP} P(\tilde{Y}_j \in P_- \mid Y_j) < \min\big\{n_{j+1} c_{FN} P(\tilde{Y}_j \in P_+ \mid Y_j), \; a_{j+1}\big(c(n_{j+2}) + E[J_{j+1}(Y_{j+1}) \mid Y_j]\big)\big\}\Big\}, \qquad (38)$$

where $f_-$ and $f_+$ denote the normal densities for the HIV negative and HIV positive populations, respectively. Notice that the critical region $R_N^+$ maximizes the power for a simple hypothesis test. Therefore, by the Neyman-Pearson lemma, the proposed classification policy at the individual testing stage not only minimizes the cost for the particular choices of $c_{FN}$ and $c_{FP}$, it also minimizes the type II error (false positive) for a fixed level of type I error (false negative).

5.2. Structural Properties of the Optimal Policy

Intuitively, one might expect that the optimal classification policy could be characterized by a set of constants $\{c_j^-, c_j^+ : 0 \le j \le N\}$ (where $c_N^- = c_N^+$) such that $R_j^- = \{Y_j : Y_j \le c_j^-\}$ and $R_j^+ = \{Y_j : Y_j > c_j^+\}$. Such a classification policy for a generalized group testing procedure will be called a cutoff policy. Arnold obtained sufficient conditions ensuring the optimality of a cutoff policy for a simpler group testing problem that possesses only two possible classifications. Here, we extend his results to the model in Subsection 5.1.

The following monotonicity notion was introduced in Arnold: The density $g_j(y_j)$ of the LOD reading $Y_j$ has the Mon($j$) property if for all nonincreasing functions $h(y)$, the conditional expectation $E(h(Y_j) \mid Y_{j-1} = s)$ is monotone nonincreasing in $s$. The following proposition, which is proved in the Appendix, provides sufficient conditions for the optimality of the cutoff policy.

Proposition 6 A cutoff policy is optimal if the likelihood ratio $f_+/f_-$ is monotone nondecreasing and the density $g_j(y)$ of $Y_j$ has the Mon($j$) property for all $j$.

The definition of Mon($j$) cannot be used for testing whether a density has the required property. Instead, the following proposition can be employed (see the Appendix for a proof).

Proposition 7 The density $g_j(y)$ has the Mon($j$) property if for all $j$ and all $y$, $P(Y_j \le y \mid Y_{j-1} = s)$ is a nonincreasing function of $s$.

It turns out that neither of the sufficient conditions in Proposition 6 are satisfied by the normal density with our data. Recall that $f_-$ is normal with mean $\mu_- = -4.82$ and $\sigma_- = 0.42$, and $f_+$ is normal with mean $\mu_+ = 0.8$ and $\sigma_+ = 1.08$. Because the variation in $P_+$ is larger than the variation in $P_-$, the likelihood ratio $f_+/f_-$ is not monotone. The individual LOD readings are distributed as a mixture of two normals. By (31), $Y_{N-1}$ is the average of a collection of iid random variables, each of which is a mixture of normals. It can be shown that the distribution of $Y_N \mid Y_{N-1}$ is a more complex mixture of normals that does not satisfy the Mon($j$) property for our parameter values. Although neither condition in Proposition 6 is satisfied by our data, the optimal cutoff policy performed nearly as well as the overall optimal policy in the computational study described in the next section.

5.3. The Dorfman Policy

In the Dorfman procedure, a pool of a specified size is tested, after which either every sample in the pool is deemed HIV negative, or every sample in the pool undergoes individual testing. Due to its simplicity and effectiveness, this procedure is frequently used in practice for mass screening programs. In particular, recent field studies (e.g., Behets et al., Emmanuel et al. and Kline et al.) demonstrate that such procedures can be used to reduce the cost of HIV screening in developing countries. The complexity of general group testing strategies, such as the one described in Subsection 5.1, renders them more vulnerable to human error. Therefore, the improvement achieved by the more complex testing strategy could be offset by the human errors incurred during implementation.

Using the dynamic program of Subsection 5.1, we can obtain the optimal decision rule for a Dorfman procedure with pool size $n$ by setting $N = 1$ and $n_1 = a_1 = n$, and disallowing the option of discarding a pool that contains more than one sample. However, numerically solving the dynamic programming algorithm requires a discretization of the state space of LOD readings, and can be cumbersome and computationally intensive. Therefore, we propose a relatively simple method for obtaining a near optimal Dorfman policy. Our method relies on two simplifying assumptions: (i) a cutoff policy is employed (that is, the individuals in the pool are transfused if the LOD reading of the pool is below a certain threshold, and each sample in the pool is individually tested otherwise; in the latter case, a second threshold is used to classify individuals as HIV positive or negative), and (ii) the outcome of the pooled test is used to calculate the posterior seroprevalence, but not the posterior conditional densities. The first assumption is not very restrictive, particularly since cutoff policies are the only policies that are apt to be adopted in practice. However, the second assumption is clearly not making the most efficient use of the pooled test reading.

Consider a Dorfman policy of pool size $n$ applied in a population with seroprevalence $\pi$. Let $Y_i$ be the LOD reading of the $i$th individual and $Y^p$ be the LOD reading of the pool. Suppose that $x$ is the cutoff employed for individual testing and $z$ is the cutoff for group testing. If $f_-$ and $f_+$ are the probability densities for the LOD readings for $P_-$ and $P_+$, respectively, then the cost of testing an individual is

$$C_i(\pi, x) = c(1) + c_{FP}(1 - \pi) \int_x^{+\infty} f_-(y)\,dy + c_{FN}\, \pi \int_{-\infty}^{x} f_+(y)\,dy. \qquad (39)$$

Let $A_{kn}$ be the event that $k$ out of the $n$ individuals are HIV positive. The cost incurred at the group testing stage of the process is

$$C_g(z) = c(n) + c_{FN} \sum_{k=1}^{n} P(A_{kn}) P(Y^p < z \mid A_{kn})\, k, \qquad (40)$$

where the first term is the testing cost and the second is the misclassification cost of false negatives. Since $Y_1, \ldots, Y_n$ are iid, it follows that $P(Y_i \in P_+ \mid A_{kn}) = k/n$. Under our second assumption, the cost incurred at the second stage of the testing procedure is

$$C_{ig}(z, x) = \sum_{k=0}^{n} P(Y^p > z \mid A_{kn}) P(A_{kn})\, n\, C_i(k/n, x); \qquad (41)$$

hence, the cost per individual is

$$C(n, x, z) = \frac{1}{n}\left[C_g(z) + C_{ig}(z, x)\right] = \frac{c(n)}{n} + \frac{1}{n} \sum_{k=0}^{n} \binom{n}{k} \pi^k (1 - \pi)^{n-k} \left[c_{FN}\, k\, P(Y^p < z \mid A_{kn}) + n\, C_i(k/n, x)\, P(Y^p > z \mid A_{kn})\right].$$

The proposed Dorfman procedure is given by the solution to $\min_{n,x,z} C(n, x, z)$, which can be solved in two stages: Obtain the optimal cutoffs $x$ and $z$ for every $n$, and then search among the integers for the optimal group size $n$. Under our probabilistic assumptions, the LOD reading of a pool composed of $k$ HIV positive and $n - k$ HIV negative individuals is normally distributed with mean $\mu_{kn} = \frac{k\mu_+ + (n - k)\mu_-}{n}$ and variance $\sigma_{kn}^2 = \frac{k\sigma_+^2 + (n - k)\sigma_-^2}{n^2}$; that is, $Y^p \mid A_{kn} \sim N(\mu_{kn}, \sigma_{kn}^2)$. The optimal cutoff points are obtained by solving the first and second order optimality conditions. Thus, a locally optimal solution is obtained, which turned out to be globally optimal in our numerical studies.
6. Computational Results

In this section, we assess the relative performance of four testing policies: individual testing (with optimal cutoff values derived from equations (35)-(36)), the heuristic Dorfman policy developed in Subsection 5.3, the optimal Dorfman policy derived from the dynamic programming algorithm and the optimal generalized group testing policy. A wide range of scenarios are considered by varying the seroprevalence and false negative cost. The derivation of the optimal policies is described in Subsection 6.1 and the Monte Carlo simulation model is specified in Subsection 6.2. The policies are tested on the simulation model in Subsection 6.3. In Subsection 6.4, we apply our model to the data from N'tita et al. (1991).

6.1. Computer Implementation

The heuristic Dorfman policy in Subsection 5.3 was implemented using Maple on a Sun Sparc station 20. The partial derivatives of $C(n, x, z)$ with respect to $x$ and $z$ were obtained using symbolic differentiation, and the stationary points were identified using the built-in routine solve. Only stationary points lying in the rectangle $[\mu_-, \mu_+] \times [\mu_-, \mu_+]$ were considered, and these points satisfied both the first and second order optimality conditions in all cases. A search over the integers was then employed to obtain the optimal group size $n$. The optimal group size was found to be bounded above by 20 for $\pi \ge 0.001$ and $c_{FN} \ge 100$; hence, the search was restricted to this region and a procedure similar to interval halving was used.

The implementation of the dynamic programming algorithm is more complex. The continuous state space must be truncated and discretized: our state space consisted of 200 equally spaced points in $[\mu_- - 6\sigma_-, \mu_+ + 6\sigma_+]$, with step size 0.074. Simpson's numerical integration rule with fixed interval size was employed to achieve four digit accuracy.

6.2. The Simulation Model
The analytical model in Section 5 assumes that LOD readings are normally distributed and employs the simplified pooling model (24); however, we believe that this model is not sufficiently realistic to provide a reliable assessment of the policies. Therefore, we resort to Monte Carlo simulation to obtain a more realistic model. However, building a simulation model for this problem is a nontrivial task: There are two possible sources of uncertainty in a pooled LOD reading, the variability in the individual antibody concentrations and the randomness in the manner in which antibodies are detected by ELISAs, and it is difficult to assess the relative impact of each source. Moreover, the pooling GLM (21), which predicts the normalized OD level of a pool as a function of the antibody concentrations of the individuals comprising the pool, cannot be directly simulated because the underlying antibody concentrations are unobservable. Consequently, we test the policies on two simulation models of varying complexity.

The simpler simulation model randomly generates LOD readings for positive and negative individuals from the empirical distributions of Dax that appear in Figure 2(b) and uses the deterministic pooling model (22). Taking $n = 1$ in equation (22) implies that an individual sample's antibody concentration $\rho_i$ is related to its normalized OD level $X_i$ by $\rho_i = \frac{k X_i}{1 - X_i}$. Substituting $\frac{k X_i}{1 - X_i}$ for $\rho_i$ in (22) gives

$$\frac{X}{1 - X} = \frac{1}{n} \sum_{i=1}^{n} \frac{X_i}{1 - X_i},$$

and hence the value of the parameter $k$ need not be estimated for the simulation model. This simulation model is more realistic than the analytical model in two ways: the pooling model (22) does not employ the linear approximation embedded in (24), and the LOD readings are drawn from the empirical distributions rather than the normal distributions.
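To make the difference between the two deterministic pooling rules concrete, the sketch below computes the pooled normalized OD reading under model (22), with $\rho_i = k X_i / (1 - X_i)$ substituted so that $k$ cancels, and under the linearized model (24), which averages LOD readings. The function names and the sample OD values are illustrative assumptions, not data from the studies.

```python
import math


def pooled_od_model_22(ods):
    """Model (22) with k eliminated: pooled odds = average of individual odds."""
    mean_odds = sum(x / (1.0 - x) for x in ods) / len(ods)
    return mean_odds / (1.0 + mean_odds)


def pooled_od_model_24(ods):
    """Model (24): pooled LOD (logit of OD) = average of individual LODs."""
    mean_lod = sum(math.log(x / (1.0 - x)) for x in ods) / len(ods)
    return 1.0 / (1.0 + math.exp(-mean_lod))


# Hypothetical pool: one strongly positive serum and nine negative sera.
example_pool = [0.69] + [0.008] * 9
exact = pooled_od_model_22(example_pool)
linearized = pooled_od_model_24(example_pool)
```

Because the average of the LODs is the logarithm of the geometric mean of the odds, `linearized` never exceeds `exact`, which is the conservative behavior of (24) noted in Subsection 4.2.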
Although the simulation model ignores the stochastic component of the GLM, both the variability in the antibody concentration and the uncertainty due to the binding mechanism of ELISA are embedded in the empirical distributions of Dax; hence, the simulation model attributes the variability of the LOD readings to the variability of antibody concentrations, and indirectly captures the second source of uncertainty.

The more complex simulation model incorporates the stochastic component of the GLM, and is derived from the additional assumptions that (a) the stochastic component of the GLM is negligible for HIV positive individuals and (b) the antibody concentration in HIV negative individuals is deterministic. The first assumption is motivated by the observation appearing near the end of Subsection 4.1 that the normalized OD reading for a HIV positive individual with a given antibody concentration has a very small coefficient of variation. To justify the second assumption, we recall that the normalized OD reading for a HIV negative individual with a given antibody concentration has a coefficient of variation roughly equal to one. The normalized OD readings for HIV negative individuals in Figure 2(a) have mean 0.0083 and standard deviation 0.0085, and hence coefficient of variation 1.016. Therefore, the variability of the normalized OD readings of HIV negative individuals is mostly due to the uncertainty in the binding mechanism that is captured in the stochastic component of the GLM, and consequently the variance of the antibody concentration of HIV negative individuals can be approximated by zero.

Let $p_-$ denote the deterministic antibody concentration of the HIV negative individuals. To estimate $p_-$ from the data, notice that equations (10) and (11) (with the logit function replacing the cloglog function in (11)) imply that the normalized OD readings for HIV negative individuals are $N\left(\frac{p_-}{k + p_-}, \frac{p_-}{(k + p_-)^2}\right)$. By setting the mean and variance of this normal distribution equal to the respective mean and variance of the empirical distribution in Figure 2(a), we obtain two equations and two unknowns. The solution to these equations is $p_- = 0.968$ and $k = 115.26$. The large discrepancy between the latter value and the estimated value of about 20 from Table 1 may be due to the fact that one estimate is obtained from individual testing data and the other is obtained from a dilution study. To generate the random antibody concentrations $p_i$ for HIV positive individuals, we randomly sample normalized OD readings $X_i$ from the empirical distribution in Figure 2(a), and then invert equation (20) to obtain $p_i = kX_i/(1 - X_i) = 115.26X_i/(1 - X_i)$. Then we use equations (10) and (21) to calculate the normalized OD reading $X$ for a pool of size $n$, where the antibody concentrations of the $n$ samples are generated from the seroprevalence and the two distributions specified earlier.

It is not clear to us which of the two simulation models is more realistic; although the more complex model incorporates the stochasticity in the GLM, its use of the normal distribution may lead one to favor the simpler model. It is reassuring to report that the simulation results for the two models are qualitatively nearly identical and quantitatively very similar (expected total costs are within 5% of each other). Hence, we only report the results for the simple simulation model, and then briefly comment on the results for the complex simulation model.

Now we describe the model parameters. The detailed cost estimates contained in the field study of Behets et al. are employed. They estimated the material and labor cost of testing a single sample to be $2.12, and the cost of testing a pool containing $n \ge 2$ samples to be $2.87 + $0.083n. Without loss of generality, we normalize these costs so that the cost of testing a single sample is $c(1) = 1$, and the cost of testing a pool of size $n \ge 2$ is $c(n) = 1.35 + 0.04n$.

The false positive cost $c_{FP}$ and the false negative cost $c_{FN}$ cannot be as easily estimated. To get a rough estimate for $c_{FP}$, we note that under the current Red Cross screening protocol, individuals that are found to be positive during an initial ELISA must undergo two additional ELISAs. If at least one of the additional tests is positive, then a highly specific test (Western Blot) is used to verify the individual's serological status. Since ELISA's specificity is 0.99, the probability that a noninfected individual with a positive initial ELISA requires a Western Blot test is approximately (assuming successive ELISA results are independent) $1 - 0.99^2$. A Western Blot test is approximately ten times as costly in materials and human labor as an ELISA test. Hence, the expected false positive cost under the Red Cross protocol is approximately $2 + 10(1 - 0.99^2) = 2.199$. This cost may underestimate the true false positive cost because successive ELISA results are not likely to be independent, and the Western Blot test may not be available in developing countries; in the latter case, an additional cost may be incurred, particularly if test results are reported to the individuals. Hence, we have chosen the conservative estimate of $c_{FP} = 5$. Since a false negative will contaminate the blood supply, false negative costs are much larger than false positive costs and are very difficult to quantify. Therefore, we consider four different values for $c_{FN}$ (100, 1000, 5000 and 10,000), and combined them with seven different values for the seroprevalence $\pi$ (ranging from 0.001 to 0.15) to generate 28 different scenarios that span a broad range of possible settings.
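The cost parameters above amount to a few lines of arithmetic. The following sketch reproduces the normalization of the Behets et al. cost estimates and the rough Red Cross false positive cost. The function names and default arguments are illustrative choices rather than anything defined in the paper; the figures are the ones quoted in the text.

```python
def normalized_testing_cost(n, single=2.12, pool_fixed=2.87, pool_per_sample=0.083):
    """Cost of testing a pool of size n, in units of one individual test.

    Dollar figures are the Behets et al. estimates quoted in the text;
    the paper rounds the result for n >= 2 to c(n) = 1.35 + 0.04 n.
    """
    if n == 1:
        return 1.0
    return (pool_fixed + pool_per_sample * n) / single


def red_cross_false_positive_cost(specificity=0.99, repeat_elisas=2, western_blot_cost=10.0):
    """Expected extra cost for a noninfected donor whose first ELISA is positive.

    Assumes independent repeat ELISAs; a Western Blot (costing about ten
    ELISAs) is run if at least one repeat ELISA is positive.
    """
    p_western_blot = 1.0 - specificity ** repeat_elisas
    return repeat_elisas + western_blot_cost * p_western_blot


if __name__ == "__main__":
    print(round(normalized_testing_cost(10), 3))      # close to 1.35 + 0.04 * 10
    print(round(red_cross_false_positive_cost(), 3))  # 2.199
```

The value 2.199 is the independence-based estimate from the text, which the paper then treats as a lower bound when choosing the more conservative false positive cost of 5.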
For each scenario of the simple simulation model, we randomly generated sample LOD readings using the seroprevalence $\pi$ and the normal distributions specified earlier. The simulation terminated at the first time after 10,000 simulated pools when the width of the 95% confidence interval for the expected cost dropped below 0.2. To avoid the possibility of sequential dependencies due to any inherent deficiencies of the Turbo Pascal random generator, the ran0 routine described in Chapter 7 of Press (1988) was used. All four policies were tested on the same random sequence of LOD readings.

6.3. Simulation Results

Dorfman Policies. We begin by comparing the individual testing policy, heuristic Dorfman policy and optimal Dorfman policy; later in this subsection, the generalized testing policy will be considered. Before assessing the policies' performance, we note that the optimal Dorfman policy turned out to be of the form: Continue at the first stage if the group LOD reading is either above a cutoff point or below a second, extremely small, cutoff point. This awkward form arises because under the proposed normal LOD distributions with $\sigma_+ > \sigma_-$, the far left tail of the HIV positive LOD reading eventually dominates the far left tail of the HIV negative LOD reading. However, this phenomenon is due solely to our normality assumption, and does not occur in Figure 2(b). Moreover, such a policy would never be implemented in practice (with good reason), and so we disallowed the option of continuing for extremely low LOD readings, and only report the performance of the Dorfman policy that was optimal within the class of cutoff policies defined in Subsection 5.3. The difference in performance between the overall optimal Dorfman policy and the optimal cutoff Dorfman policy was very small in our numerical study, and hereafter we refer to the optimal cutoff Dorfman policy as the optimal Dorfman policy.

Our main results are reported in Table 2, which describes the policies, and Table 3, which displays their performance in the simulation study. The first column in Table 2 enumerates the 28 scenarios, and the next two columns characterize the scenarios. The final column gives the pool size for each scenario, which was identical for the optimal Dorfman procedure and the heuristic Dorfman procedure. The remaining columns give the LOD cutoff points for the individual testing policy, and for both stages (the first stage is the pooled testing stage and the second stage is the individual testing stage) of both Dorfman procedures. For each scenario, Table 3 gives the 95% confidence interval for the expected total cost of each policy. The following three observations can be extracted from our numerical study:

(1) The optimal and heuristic Dorfman procedures are quite similar. They both employ identical group sizes for each scenario, and their cutoff points for each stage are relatively close in value in Table 2. Rather surprisingly, as seen in Table 3, the heuristic procedure outperforms the optimal procedure in 23 of the 28 cases, and the expected cost reduction for the heuristic procedure relative to the optimal procedure averages 8.1% over the 28 scenarios. As seen in Table 2, the optimal procedure is slightly more conservative in the choice of cutoff in the pooled testing stage, resulting in policies that are more sensitive, but require more testing. For low seroprevalence, the optimal Dorfman procedure seems to overcompensate for the dilution effect, so that the improved sensitivity does not counteract the resulting increase in monetary testing cost. The strong performance of the heuristic Dorfman procedure is noteworthy, since this policy is much easier to derive than the optimal Dorfman procedure.

(2) Group testing is optimal for all 28 scenarios, and significant savings over individual testing are achieved.
The expected cost reduction for the optimal Dorfman procedure relative to individual testing ranges from 5.9% to 77.4%, and the average relative cost reduction over the 28 scenarios in Table 3 is 39.3%. For the heuristic Dorfman policy, the expected cost reduction relative to individual testing ranges from 7.5% to 79.2% and averages 43.4%. The monetary testing cost is also significantly reduced; the average reduction relative to individual testing is 40% for the optimal policy and 46% for the heuristic policy. Moreover, although we do not show the numbers in Table 3, both Dorfman procedures are highly sensitive and specific. The average sensitivity and specificity over the 28 scenarios are 99.7% and 99.7% for the individual testing policy, 99.8% and 99.7% for the optimal Dorfman procedure, and 99.8% and 99.7% for the heuristic Dorfman procedure. Moreover, the sensitivity of the Dorfman procedures never dropped below 99%.

(3) In Table 2, the optimal cutoff values for individual testing are very similar to the optimal cutoff values for the individual testing stage of the two Dorfman procedures, and range from $-3.1$ to $-3.7$ (recall that $\mu_- = -4.82$, $\sigma_- = 0.42$, $\mu_+ = 0.80$ and $\sigma_+ = 1.05$). However, the optimal cutoff values are much lower, ranging from $-4.4$ to $-4.8$, for the pooled testing stage of the two Dorfman procedures. Hence, the Dorfman procedure is able to maintain its high test accuracy by a judicious choice of cutoffs at each stage; more specifically, the cutoff level is more conservative at the pooled testing stage to compensate for the dilution effect. In contrast, previous (field and statistical) researchers in pooled testing have assumed that the same test kit cutoff level is used at both stages; in particular, the cutoff level proposed by the manufacturer, which presumably is close to optimal for individual testing, is employed at both stages of the Dorfman procedure.

To assess the performance of the traditional Dorfman policy that has been considered in previous studies, we assume that the optimal cutoff level for individual testing (i.e., the fourth column of Table 2) is employed at both stages of the procedure. Under this assumption, the optimal value of the pool size $n$ was derived using the cost function (42). The optimal pool size was 15 for scenario 1 and two for the other 27 scenarios. The average expected cost reduction relative to individual testing was 12.78%, which is much smaller than the 39% to 43% reduction achieved by the proposed Dorfman procedures.

To illustrate the predictive power of our model with respect to the traditional Dorfman policy, we consider the study carried out by Behets et al. in Kinshasa, Zaire, where the HIV seroprevalence of the 8000 samples was 2.44%. The traditional Dorfman procedure with pools of size ten reduced the monetary cost of screening by 56% relative to individual testing; however, six low reactivity individuals were not detected. We used the Monte Carlo simulation model to calculate the performance of the traditional Dorfman procedure that employed the individual testing cutoff of scenario 15 at both stages. The expected sensitivity of the procedure was 96.4% and the expected monetary testing cost was 0.40 (recall that our testing cost function $c(n)$ is based on the cost model in Behets et al.); hence, our analysis predicts a 60% reduction in monetary testing cost and (0.025)(0.036)(8000) = 7 false negatives. Therefore, the model accurately captures both the magnitude of the cost savings and the extent of the dilution effect as manifested by the low reactivity individuals that are not detectable in pools. Under the heuristic Dorfman policy for scenario 13, the expected monetary testing cost is 0.43 and the sensitivity is 99.81%, and hence the expected number of false negatives is (0.025)(0.0019)(8000) = 0.38. Therefore, we predict that the heuristic Dorfman procedure would not have had any trouble detecting the low reactivity individuals.

Additional scenarios were considered to generate Figure 8, which provides switching curves depicting the optimal group size (as calculated by the heuristic Dorfman procedure) as a function of both the seroprevalence and the false negative cost. As expected, the group size is a decreasing function of both quantities; if the seroprevalence is high, then large group sizes will contain HIV positive individuals with high probability. Similarly, if the false negative cost is high, then smaller group sizes are required to diminish the impact of dilution. Notice that groups of size two or three are never optimal. This phenomenon is due to the pooled testing cost $c(n) = 1.35 + 0.04n$: The cost of constructing the pool is larger than the limited savings realized when the group size is two or three. Finally, in contrast to the case of perfect binary tests, where group testing is optimal if and only if the proportion of defective items is less than $(3 - \sqrt{5})/2 \approx 0.382$ (see Ungar 1960), group testing is only optimal in Figure 8 for significantly lower seroprevalences. The breakeven seroprevalence (between individual and group testing) in Figure 8 is a nonincreasing function of $c_{FN}$, and for $c_{FN} \ge 100$ it is less than or equal to 0.18. The gap between 0.18 and 0.382 is due to the form of our pooled testing cost $c(n)$ (the traditional cost is $c(n) = 1$ for all $n$) and the presence of statistical errors, which becomes increasingly important as seroprevalence increases and leads to a more conservative choice of group size.

The results for the complex simulation model described in Subsection 6.2, although not shown here, are consistent with the results from the simple simulation model. The average sensitivity and specificity over the 28 scenarios are 99.8% and 99.7% for the heuristic, and 99.8% and 99.8% for the optimal. Relative to the individual testing policy, the average expected cost reduction over the 28 scenarios is 42.4% for the optimal Dorfman procedure and 46.1% for the heuristic Dorfman procedure, which are slightly larger than the corresponding values in the simple simulation model. The only qualitatively different result for the more complex simulation model occurs in scenario 4, where the low seroprevalence and high false negative cost lead to a very conservative cutoff at the pooled testing stage for the optimal Dorfman procedure (see Table 2). Consequently, the optimal procedure performed too many individual tests and fared poorly. When the optimal pooled cutoff is increased from $-4.801$ to $-4.76$, the monetary testing cost drops drastically, while the sensitivity and specificity remain ...

[Figure 8. Switching curves for the optimal group size as a function of the seroprevalence and the false negative cost.]

... for two scenarios. For scenario 19, where the optimal Dorfman pool size is four, we let the initial pool size $n_1 = 4$, and consider the subpool configuration in Figure 7. For scenario 6, where the optimal Dorfman pool size is 8, we let the initial pool size $n_1 = 8$, and consider the subpool configurations (a) $N = 2$, $a_1 = 2$ and $a_2 = 4$, and (b) $N = 3$ and $a_1 = a_2 = a_3 = 2$. For scenario 19, the expected total cost of the generalized multistage procedure was $1.285 \pm 0.0098$, which is higher than the cost of either of the proposed Dorfman policies. For scenario 6, subpool configuration (a) outperforms configuration (b), and yields an expected total cost of $0.3558 \pm 0.014$. This corresponds to a 5.6% expected cost reduction relative to the optimal Dorfman procedure, and a 0.6% reduction over the heuristic policy.
In summary, although it is possible to obtain generalized multistage policies that outperform the Dorfman procedures, the additional improvement appears to be offset by the difficulty in deriving and implementing these policies.

6.4. Application

The numerical results in Subsection 6.3 (for example, the switching curves in Figure 8) cannot be universally applied for several reasons. Our numerical results depend upon the HIV positive and HIV negative distributions, which may differ across the world, due to seroconversion rates (a larger seroconversion rate may lead to a fatter left tail of the LOD readings for HIV positive individuals) or the particular strain of virus that is prevalent. Also, the testing cost $c(n)$ may depend upon various economic factors that are distinctive to each country. Nevertheless, we will loosely apply our results in a documented setting to obtain a rough estimate of the benefits that are achievable from group testing.

In Kinshasa, Zaire, of the 3741 units of blood transfused in February 1990, 1045 (27.9%) were not screened for HIV infection (see N'tita et al.). Assuming that this was a consequence of budget constraints, we can propose an alternative strategy that will reallocate funds across the transfusion centers so that every blood donor can be tested for antibodies to HIV. Since 72.1% of the units were individually tested, the monetary testing cost of the currently implemented policy is 0.721. Seroprevalence in Zaire is estimated to be about 2.5% (see Behets et al.). Suppose that the individual testing policy under scenario 15 was employed on 72.1% of the units. Since the expected sensitivity and specificity of this policy are 99.8% and 99.6%, respectively, the expected number of infected units that are transfused is (0.025)(1045) + (0.025)(0.002)(2696) = 26.25, and the expected number of false positives is (0.975)(0.004)(2696) = 10.24. If we use the heuristic Dorfman policy under scenario 15, then the monetary testing cost is 0.54, the expected sensitivity is 99.8% and the expected specificity is 99.6%. Hence, the expected number of infected units that are transfused is (0.025)(0.002)(3741) = 0.18, and the expected number of false positives is (0.975)(0.004)(3741) = 14.6. In summary, pooled testing in this setting reduces the monetary testing cost by 25% and reduces the expected number of infected units transfused from 26 to essentially zero. It is clear that pooled testing, if used properly, can save hundreds of lives worldwide.

7. Concluding Remarks

We have developed and validated a mathematical model that captures the dilution effect that occurs when HIV positive sera are pooled with HIV negative sera. The generalized linear models (13) and (21) develop new insights into the nature of the dilution effect, and avoid the heteroscedasticity problem that has plagued the traditional regression models obtained through a purely empirical approach. These GLMs and the simplified pooling model (24) may be useful for other applications besides group testing, and our general approach may be applicable whenever pooled testing is used to identify a disease or contaminant in a liquid.

Our numerical results suggest that the heuristic Dorfman policy derived in Subsection 5.3 provides a cost-effective, accurate and relatively simple alternative to currently implemented HIV screening protocols. This policy can be used in developing countries to safeguard the integrity of the blood supply, and consequently reduce the spread of the AIDS epidemic. While existing field studies and mathematical analyses assume that the same cutoff point is used to classify the pool at both stages of the Dorfman procedure, our analysis shows that only by selecting a different cutoff point at each stage can we ensure that the sensitivity of the test is not compromised.
Finally, HIV testing is also extensively used for seroprevalence estimation; in a companion paper, we show how the pooling model developed and validated here can be employed to derive efficient seroprevalence estimates.

Acknowledgment

We are very grateful to Barbara Cahoon-Young, Elizabeth Dax, Esther de Gourville and Richard George for providing data. We also thank Karla Ballman, Barbara Cahoon-Young, Elizabeth Dax, Richard George, David Heymann, Richard Kline, Eugene Litvak, Sheila Mitchell, Peter Page, Constantia Petrou, Chris Stowell, Hiko Tamashiro, and Guido van der Groen for helpful discussions about various aspects of pooled testing. This research is supported by National Science Foundation grant DDM-9057297 and American Foundation for AIDS Research (AmFAR) grant 02100-15-RG.

Appendix

Proof of Proposition 1. By definition, for s = 0, 1, 2, Vi>a+1 = 2* +1 (V£ +1 -aO (44) = 2 s+l ( -Y?3 + e hS+l -n) (45) = Vis + 2s (2ei>3+l -f J> ) (46) $= V_{is} + c_{i,s+1}$, (47) where $c_{i,s+1}$ is a random variable with zero mean.

Proof of Proposition 2. Since $E(V_{is}(x) - V_{i0}(x))^2$ is a positive semi-definite quadratic form, the minimum is derived from the first order conditions. We have

$E(V_{is}(x) - V_{i0}(x))^2 = E[(V_{is} - V_{i0} + (2^s - 1)(\mu - x))^2]$ (48)
$= E[(V_{is} - V_{i0})^2] + (2^s - 1)^2(\mu - x)^2 + 2E[(V_{is} - V_{i0})(2^s - 1)(\mu - x)]$, (49)

and therefore

$\frac{d}{dx}E(V_{is}(x) - V_{i0}(x))^2 = -2(2^s - 1)E[(V_{is} - V_{i0})] + 2(2^s - 1)^2(x - \mu) = 2(2^s - 1)^2(x - \mu)$,

since $E(V_{is}) = E(V_{i0})$ by Proposition 1. Thus, the minimum is attained at $x = \mu$.

Proof of Proposition 3. The result follows by calculating $E(\hat{\mu})$ in equation (29) and using Proposition 1, i.e., $E(V_{is} - V_{i0}) = 0$. (51)

Proof of Proposition 4. We want to obtain the set of nonnegative weights $w_s$ minimizing the sample variance of the estimator $\hat{\mu}$. Equation (29) can be reexpressed as
(50) )(y<»- Wo) ^7T^ 2 10£LiMl-2*) (° 2 ) • ] Since different samples are independent, the optimal choice of weights is given by the solution to the following minimization problem: minimize subject to 3 E[£, s=1 YL]=\ w w s 3 (l {\ - - 2 2"){Via s 2 ) = w >0. s 51 - Vi0 )] 2 constant (53) (54) The minimization problem can be simplified = w s (l — 2 s ), by defining x 3 so that the objective function (53) becomes 3 3 £K*2(Vi. - Ko) 2 3=1 If k > 3 + 2EQ2 Yl x k x,{Vu - Vi0 )(Vik - ] (55) then s, -V E[(Vis )(Vlk l0 -V -V l0 = E[(VU = Vij. (58) and (59) follow Combining equations (55) )\Vls }} (56) )E[(Vlk -V )\Vls (57) l0 l0 2 Vio) (58) ] (59) total probability for conditional expectations, from the martingale property of the and } Var(K,)- Equation (56) follows from the law of and equations -V l0 ls ls )(Vlk -V = E[E[(V l0 )} = E[(V walk Ko)]. s=\k=s+\ driftless random (59) gives the following, simplified version of the minimization problem: minimize 2 £'=i x 3 Va,r(Vis ) + 2 £'=i (xaVarV^EJL+i ELi( 2 ' - subject to l)x, xs > where c is solution is 60 ) (61) 0, obtained by formulating the Lagrangian 3 L(x,\) = £x 3 2 Vai(V ,) i 3 + 2Y,(x a Vai(Via *=1 3=1 The Karush-Kuhn-Tucker optimality ws c ( the normalization constant. The optimal weights = **)) satisfying ££- = and £ ) 3 x )+2A(c- £(2' - l)x a ). fc conditions state that (61), then w is (62) 3=1 k=S+l if there exists A and nonnegative the optimal solution of the minimization problem. This derivative can be written as Var(K 5 ) £ x + £ ^Var(Kifc) = k k=s fc=l 52 A(2* - 1). (63) By the orthogonality property of martingales (Williams 1991, Chapter 12), Var(Vls ) for all s > Combining equation k. Vai(Via ) £x The Karush-Kuhn-Tucker notation. Let U vector defined by unit vector. be the us = 2 3x3 s — The optimality 1, = Var(t&) and (63) + Vax(Vu - Vik ) (64) = (64), the condition J^- + J2 x k Va.r(Vl3 - Vik ) = X{V - k Us k = matrix defined by 3x1 reformulated as (65) 1). 
more conveniently formulated using matrix conditions are v be the is Var(V„ — V,* )l( r : vector given by v s = ,>fc), J u be the Var(Vi s ) and e be the 3x1 3x1 conditions can then be expressed as T (x e)v + Ux = Xu (66) T u x = c (67) x > 0. (68) In order to obtain the optimal vector x, the vector quantities U and v should be determined. For the random walk model described in Subsection 4.3, v3 From the first row of equation (66), = a 2 (F-l). we obtain ax T e = since the first and (70), row of X, U is the zero vector and Ui = equation (66) becomes (69) Ux = 0. 1. Hence, Xi (70) Therefore, by combining equations (69) = X2 = 0,X3 = f the optimality conditions. This completes the proof of Proposition of estimator and A 4. = 2 cr X3 satisfy The most efficient is p-_ - Et = l( 8 ^"iN - Vjp) / • 70 53 \ 71 i'i; Lemma Proof of N; assume that expectations and the fact that E is equal to E\E[I, Yk&p \E[Ly €P i|Vfc]|Yjfc_i , Jj(Xj) is by induction. The statement €P \ \Sk]\Sk-\ By • , = when Corollary The proof is this expression 2, latest LOD observation j = for conditional we can and by the induction hypothesis, the is right equivalent to B Yfc-i- by induction. Recall that from equation (34) 5, = min{aJ+1 (c(nJ+2 + E[J]+l (YJ+l )\Yj}), ) nJ+l cFN P(Y, e P+IYfrnj+iCppPiYj € Let Jj{Yj) true individuals in the pool are indistinguishable, which depends only on the 6: is Using the law of total probability k. AYk ]\Sk-\ Proof of Proposition and Proposition all = E E[L Y write £[/ry- €P JSfc_i] side = true for j is it The proof 1: JjiYj-nj+iCFNPiYj g P+\YN ). By the law P+ |^-). and P-lYj)} j for = 0,... ,N- 1. 
(72) F(YN = c FP P(YN g P_\YN )-c FN P(YN g ) of total probability and the fact that all individuals in the pool are indistinguishable, H aj+l E[Jj +l (Yj+l )\Yj\ -n j+l CF N P(YjeP+ \Yj) = a J+ \E Jj + i(Yj + i) = a J+1 E[JJ+1 (yj+1 )l^l, — ti j+ 2CfnP{Yj+\ g P+\Y]+\)\Y3 (73) and nj+l = (c FP P(Yj n]+l E e PAY,) - CFNPiYj e P+\Yj)) [c FP P(YN g P.\YN ) c F nP(Yn e P+ \YN )\Y,_ = nj+1 E[F(YN )\Yj}. (74) Subtracting Uj + iCfnP(Y] g P+\Yj) from both sides of (71) yields Jj (Yj ) = irw{aj+ Mnj+2) + E[jj+1 (Yj+l )\Y^ (75) Straightforward algebraic manipulations show that F(YN = hence, F(Y)v) is J\ (Yn) . ) (76) monotone nonincreasing by the assumed monotonicity of min{0, F(V,v)}, the function Jn(Yn) y-. Since J/v(V/v) = monotone nonincreasing and the unique root of is F(V^v) gives the optimal cutoff c^ for stage N. To prove Jj+i(Yj+\) on the {Yj RJ = {Yj < cj} Yj : Yj < c~). Moreover, Jj(Yj) is /R P(X > 7. It is J3 {Yj) x )\Z = Zl ] known [ P(h(Xi) > x\Z = I {P{Xi <h-\x)\Z = = if and only if exists cj (the Y < 3 c~; thus, z2] z )dx x z l : Yj > first c+}. E{X) = J^P{X > x)dx and E(X\Z) = that - E[h{Xi)\Z = = < us assume that let property, the nonzero terms = Rf = {Y} x\Z)dx. For a nondecreasing function h and E[h{X , monotone nonincreasing. This completes the part of the proof. Similar arguments establish that Proof of Proposition some cj monotone nonincreasing. Therefore, there of the roots of the two terms) such that : for Then by the Mon(j') monotone nonincreasing. right side of (74) are also minimum RJ = is inductively that Z\ > Z2, = - f P(h(Xi) > x\Z = )-P(X <h-\x)\Z = l z 2 )dx (77) z 2 )}dx (78) 0.

References

Arnold, S. F. 1977. Generalized Group Testing. Annals of Statistics 5, 1170-1182.

Behets, F., S. Bertozzi, M. Kasali, M. Kashamuka, L. Atikala, C. Brown, R. W. Ryder and C. Quinn. 1990. Successful use of Pooled Sera to Determine HIV-1 Seroprevalence in Zaire with Development of Cost-Efficiency Models. AIDS 4, 737-741.

Burns, K. C. and C. A. Mauro. 1987. Group Testing with Test Error as a Function of Concentration. Commun. Statist.-Theory Meth. 16, 2821-2837.

Cahoon-Young, B., A. Chandler, T. Livermore, J. Gaudino and R. Benjamin. 1989. Sensitivity and Specificity of Pooled Versus Individual Sera in Human Immunodeficiency Virus Antibody Prevalence Study. J. Clinical Microbiology 27, 1893-1895.

Cahoon-Young, B., A. Chandler, T. Livermore, J. Gaudino and R. Benjamin. 1992. Optimal Pool Size for Determination of HIV Prevalence in Low Risk Populations. Presented at the HIV/AIDS Surveillance Workshop, South San Francisco, CA.

Cox, D. R. and D. V. Hinkley. 1974. Theoretical Statistics. Chapman and Hall, London.

Dax, E. M. Director, National HIV Reference Laboratory, Melbourne, Australia. 1993. Private Correspondence.

de Gourville, E. Research Associate, CAREC, Trinidad W.I. 1992. Private Correspondence.

Dorfman, R. 1943. The Detection of Defective Members of Large Populations. Ann. Math. Stat. 14, 436-441.

Emmanuel, J. C., M. T. Bassett, H. J. Smith and J. A. Jacobs. 1988. Pooling of Sera for Human Immunodeficiency Virus (HIV) Testing: An Economical Method for use in Developing Countries. J. Clinical Pathology 41, 582-585.

Fisher, R. A. 1922. On the Mathematical Foundations of Theoretical Statistics. Phil. Trans. R. Soc. 222, 309-368.

George, J. R. Chief, Dev. Technology Section, Division of HIV/AIDS, Center for Disease Control, Atlanta, GA. 1992. Private Correspondence.

George, J. R. and G. Schochetman. 1985. Serological Tests for the Detection of HIV Infection. In AIDS Testing, Methodology and Management Issues, G. Schochetman and J. R. George (ed.), Springer-Verlag, New York, 49-69.

Hastie, T. J. and D. Pregibon. 1992. Generalized Linear Models. In Statistical Models in S, J. M. Chambers and T. J. Hastie (ed.). Wadsworth & Brooks/Cole Computer Science Series, California, 195-247.

Hull, B. 1991. Serum Pooling for HIV Screening in Trinidad and Tobago. Caribbean Epidemiology Center Technical Report.

Hwang, F. K. 1976. Group Testing with a Dilution Effect. Biometrika 63, 671-673.

Hwang, F. K. 1984. Robust Group Testing. J. Quality Technology 16, 189-195.

Johnson, N. L., S. Kotz and X. Wu. 1991. Inspection Errors for Attributes in Quality Control. Chapman and Hall, London.

Kline, R. L., T. A. Brothers, R. Brookmeyer, S. Zeger and T. C. Quinn. 1989. Evaluation of Human Immunodeficiency Virus Seroprevalence in Population Surveys using Pooled Sera. J. Clinical Microbiology 27, 1449-1452.

Ledro-Monroy, G. and E. Archbold. 1990. HIV Serum Pooling Study. Cruz Roja Ecuatoriana.

Litvak, E., X. M. Tu and M. Pagano. 1992. Screening for the Presence of HIV by Pooling Sera Samples: Simplified Procedures. Working Paper, Harvard School of Public Health.

Madansky, A. 1988. Prescriptions for Working Statisticians. Springer-Verlag, New York.

McCullagh, P. and J. A. Nelder. 1989. Generalized Linear Models, 2nd Edition. Chapman and Hall, London.

N'tita, I., K. Mulunga, C. Dulat, D. Lusamba, T. Rehle, R. Korte and H. Jagger. 1991. Risk of Transfusion-Associated HIV Transmission in Kinshasa, Zaire. AIDS 5, 437-439.

Press, W. H., B. P. Flannery, S. A. Teukolsky and W. T. Vetterling. 1988. Numerical Recipes in C, The Art of Scientific Computing. Cambridge University Press, Cambridge.

Tamashiro, H., W. Maskill, J. Emmanuel, A. Fauquex, P. Sato and D. Heymann. 1993. Reducing the cost of HIV antibody testing. Lancet 342, 87-90.

Thompson, K. H. 1962. Estimation of the proportion of vectors in a natural population of insects. Biometrics 18, 568-578.

Tijssen, P. 1985. Laboratory Techniques in Biochemistry and Molecular Biology. Vol. 15. Elsevier, Amsterdam.

Ungar, P. 1960. The cutoff point for group testing. Communications on Pure and Applied Mathematics 13, 49-54.

Williams, D. 1991. Probability with Martingales. Cambridge University Press, Cambridge, England.