The Waiting Time Paradox and biases in infectious disease observational data

Ping Yan
Lecture at Summer School on Mathematics of Infectious Diseases Program
Centre for Disease Modelling, York University

Outline

1. Data in infectious disease studies are often observational; they do not follow the hypothesis-driven design of experiments with repetition and randomization.
2. Advanced statistical methods involve statistical modelling. While many infectious disease models focus on the epidemiological aspects (the transmission process), statistical models focus on the stochastic mechanism from which data are generated (the data generating process).
3. The two types of models need to be well integrated. For statistical estimation purposes, even with statistical models (such as conditioning) well intended to capture the length-bias, important information is still lost without modelling the underlying transmission process.
4. Conversely, without statistical modelling of the data generation process, important input parameters in mathematical models may be severely biased when based on naïve (or not-so-naïve) statistical methods.

The Waiting Time Paradox (Feller 1966)
(just to show how classic it is in introductory probability textbooks)

1. Buses arrive at a constant rate $\lambda$; the inter-arrival times $X$ are independently and identically distributed (iid) with mean $\mu = 1/\lambda$.
2. An "inspector" (or a customer) inspects "at random", so that the inspection time $t$ is uniformly distributed between the last bus and the next bus.

Question: what is the expected waiting time $W_t$ to the next bus arrival?

Argument 1: The inspection time $t$ is uniformly distributed between two buses, so by symmetry $E[W] = \tfrac{1}{2}\mu$.

Argument 2: Because buses arrive at constant rate $\lambda$, the $X$'s are iid exponentially distributed.
• The "memoryless" property of the exponential distribution implies that the remaining time to the next bus follows the same exponential distribution, thus $E[W] = 1/\lambda = \mu$.
• If $E[W] = \mu$, shouldn't $E[X^{(B)}] = 2E[W] = 2\mu$? Paradox!
Didn't we assume that the $X$'s are independently and identically distributed with mean $\mu$?

Variation matters. Let $cv = \sqrt{\mathrm{Var}[X]}/\mu$ be the coefficient of variation, and let $X^{(B)}$ be the duration from the last bus to the next bus as seen by the inspector.

The distribution of $X^{(B)}$ is different from that of $X$ (a paradox, since the $X$'s are assumed to be iid). It is the length-biased distribution with p.d.f.

$$f^{(B)}(x) = \frac{x f_X(x)}{\mu}, \qquad E[X^{(B)}] = \mu(1 + cv^2).$$

The waiting time $W$ has p.d.f.

$$f_W(x) = \frac{\bar F_X(x)}{\mu}, \qquad E[W] = \tfrac{1}{2}\mu(1 + cv^2),$$

where $\bar F_X(x) = \Pr\{X > x\}$. The factor $\tfrac{1}{2}$ comes from symmetry with respect to $X^{(B)}$.

• $E[W] = \tfrac{1}{2}\mu$ is correct if there is no variation in inter-arrival times, $cv = 0$ (if buses are as punctual as Swiss trains).
• $E[W] = \mu$ is correct if the variance of inter-arrival times satisfies $cv = 1$ (TTC seems to be worse than this).

The Waiting Time Paradox and bias in observational data

A different way of looking at the same problem:

1. The initiating event occurs at a constant rate (i.e. the time of occurrence is uniform in any given time interval).
2. The duration $X$ is independent of the random process that generates the initiating event.
3. The durations $X$ are iid with p.d.f. $f_X(x)$ and mean $\mu$.
4. At a snapshot, only those who have experienced the initiating event but not the subsequent event are included in the sample, with observed duration $X^{(B)}$.

A sample containing only observations of $X^{(B)}$ is called a "prevalence cohort". The distribution in a prevalence cohort has p.d.f. and mean

$$f^{(B)}(x) = \frac{x f_X(x)}{\mu}, \qquad E[X^{(B)}] = \mu(1 + cv^2),$$

because those with longer durations have a greater chance of being included in the data.

Observational data arising in a prevalence cohort

Assume the durations $X$ are iid with p.d.f. $f_X(x)$ and mean $\mu$. Under equilibrium, the initiating event occurs at a constant rate.

1. The observed duration is length-biased: $X^{(B)}$ has p.d.f. $f^{(B)}(x) = x f_X(x)/\mu$, and $W$ has p.d.f. $f_W(x) = \bar F_X(x)/\mu$, where $\bar F_X(x) = \Pr\{X > x\}$. Naïve estimation of the distribution of $X$ (e.g. incubation time, survival time, etc.) based on such prevalence cohort data leads to over-estimation.
2. The prevalence estimate is size-biased, i.e. the sample is size-biased in favour of cohorts with larger prevalence, where prevalence = # or % {individuals who have experienced the initial event but not the subsequent event}. Under equilibrium, prevalence = incidence × duration. The length-bias in the observed duration leads to "size-bias" in the sampled prevalence.

Waiting Time Paradox in disease screening via repeated testing

1. Replace buses with repeated testing: the inter-testing intervals $X$ are iid with mean $\mu$.
2. Replace the "inspector" by sero-conversion, which, under equilibrium, occurs at a constant rate, so that within any time interval (between two tests) in which a sero-conversion occurs, the sero-conversion time is uniformly distributed.

$X^{(B)}$ = duration from the last (negative) test to the next (positive) test covering a sero-conversion. $X^{(B)}$ has the length-biased distribution, with $E[X^{(B)}] = \mu(1 + cv^2)$.

The average waiting time from sero-conversion to the next (positive) test is $E[W] = \tfrac{1}{2}\mu(1 + cv^2)$.

If we add an average "window period" $\nu$ from infection to sero-conversion, the mean duration becomes $\nu + \tfrac{1}{2}\mu(1 + cv^2)$, and the prevalence of infected but not yet tested individuals (the queue), prevalence = incidence × mean duration, is

$$p_u = \lambda\left[\nu + \tfrac{1}{2}\mu(1 + cv^2)\right] \quad \text{(under equilibrium conditions)},$$

where $\lambda$ is the incidence rate. Keeping $\lambda$ and $\nu$ unchanged, the testing strategy, through $\mu(1 + cv^2)$, determines $p_u$.

Each infected but not yet tested individual (in $p_u$) may be associated with a cost $c$ to society. Each test is associated with a cost $\kappa$. Both costs are determined according to different contexts.
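The length-bias formulas used above, $E[X^{(B)}] = \mu(1 + cv^2)$ and $E[W] = \tfrac{1}{2}\mu(1 + cv^2)$, can be checked with a small simulation. The sketch below is illustrative only: it assumes gamma-distributed intervals between buses (or tests), and the function name `simulate_inspection` and all parameter values are hypothetical choices, not part of the lecture.

```python
import bisect
import random


def simulate_inspection(mu, cv, n_intervals=200_000, n_inspections=50_000, seed=1):
    """Return simulated (E[X_B], E[W]) for an inspector arriving uniformly in time.

    Assumption for illustration: intervals are gamma distributed with mean
    `mu` and coefficient of variation `cv` (cv > 0).
    """
    rng = random.Random(seed)
    shape = 1.0 / cv**2      # a gamma distribution with shape k has cv = 1/sqrt(k)
    scale = mu / shape       # mean = shape * scale = mu

    # Arrival times of the buses (or tests): cumulative sums of iid intervals.
    arrivals = []
    t = 0.0
    for _ in range(n_intervals):
        t += rng.gammavariate(shape, scale)
        arrivals.append(t)

    total_xb = 0.0
    total_w = 0.0
    last = len(arrivals) - 1
    for _ in range(n_inspections):
        s = rng.uniform(arrivals[0], arrivals[-1])       # inspection time
        j = min(bisect.bisect_right(arrivals, s), last)  # next arrival after s
        total_xb += arrivals[j] - arrivals[j - 1]        # covering interval X_B
        total_w += arrivals[j] - s                       # wait to next arrival W
    return total_xb / n_inspections, total_w / n_inspections


if __name__ == "__main__":
    for cv in (0.25, 1.0, 1.5):
        xb, w = simulate_inspection(mu=1.0, cv=cv)
        print(f"cv={cv}: E[X_B] ~ {xb:.3f} (theory {1 + cv**2:.3f}), "
              f"E[W] ~ {w:.3f} (theory {0.5 * (1 + cv**2):.3f})")
```

With small `cv` the simulated wait is close to half the mean interval (Argument 1); with `cv = 1` it is close to the full mean interval (Argument 2); with `cv > 1` it exceeds the mean interval.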
Objective: Under different scenarios of infection incidence, determine the optimal testing frequency so that the queue of infected but untested individuals is reduced to satisfy a cost-effectiveness criterion.

Generally, the larger the incidence rate $\lambda$, the more cost-effective more frequent testing becomes. Cost-effectiveness is compromised if there is large variation between inter-testing intervals or among individuals.

The Waiting Time Paradox as seen in R0 formulation

An infected individual produces new infections according to a counting process with intensity $k(x)$.

$N$ = # of infections produced by a typical infectious individual when seeded into an infinitely large susceptible population.

$R_0$ = mean value of $N$, which can be expressed as $R_0 = \int_0^\infty k(x)\,dx$.

The premise: if $R_0 > 1$, there exists $\rho > 0$ (the Malthusian number, describing the early exponential growth) such that

$$\int_0^\infty e^{-\rho x} k(x)\,dx = 1.$$

Re-write:

$$\int_0^\infty e^{-\rho x}\,\frac{k(x)}{R_0}\,dx = \int_0^\infty e^{-\rho x} g(x)\,dx = R_0^{-1}, \quad \text{then} \quad R_0 = \frac{1}{L_g(\rho)},$$

where $L_g(\rho)$ is the Laplace transform of $g(x)$.

Postulate: $g(x) = k(x)/R_0$, satisfying $\int_0^\infty g(x)\,dx = 1$, is the p.d.f. of a well defined random variable with epidemiologic meaning. (Ref: Wallinga and Lipsitch, 2007; Heesterbeek and Roberts, 2007)

If true, the tasks:

1. assessing the meaning of this random variable;
2. assessing whether it is observable;
3. if observable, collect data and estimate $g(x)$;
4. estimate $\rho$ separately (usually via curve fitting);
5. evaluate the Laplace transform $L_g(\rho)$, analytically or numerically.

In the above, there is no assumption about the integrand $k(x)$, i.e. about the model for the instantaneous rate $k(x)$ at time $x$. However, in order to assess task 1, we put it into a structured model framework: the SEIR.
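Tasks 4 and 5 can be illustrated numerically. The following is a minimal sketch, assuming (purely for illustration) that $g(x)$ is a gamma p.d.f. with a given mean and coefficient of variation and that $\rho$ has already been estimated; the helper names and the numbers are hypothetical. It evaluates $L_g(\rho)$ by a midpoint rule and returns $R_0 = 1/L_g(\rho)$, which for a gamma p.d.f. can be compared with the closed form $(1 + cv^2 \rho\,\text{mean})^{1/cv^2}$.

```python
import math


def gamma_pdf(x, mean, cv):
    """Gamma p.d.f. with the given mean and coefficient of variation, for x > 0."""
    shape = 1.0 / cv**2
    scale = mean / shape
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)


def laplace_transform(pdf, s, upper, n=20_000):
    """Midpoint-rule approximation of the integral of exp(-s x) pdf(x) over (0, upper)."""
    h = upper / n
    return h * sum(math.exp(-s * (i + 0.5) * h) * pdf((i + 0.5) * h)
                   for i in range(n))


def r0_from_growth_rate(rho, pdf, upper):
    """R0 = 1 / L_g(rho), with L_g evaluated numerically."""
    return 1.0 / laplace_transform(pdf, rho, upper)


if __name__ == "__main__":
    rho = 0.1                 # hypothetical growth rate per day
    mean, cv = 5.0, 0.5       # hypothetical g(x): mean 5 days, cv 0.5
    numeric = r0_from_growth_rate(rho, lambda x: gamma_pdf(x, mean, cv), upper=20 * mean)
    closed = (1.0 + cv**2 * rho * mean) ** (1.0 / cv**2)
    print(f"numerical R0 = {numeric:.5f}, closed form = {closed:.5f}")
```

The integration limit `upper` truncates the integral, so it should be chosen well into the tail of $g(x)$.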
Assuming $R_0 > 1$:

1. In the SIR model (S → I → R), with exponentially distributed infectious period,

$$R_0 = \frac{1}{L_g(\rho)} = 1 + \rho\mu_I, \qquad L_g(\rho) = \frac{1}{1 + \rho\mu_I},$$

where $L_g(\rho)$ is the Laplace transform of the exponential distribution with mean $\mu_I$. In this case, $g(x)$ is the p.d.f. of the infectious period.

2. In the SEIR model (S → E → I → R), with both the latent period and the infectious period being exponentially distributed,

$$R_0 = \frac{1}{L_g(\rho)} = (1 + \rho\mu_E)(1 + \rho\mu_I), \qquad L_g(\rho) = \frac{1}{1 + \rho\mu_E}\cdot\frac{1}{1 + \rho\mu_I},$$

where $L_g(\rho)$ is the product of two Laplace transforms of exponential distributions. In this case, $g(x)$ is the p.d.f. of the sum of the latent period and the infectious period. Anderson and May (1991): generation time = latent period + infectious period.

In the SEIR model, if the latent period $T_E$ and the infectious period $T_I$ are arbitrarily distributed (with specific distributions),

$$R_0 = \frac{\rho\mu_I}{L_E(\rho)\,[1 - L_I(\rho)]} \qquad \text{(Yan, 2007)}$$

where $L_E(s)$ is the Laplace transform for the latent period and $L_I(s)$ is the Laplace transform for the infectious period, including:

• $R_0 = 1 + \rho\mu_I$ (no latent period, exponentially distributed infectious period)
• $R_0 = (1 + \rho\mu_E)(1 + \rho\mu_I)$ (exponentially distributed latent and infectious periods)
• $R_0 = \dfrac{\rho\mu_I\,(1 + cv_E^2\,\rho\mu_E)^{1/cv_E^2}}{1 - (1 + cv_I^2\,\rho\mu_I)^{-1/cv_I^2}}$ (gamma distributed latent and infectious periods, Anderson and Watson, 1980)

where $\mu_E$, $\mu_I$ are the mean values of the latent and infectious periods, and $cv_E$, $cv_I$ are the corresponding coefficients of variation.
In the SEIR model, if the latent period and the infectious period are arbitrarily distributed,

$$R_0 = \frac{\rho\mu_I}{L_E(\rho)\,[1 - L_I(\rho)]},$$

so that

$$\frac{1}{R_0} = L_g(\rho) = L_E(\rho)\cdot\frac{1 - L_I(\rho)}{\rho\mu_I} = \int_0^\infty e^{-\rho x} f_E(x)\,dx \cdot \int_0^\infty e^{-\rho x}\,\frac{\bar F_I(x)}{\mu_I}\,dx.$$

Here $f_E(x)$ is the p.d.f. of the latent period $T_E$, and $\bar F_I(x)/\mu_I$ is the p.d.f. of $W$ in the length-biased infectious period from a snapshot point of view.

$g(x)$ is the p.d.f. of $T_E + W$, with mean value $\mu_E + \tfrac{1}{2}\mu_I(1 + cv^2)$.

Call it the generation time:

• if $cv = 1$, it is consistent with that by Anderson and May (1991);
• if $cv = 0$, it is consistent with that by Gani and Daly (2001): mean latent period + half of the mean infectious period.

Fine (2003): the latent period + part of the infectious period ...

• not exactly; one needs to emphasize the length-biased infectious period;
• this part could be even longer than the "natural" infectious period if $cv > 1$.

For task 1 above: $g(x)$ is the p.d.f. of the generation time, defined as the latent period plus part of the length-biased infectious period, with mean value $\mu_E + \tfrac{1}{2}\mu_I(1 + cv^2)$.

In the above definition, the generation time does not involve:

• another individual;
• the transmission process.

The "snapshot" may be thought of as the time of infection of an infectee in relation to the infectious period of its infector; in theory, however, it could be a snapshot by any "inspector".

From Wallinga and Lipsitch (2007):

Svensson (2007) made the distinction:

i. from the infection time of the infector looking forward to the infection time of the infectee;
ii. from the infection time of the infectee looking back to the infection time of the infector.

It seems that, if we assign the "snapshot" as the time of infection of an infectee in relation to the infectious period of its infector, then the generation interval in Wallinga and Lipsitch (2007) should be understood in the sense of (ii) in Svensson (2007).

But ... there are strings attached ...

If we associate $T_E + W$ with the generation interval in Wallinga and Lipsitch (2007) and understand it in the sense of (ii) in Svensson (2007), there are hidden assumptions.

• The infection times of infectees must be exchangeable, so that any randomly chosen infectee (if more than one), while looking back, gives the same distribution for $T_E + W$.
• The infectious period contains $\geq 1$ infectee, hence it is length-biased, with mean $\mu_I(1 + cv^2)$.
• The infection times of infectees must be uniformly distributed in the (length-biased) infectious period, so that $W$ has p.d.f. $\bar F_I(x)/\mu_I$ with mean $\tfrac{1}{2}\mu_I(1 + cv^2)$, i.e. symmetry.
• The system is at equilibrium, so that infectors arrive at a constant rate.

Things that I don't understand: in $R_0 = 1/L_g(\rho)$, where $L_g(\rho)$ is the Laplace transform of the distribution of $T_E + W$ with mean $\mu_E + \tfrac{1}{2}\mu_I(1 + cv^2)$,

1. $T_E + W$ has a valid interpretation at equilibrium;
2. $\rho$ is the Malthusian number, far from equilibrium.

This puzzle further leads to the observation problem: can we collect data at the early phase of an outbreak and use the above theory?
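The relation $R_0 = \rho\mu_I / \{L_E(\rho)[1 - L_I(\rho)]\}$ and the mean generation time $\mu_E + \tfrac{1}{2}\mu_I(1 + cv^2)$ are easy to evaluate once distributions are chosen. The sketch below assumes gamma-distributed latent and infectious periods, parameterized by mean and coefficient of variation (with `cv = 0` read as a fixed duration); the function names and numerical values are hypothetical and only meant to show how the special cases listed earlier fall out.

```python
import math


def gamma_laplace(s, mean, cv):
    """Laplace transform E[exp(-s T)] of a gamma duration T with given mean and cv.

    cv = 0 is treated as a fixed (non-random) duration equal to `mean`.
    """
    if cv == 0.0:
        return math.exp(-s * mean)
    return (1.0 + cv**2 * mean * s) ** (-1.0 / cv**2)


def r0_seir(rho, mu_e, cv_e, mu_i, cv_i):
    """R0 = rho * mu_I / ( L_E(rho) * [1 - L_I(rho)] )."""
    l_e = gamma_laplace(rho, mu_e, cv_e)
    l_i = gamma_laplace(rho, mu_i, cv_i)
    return rho * mu_i / (l_e * (1.0 - l_i))


def mean_generation_time(mu_e, mu_i, cv_i):
    """Mean of T_E + W: latent period plus part of the length-biased infectious period."""
    return mu_e + 0.5 * mu_i * (1.0 + cv_i**2)


if __name__ == "__main__":
    rho, mu_e, mu_i = 0.1, 2.0, 4.0   # hypothetical values (per day, days, days)
    for cv in (0.0, 0.5, 1.0):
        print(f"cv={cv}: R0 = {r0_seir(rho, mu_e, cv, mu_i, cv):.4f}, "
              f"mean generation time = {mean_generation_time(mu_e, mu_i, cv):.2f}")
```

For the same $\rho$ and the same means, the value of $R_0$ implied by the formula changes with the assumed variation, which is why the distributional assumptions matter when $R_0$ is derived from an observed growth rate.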
A generalization of the Waiting Time Paradox: left-truncation

Moving away from $R_0 = 1/L_g(\rho)$, towards general observation bias without being in equilibrium.

The initial event occurs over time $t$ following a random process with intensity $\lambda(t)$.

• Individuals who have experienced the initial event are enrolled at time $t + E$.
• Individuals are followed until an endpoint event, taking place at time $t + X^{(B)}$.

Previously (the Waiting Time Paradox), we

• assumed equilibrium, $\lambda(t) = \lambda$;
• called "enrolment" a "snapshot", assumed to be uniformly distributed in any fixed time interval;
• had the time from the initial event to the observed endpoint, $X^{(B)}$, follow the p.d.f. $x f_X(x)/\mu$;
• had the time from the initial event to enrolment, $E$, follow the p.d.f. $\bar F_X(e)/\mu$.

Generalization:

• $\lambda(t)$ is not constant;
• enrolment is random, independent of the random process of the initiating event.

This observation scheme is subject to left-truncation. The same issue: the observed $X^{(B)}$ is length-biased.

The objective: estimating the distribution of the duration $X$ between the two events. The observed $X^{(B)}$ is length-biased, in favour of longer durations.

Naïve analyses, treating $X^{(B)}$ as if it were $X$ from designed experiments, lead to over-estimation.

Not-so-naïve method through conditioning: $X^{(B)}$ arises from the conditional distribution of $X$ given $X > E$, because eligibility for enrolment requires not having experienced the endpoint event at time $t + E$. Statistical methods are based on the conditional density

$$\frac{f_X(x)}{\bar F_X(e)}, \quad x > e,$$

rather than $f_X(x)$, where $\bar F_X(e) = \Pr\{X > e\}$.

Such a method provides a length-bias adjusted estimate, but is only able to estimate part of the distribution. Some information is lost in the data, unless $\lambda(t)$ is explicitly modelled.

Call for joint modelling: a transmission model for how the epidemiology generates data and a statistical model for how data are observed.
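The contrast between the naïve and the conditioning approach can be sketched by simulation. The example below makes several simplifying assumptions that are not in the lecture: durations are exponential (so that the conditional density $f_X(x)/\bar F_X(e)$ has a closed-form maximum likelihood estimate, the average of $x - e$), the intensity of the initiating event increases linearly over time rather than being constant, and enrolment is a single snapshot at the end of the observation window. All names are hypothetical.

```python
import math
import random


def simulate_prevalent_cohort(mu, n_events=200_000, horizon=50.0, seed=7):
    """Return (e, x) pairs for individuals still event-free at the snapshot.

    e = time from the initiating event to enrolment, x = full duration.
    Initiating events follow an intensity proportional to t on (0, horizon),
    so the process is deliberately not in equilibrium.
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(n_events):
        t0 = horizon * math.sqrt(rng.random())   # event time, density proportional to t
        x = rng.expovariate(1.0 / mu)            # duration with mean mu
        e = horizon - t0                         # time already elapsed at the snapshot
        if x > e:                                # left-truncation: endpoint not yet reached
            sample.append((e, x))
    return sample


def naive_mean(sample):
    """Treat the observed durations as if they were an ordinary random sample."""
    return sum(x for _, x in sample) / len(sample)


def conditional_mle_mean(sample):
    """MLE of the mean under the conditional density f(x)/S(e), exponential model."""
    return sum(x - e for e, x in sample) / len(sample)


if __name__ == "__main__":
    cohort = simulate_prevalent_cohort(mu=2.0)
    print(f"n enrolled        = {len(cohort)}")
    print(f"naive mean        = {naive_mean(cohort):.3f}")
    print(f"conditional (MLE) = {conditional_mle_mean(cohort):.3f}  (true mean 2.0)")
```

The parametric exponential assumption is what allows the whole distribution to be recovered here; without such an assumption, conditioning only identifies part of the distribution, as noted above.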
Right-truncation: length-bias in favour of observing short durations

Previously: left-truncation, in favour of observing long durations.

Right-truncation is very common in surveillance: the inclusion criterion is the occurrence of the subsequent event prior to the time of data analysis.

Example:
Initiating event = diagnosis of a disease
Subsequent event = the disease is reported and entered into a registry
Objective: assessment of the reporting delay $X$.
Bias: the case has to be reported before the time of analysis; data with short delays are systematically observed.

Reporting delay is a very important issue in all disease surveillance.

[Figure: Annual AIDS incidence in Canada as seen in 1992 and 1999 (vertical axis 0 to 2500). Bars: cases as reported by Dec. 31, 1992 and as reported by Dec. 31, 1999. Lines: reporting delay adjusted trends.]

The reporting delay adjusted trend based on 1992 data was presented in April 1993 along with the AIDS surveillance report. The gap between the reported (bars) and projected (lines) trends implied a long delay between diagnosis and data entry (into the national registry).

Adjustment of reporting delay

$N(t)$ = # cases diagnosed at time $t$ (to be estimated)
$N(t; C)$ = # cases diagnosed at time $t$ and reported by time $C$ (a proportion of $N(t)$)

All we need to do is to estimate this proportion, which is $F_X(C - t) = \Pr\{X \leq C - t\}$. Then:

$$N(t) = \frac{N(t; C)}{F_X(C - t)}.$$

Naïve analysis always leads to severe under-estimation of the reporting delay.

Naïve analysis:
- median reporting delay 1.6 months
- 95% completeness within 14 months

Not-so-naïve analysis:
- median delay approx. 9 months
- 85% completeness within 5 years.

By adequately accounting for right-truncation and other (administrative) processes, useful tools can be developed to reflect the real-time trend and be built into the surveillance.

Other examples of reporting delay in disease surveillance: did we learn the lesson?

SARS outbreak in Toronto, 2003: premature declaration that SARS was over. As it turned out: [figure]
Recall the strong protest against WHO's travel advisory on Apr. 23?

H1N1 during the spring of 2009. May 14: Is the worst over? As it turned out: [figure]

Another example:
Initiating event = HIV infection via transfusion
Subsequent event = onset of AIDS illnesses
Objective: estimate the incubation period $X$.
Data: transfusion-associated AIDS cases, assembled by the U.S. CDC, with transfusion as the only known risk factor and retrospectively ascertained date of infection / transfusion.

• naïve estimation: as if the data came from a random experiment, iid from $F_X(x)$
• not-so-naïve: right-truncated data from the conditional distribution $F_X(x)/F_X(C - t)$

Based on the above data, at C = June 30, 1988:

Brookmeyer and Gail (1994):
• naïve estimation potentially under-estimates the median by 50%, compared with the not-so-naïve analysis (by conditioning)

Kalbfleisch and Lawless (1989):
[Figure: naïve versus not-so-naïve estimates (axis mark 0.5); the not-so-naïve estimate carries uncertainty subject to a constant of proportionality.]
• with analysis by conditioning, the larger the C, the longer are the estimated mean and median
• without knowing the AIDS incidence, there is a loss of information in the data, so that one can only estimate the early part of the incubation period distribution, up to a constant of proportionality.

Lui, et al. (1986), C = April 30, 1985:
• naïve analysis: mean 2.6 years
• conditioning: mean 4.5 years

Lagakos, et al. (1988), C = June 30, 1986:
• conditioning: median 8.5 years

By the 1990s, when large scale multi-center cohort data became available, it turned out that the median incubation period is approximately 10 years.

Kalbfleisch and Lawless (1989), for a general duration:
• with analysis by conditioning, the larger the C, the longer are the estimated mean and median
• without knowing the incidence of the initiating event, there is a loss of information, so that one only estimates the early part of the duration distribution, up to a constant of proportionality.

The above statements are very important for observational data of an emerging disease with respect to retrospectively ascertained durations (incubation period, serial interval, etc.): later analyses suggest longer distributions than earlier analyses.

The underlying disease trend matters. Call for jointly modelling the disease process (e.g. transmission model) and the data generation process.

A general topic related to statistics issues and disease models

Every deterministic compartment model, such as SIR, has a stochastic counterpart.
Example: stochastic versus deterministic models. Assume $R_0 > 1$.

[Figure: two panels comparing stochastic paths with the deterministic curve $I_d(t)$, for n = 1000 and n = 10000 (n = population size), with parameter values 1.5 and 1; in these graphs, $R_0 = 1.5$.]

Deterministic ↔ what must happen:
• a bell-shaped $I_d(t)$ determined by mathematical law.

Stochastic ↔ what might happen:
• even if $R_0 > 1$, there is a positive probability (1/3 in the above cases) that very few transmissions occur, followed by extinction;
• otherwise, after "simmering" for a short random period of time, the outbreak takes off; if $n \to \infty$, the path is bell-shaped and resembles $I_d(t)$, but the origin is random.

Statistical challenges in estimating parameters in transmission models

a. Models are built on unobservable events (e.g. time at infection, the passing of infection from one individual to another, duration of latency (not infectious), duration of infectiousness, duration of immunity, etc.), whereas data are based on observable events (e.g. clinical onset of illness, stages of illness, duration of illness, death, physical recovery, etc.).
b. Observational data are subject to length-bias, size bias, missing values, etc.
c. Some seemingly "large" data (in terms of large population) arise from a single (or few) realization of a random phenomenon (i.e. an outbreak) ... an extremely small "sample size".
d. There is a large number of parameters.

Summary

1. Data in infectious disease studies are often observational; they do not follow the hypothesis-driven design of experiments with repetition and randomization.

Data in most introductory statistics textbooks are from repetition of a random experiment by design. Naïve adaptation of these models and methods may lead to severe bias.

2. Advanced statistical methods involve statistical modelling. While many infectious disease models focus on the epidemiological aspects (the transmission process), statistical models focus on the stochastic mechanism from which data are generated (the data generating process).

Although the transmission process is part of the data generation mechanism, the observer sees data through additional filters, such as data management and administrative processes.

3. The two types of models need to be well integrated. For statistical estimation purposes, even with statistical models (such as conditioning) well intended to capture the length-bias, important information is still lost without modelling the underlying transmission process.

The gap is identified. There is still a lot of work to be done.

4. Conversely, without statistical modelling of the data generation process, important input parameters in mathematical models may be severely biased when based on naïve (or not-so-naïve) statistical methods.

Ditto.
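As a closing illustration of point 4, the reporting-delay adjustment $N(t) = N(t; C)/F_X(C - t)$ from the right-truncation slides can be sketched in a few lines. This is only a sketch under strong assumptions that are not in the lecture: reporting delays are exponential, diagnoses are spread uniformly over the observation window, and the delay rate is fitted by maximizing the conditional likelihood $f_X(x)/F_X(C - t)$ with a simple ternary search. Function names, the cut-off of 24 months and the mean delay of 6 months are all hypothetical.

```python
import math
import random


def simulate_reported_cases(mean_delay, n_cases=20_000, cutoff=24.0, seed=11):
    """Return (t, x) for cases diagnosed at t whose report arrived by `cutoff`."""
    rng = random.Random(seed)
    reported = []
    for _ in range(n_cases):
        t = rng.uniform(0.0, cutoff)               # diagnosis time
        x = rng.expovariate(1.0 / mean_delay)      # reporting delay
        if t + x <= cutoff:                        # right-truncation
            reported.append((t, x))
    return reported


def truncated_loglik(rate, reported, cutoff):
    """Sum of log[f(x) / F(C - t)] for an exponential delay with the given rate."""
    total = 0.0
    for t, x in reported:
        # -expm1(-z) = 1 - exp(-z) = F(C - t), computed accurately for small z
        total += math.log(rate) - rate * x - math.log(-math.expm1(-rate * (cutoff - t)))
    return total


def fit_rate(reported, cutoff, lo=1e-3, hi=5.0, iters=40):
    """Ternary search for the rate maximizing the (concave) conditional log-likelihood."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if truncated_loglik(m1, reported, cutoff) < truncated_loglik(m2, reported, cutoff):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def adjusted_count(reported_count, t, cutoff, rate):
    """N(t) = N(t; C) / F_X(C - t) for an exponential delay distribution."""
    return reported_count / -math.expm1(-rate * (cutoff - t))


if __name__ == "__main__":
    cutoff = 24.0
    cases = simulate_reported_cases(mean_delay=6.0, cutoff=cutoff)
    naive = sum(x for _, x in cases) / len(cases)
    rate = fit_rate(cases, cutoff)
    print(f"naive mean delay       = {naive:.2f} months")
    print(f"conditional mean delay = {1.0 / rate:.2f} months (true 6.0)")
    print(f"50 cases diagnosed 6 months before cut-off -> "
          f"{adjusted_count(50, 18.0, cutoff, rate):.1f} after adjustment")
```

The naïve mean is pulled towards short delays, while the conditional fit recovers a mean close to the simulated one; as cautioned above, this recovery relies on the parametric form and on a cut-off that is long relative to the typical delay.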