The Waiting Time Paradox and biases in infectious diseases

The Waiting Time Paradox and biases in
infectious disease observational data
Ping Yan
Lecture at Summer School on Mathematics of Infectious Diseases Program
Centre for Disease Modelling, York University
Outline
1. Data in infectious disease studies are often observational, not following the sequence
hypothesis → design of experiment → repetition and randomization
2. Advanced statistical methods involve statistical modelling. While many
infectious disease models focus on the epidemiology aspects (transmission
process), statistical models focus on the stochastic mechanism from which
data are generated (data generating process).
3. The two types of models need to be well integrated. For statistical estimation
purposes, even with statistical models (such as conditioning) well intended to
capture the length-bias, important information is still lost without modelling the
underlying transmission process.
4. Conversely, without statistical modelling for the data generation process,
important input parameters in mathematical models may be severely biased
based on naïve (or not-so-naïve) statistical methods.
The Waiting Time Paradox (Feller 1966)
(just to show how classic it is in introductory probability textbooks)
1. Buses arrive at a constant rate $\lambda$; the inter-arrival times $X$ are independently and identically distributed, with mean $\mu$.
2. An “inspector” (or a customer) inspects “at random”, so that the inspection time $t$ is uniformly distributed between the last bus and the next bus.
Question: what is the expected waiting time $W_t$ to the next bus arrival?
Argument 1: The inspection time $t$ is uniformly distributed between two buses, so by symmetry $E[W] = \tfrac{1}{2}\mu$.
Argument 2: Because buses arrive at constant rate $\lambda$, the $X$'s are iid exponentially distributed.
• The “memoryless” property of the exponential distribution implies that the remaining time to the next bus follows the same exponential distribution, thus $E[W] = \mu = 1/\lambda$.
• If $E[W] = \mu$, shouldn’t $E[X^{(B)}] = 2E[W] = 2\mu$?
Paradox! Didn’t we assume that the $X$'s are independently and identically distributed, with mean $\mu$?
The Waiting Time Paradox (Feller 1966)
Question: what is the expected waiting time Wt to the next bus arrival ?
Variation matters
$cv$ = coefficient of variation = $\sqrt{\mathrm{Var}[X]}/\mu$
$X^{(B)}$ = duration from the last bus to the next bus seen by the inspector
The distribution of $X^{(B)}$ is different from that of $X$ (a paradox, as the $X$'s are assumed to be iid).
Length-biased distribution with p.d.f.
$f^{(B)}(x) = x f_X(x)/\mu$, $E[X^{(B)}] = \mu(1 + cv^2) \ge \mu$
The waiting time $W$ has p.d.f.
$f_W(x) = \bar{F}_X(x)/\mu$, where $\bar{F}_X(x) = \Pr\{X > x\}$, and $E[W] = \tfrac{1}{2}\mu(1 + cv^2)$ (symmetry with respect to $X^{(B)}$)
$E[W] = \tfrac{1}{2}\mu$ is correct if there is no variation in inter-arrival times, $cv = 0$ (if buses are as punctual as Swiss trains.)
$E[W] = \mu$ is correct if the variance of inter-arrival times satisfies $cv = 1$ (TTC seems to be worse than this.)
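A quick way to see that variation matters is to simulate the inspector. The sketch below is illustrative only: it assumes gamma-distributed inter-arrival times (so that the coefficient of variation can be dialled in), and the function name `simulate_inspector` and all parameter values are hypothetical.

```python
import bisect
import random

def simulate_inspector(mean, cv, n_gaps=100_000, n_inspections=20_000, seed=1):
    """Simulate random inspections of a bus line.

    ASSUMPTION: inter-arrival times are gamma distributed with the given
    mean and coefficient of variation (cv = 0 means perfectly punctual).
    Returns (average waiting time W, average covering interval X_B).
    """
    rng = random.Random(seed)
    if cv == 0:
        gaps = [float(mean)] * n_gaps
    else:
        shape = 1.0 / cv ** 2                    # gamma: cv = 1 / sqrt(shape)
        gaps = [rng.gammavariate(shape, mean / shape) for _ in range(n_gaps)]

    arrivals = []                                # cumulative arrival times
    t = 0.0
    for g in gaps:
        t += g
        arrivals.append(t)

    total_wait = total_cover = 0.0
    for _ in range(n_inspections):
        u = rng.random() * arrivals[-1]          # inspection time, uniform on the timeline
        i = min(bisect.bisect_right(arrivals, u), n_gaps - 1)  # index of the next bus
        total_wait += arrivals[i] - u            # waiting time W
        total_cover += gaps[i]                   # interval covering the inspection, X_B
    return total_wait / n_inspections, total_cover / n_inspections

if __name__ == "__main__":
    for cv in (0.0, 0.5, 1.0, 1.5):
        w, xb = simulate_inspector(mean=10.0, cv=cv)
        print(f"cv={cv}: W ~ {w:.2f} (theory {0.5 * 10 * (1 + cv ** 2):.2f}), "
              f"X_B ~ {xb:.2f} (theory {10 * (1 + cv ** 2):.2f})")
```

Long intervals occupy more of the timeline, so a uniformly placed inspection lands in them more often; that is all the "paradox" amounts to.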
The Waiting Time Paradox and bias in observational data
A different way of looking at the same problem:
1. Occurrence of the initiating event has constant rate (i.e.
time of occurrence is uniform at any given time interval)
2. The duration X is independent from the random process
that generates the initiating event.
3. The durations $X$ are iid with p.d.f. $f_X(x)$ and mean $\mu$.
4. At a snapshot, only those who have experienced the initiating
event but not the subsequent event are included in a sample,
with observed duration X(B) .
A sample containing only observations made of X(B) is called a “prevalence cohort”.
The distribution from a prevalence cohort corresponds to the p.d.f. and mean
$f^{(B)}(x) = x f_X(x)/\mu$, $E[X^{(B)}] = \mu(1 + cv^2) \ge \mu$,
because those with longer duration have greater chance to be included in data.
Observational data arising in a prevalence cohort
Assume the durations $X$ are iid with p.d.f. $f_X(x)$ and mean $\mu$.
Under equilibrium: the incidence of the initiating event occurs at constant rate.
1. The observed duration is length-biased
$X^{(B)}$ has p.d.f. $f^{(B)}(x) = x f_X(x)/\mu$.
$W$ has p.d.f. $f_W(x) = \bar{F}_X(x)/\mu$, where $\bar{F}_X(x) = \Pr\{X > x\}$.
Naïve estimation for the distribution of X (e.g. incubation time, survival time, etc.)
based on such prevalence cohort data leads to over-estimation.
2. Size biased estimation for prevalence estimate
i.e. the sample is size-biased in favor of cohorts with larger prevalence
prevalence = # or % { individuals experienced the initial event but not the subsequent event }
Under equilibrium,
prevalence = incidence × duration.
The length-bias in observed duration leads to “size-bias” in sampled prevalence.
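Both biases on this slide can be reproduced with a small snapshot simulation. This is a sketch under stated assumptions, not part of the lecture: initiating events are taken uniform over a long window (constant incidence), durations are exponential (so $cv = 1$), and the names `prevalence_cohort` and `average` are hypothetical.

```python
import random

def prevalence_cohort(mean, n_events=200_000, horizon=100.0, seed=2):
    """Return (all durations, durations seen in a snapshot at time `horizon`).

    ASSUMPTIONS: initiating events are uniform on [0, horizon] (constant
    incidence) and durations are exponential with the given mean.
    """
    rng = random.Random(seed)
    all_durations, sampled = [], []
    for _ in range(n_events):
        onset = rng.uniform(0.0, horizon)
        x = rng.expovariate(1.0 / mean)
        all_durations.append(x)
        if onset + x > horizon:      # initiating event occurred, subsequent event not yet
            sampled.append(x)
    return all_durations, sampled

def average(values):
    return sum(values) / len(values)

if __name__ == "__main__":
    mean, n_events, horizon = 2.0, 200_000, 100.0
    everyone, cohort = prevalence_cohort(mean, n_events, horizon)
    print("mean duration, all individuals :", round(average(everyone), 2))
    print("mean duration, prevalence cohort:", round(average(cohort), 2))
    # prevalence = incidence x duration
    print("prevalence:", len(cohort), "vs incidence x duration:", n_events / horizon * mean)
```

With exponential durations the cohort mean should come out near $2\mu$, i.e. $\mu(1 + cv^2)$ with $cv = 1$, while the snapshot count stays close to incidence × mean duration.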
Waiting Time Paradox in disease screening via repeated testing
1. Replacing buses with repeated testing: the inter-testing intervals $X$ are iid with mean $\mu$.
2. Replacing an “inspector” by sero-conversion, which, under equilibrium, has constant rate,
such that given any time interval (between two tests), a sero-conversion may occur and the
sero-conversion time is uniformly distributed in the interval.
$X^{(B)}$ = duration from the last (neg.) test to the next (pos.) test covering a sero-conversion
$X^{(B)}$ has a length-biased distribution: $E[X^{(B)}] = \mu(1 + cv^2) \ge \mu$
The average waiting time from sero-conversion to the next (pos.) test: $E[W] = \tfrac{1}{2}\mu(1 + cv^2)$
If we add an average “window period” $\omega$ from infection to sero-conversion, then, with infection incidence $\lambda$,
the prevalence of infected but not yet tested (queue), prev. = incidence × mean duration, is
$p_u = \lambda\left(\omega + \tfrac{1}{2}\mu(1 + cv^2)\right)$ (under equilibrium conditions)
Keeping $\lambda$ and $\omega$ unchanged, the testing strategy, through $\mu(1 + cv^2)$, determines $p_u$.
Waiting Time Paradox in disease screening via repeated testing
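The prevalence of infected but not yet tested (queue),
$p_u = \lambda\left(\omega + \tfrac{1}{2}\mu(1 + cv^2)\right)$ (under equilibrium conditions),
where $\lambda$ is the infection incidence, $\omega$ the average window period and $\mu$ the mean inter-testing interval.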
Each infected but not yet tested individual (in pu ) may be associated with a cost c to the society
Each test is associated with a cost κ.
Both costs are determined according to different contexts.
Objective: under different scenarios of infection incidence $\lambda$, determine the optimal testing frequency $1/\mu$ so that the queue of infected but untested is reduced to satisfy a cost-effective criterion.
Generally, the larger the incidence rate $\lambda$, the more cost-effective it is to test more frequently.
Cost-effectiveness is compromised if there is large variation between inter-testing
intervals or among individuals.
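To make the trade-off concrete, here is a minimal sketch. The queue formula is the one on the slide; the cost criterion is an assumption introduced only for illustration (cost per unit time = cost of the untested queue plus cost of tests performed at rate one per mean interval), and every function name and number is hypothetical.

```python
import math

def untested_prevalence(incidence, window, mean_interval, cv):
    """Queue of infected but untested: incidence x (window + mean_interval*(1+cv^2)/2)."""
    return incidence * (window + 0.5 * mean_interval * (1.0 + cv ** 2))

def cost_rate(incidence, window, mean_interval, cv, cost_untested, cost_test):
    """ASSUMED criterion: c * p_u + kappa / (mean inter-testing interval)."""
    queue = untested_prevalence(incidence, window, mean_interval, cv)
    return cost_untested * queue + cost_test / mean_interval

def optimal_interval(incidence, cv, cost_untested, cost_test):
    """Mean interval minimising the assumed cost_rate (set its derivative to zero)."""
    return math.sqrt(2.0 * cost_test / (cost_untested * incidence * (1.0 + cv ** 2)))

if __name__ == "__main__":
    for incidence in (0.005, 0.01, 0.02):
        for cv in (0.0, 1.0):
            m = optimal_interval(incidence, cv, cost_untested=100.0, cost_test=1.0)
            print(f"incidence={incidence}, cv={cv}: optimal mean interval ~ {m:.2f}")
```

Under this assumed criterion the optimal mean interval shrinks as incidence grows, and a larger $cv$ raises the minimal achievable cost, in line with the two remarks above.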
The Waiting Time Paradox as seen in R0 formulation
An infected individual produces new infections according to a counting process with intensity $k(x)$.
$N$ = # of infections produced by a typical infectious individual when seeded into an infinitely large susceptible population
$R_0$ = mean value of $N$, which can be expressed as $R_0 = \int_0^\infty k(x)\,dx < \infty$.
The premises:
If $R_0 > 1$, $\exists\, \rho > 0$ such that $\int_0^\infty e^{-\rho x} k(x)\,dx = 1$
($\rho$ = Malthusian number, describing the early exponential growth).
Re-write:
$\int_0^\infty e^{-\rho x}\,\frac{k(x)}{R_0}\,dx = \int_0^\infty e^{-\rho x} g(x)\,dx = R_0^{-1}$, then $R_0 = \frac{1}{L_g(\rho)}$,
where $L_g(\rho)$ = Laplace transform of $g(x) = \frac{k(x)}{R_0}$, satisfying $\int_0^\infty g(x)\,dx = 1$.
Postulate:
$g(x)$ is the p.d.f. of a well defined random variable with epidemiologic meaning.
(Ref: Wallinga and Lipsitch, 2007; Heesterbeek and Roberts, 2007)
The Waiting Time Paradox as seen in R0 formulation
$\int_0^\infty e^{-\rho x}\,\frac{k(x)}{R_0}\,dx = \int_0^\infty e^{-\rho x} g(x)\,dx = R_0^{-1}$, so $R_0 = \frac{1}{L_g(\rho)}$,
where $L_g(\rho)$ = Laplace transform of $g(x) = \frac{k(x)}{R_0}$, satisfying $\int_0^\infty g(x)\,dx = 1$.
Postulate:
$g(x)$ is the p.d.f. of a well defined random variable with epidemiologic meaning.
If true, the tasks:
1. assessing the meaning of this random variable;
2. assessing whether it is observable;
3. if observable, collect data and estimate $g(x)$;
4. estimate $\rho$ separately (usually via curve fitting);
5. evaluate the Laplace transform $L_g(\rho)$, analytically or numerically.
In the above, there is no assumption about the integrand $k(x)$, i.e. the model:
$k(x)$ = instantaneous rate at time $x$.
However, in order to assess 1., we put it into a structured model framework: the SEIR.
The Waiting Time Paradox as seen in R0 formulation
$R_0 = \frac{1}{L_g(\rho)}$, where $L_g(\rho)$ = Laplace transform of $g(x) = \frac{k(x)}{R_0}$, satisfying $\int_0^\infty g(x)\,dx = 1$.
Postulate:
$g(x)$ is the p.d.f. of a well defined random variable with epidemiologic meaning.
Assuming $R_0 > 1$:
1. In the SIR model (S → I → R), with exponentially distributed infectious period,
$R_0 = 1 + \rho/\gamma = 1 + \rho\mu_I$, where $\gamma = 1/\mu_I$ is the recovery rate.
$L_g(\rho) = (1 + \rho\mu_I)^{-1}$ is the Laplace transform of the exponential distribution with mean $\mu_I$.
In this case, $g(x)$ is the p.d.f. of the infectious period.
2. In the SEIR model (S → E → I → R), with both the latent period and the infectious period being exponentially distributed,
$R_0 = (1 + \rho/\alpha)(1 + \rho/\gamma) = (1 + \rho\mu_E)(1 + \rho\mu_I)$, where $\alpha = 1/\mu_E$ is the rate of leaving the latent state.
$L_g(\rho) = (1 + \rho\mu_E)^{-1}(1 + \rho\mu_I)^{-1}$ = product of two Laplace transforms of exponential distributions.
In this case, $g(x)$ is the p.d.f. of the sum of the latent period and the infectious period.
Anderson and May (1991): generation time = latent period + infectious period.
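Task 5 ("evaluate the Laplace transform numerically") can be sketched for these two cases. The code below is illustrative: it uses a plain trapezoidal rule, hypothetical parameter values, and assumes the latent and infectious means differ so that the closed-form density of their sum applies.

```python
import math

def laplace_transform(pdf, rho, upper=100.0, steps=100_000):
    """Trapezoidal approximation of the integral of exp(-rho*x)*pdf(x) over [0, upper]."""
    h = upper / steps
    total = 0.5 * (pdf(0.0) + math.exp(-rho * upper) * pdf(upper))
    for i in range(1, steps):
        x = i * h
        total += math.exp(-rho * x) * pdf(x)
    return total * h

def exponential_pdf(mean):
    """g(x) for the SIR case: exponential infectious period."""
    return lambda x: math.exp(-x / mean) / mean

def sum_of_exponentials_pdf(mean_e, mean_i):
    """g(x) for the SEIR case: latent + infectious period (requires mean_e != mean_i)."""
    return lambda x: (math.exp(-x / mean_i) - math.exp(-x / mean_e)) / (mean_i - mean_e)

# Hypothetical values: growth rate 0.2 per day, mean latent 2 days, mean infectious 3 days.
rho, mu_e, mu_i = 0.2, 2.0, 3.0
r0_sir = 1.0 / laplace_transform(exponential_pdf(mu_i), rho)
r0_seir = 1.0 / laplace_transform(sum_of_exponentials_pdf(mu_e, mu_i), rho)
print("SIR :", r0_sir, "closed form:", 1 + rho * mu_i)
print("SEIR:", r0_seir, "closed form:", (1 + rho * mu_e) * (1 + rho * mu_i))
```

The numerical values should agree with the closed forms to several decimal places; the same routine works for any $g(x)$ that can be evaluated pointwise.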
The Waiting Time Paradox as seen in R0 formulation
$R_0 = \frac{1}{L_g(\rho)}$, where $L_g(\rho)$ = Laplace transform of $g(x) = \frac{k(x)}{R_0}$, satisfying $\int_0^\infty g(x)\,dx = 1$.
Postulate:
$g(x)$ is the p.d.f. of a well defined random variable with epidemiologic meaning.
In the SEIR model, if the latent period and the infectious period are arbitrarily distributed (with specific distributions),
$R_0 = \frac{\rho\mu_I}{L_E(\rho)\,[1 - L_I(\rho)]}$ (Yan, 2007)
$L_E(s)$ = Laplace transform for the latent period $T_E$
$L_I(s)$ = Laplace transform for the infectious period $T_I$
including:
$R_0 = 1 + \rho\mu_I$ (no latent period, exponentially distributed infectious period)
$R_0 = (1 + \rho\mu_E)(1 + \rho\mu_I)$ (exponentially distributed latent and infectious periods)
$R_0 = \frac{\rho\mu_I\,(1 + \theta_E\rho\mu_E)^{1/\theta_E}}{1 - (1 + \theta_I\rho\mu_I)^{-1/\theta_I}}$ (gamma distributed latent and infectious periods, Anderson and Watson, 1980)
where $\mu_E, \mu_I$ are the mean values of the latent and infectious periods
and $\theta_E, \theta_I$ are coefficient of variation parameters (in this parametrization, $\theta = cv^2$ for each period).
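The general formula is easy to evaluate once the two Laplace transforms are available. The sketch below assumes gamma distributed periods parametrized by mean and squared coefficient of variation (with zero treated as a fixed duration); the function names and numbers are hypothetical.

```python
import math

def gamma_laplace(s, mean, cv2):
    """Laplace transform of a gamma distribution with given mean and squared cv.

    cv2 = 1 is the exponential distribution; cv2 = 0 is treated as a fixed duration.
    """
    if cv2 == 0:
        return math.exp(-s * mean)
    return math.exp(-math.log1p(cv2 * s * mean) / cv2)   # (1 + cv2*s*mean)^(-1/cv2)

def r0_seir(rho, mu_e, mu_i, cv2_e=1.0, cv2_i=1.0):
    """R0 = rho*mu_I / (L_E(rho) * [1 - L_I(rho)]), assuming gamma distributed periods."""
    l_e = gamma_laplace(rho, mu_e, cv2_e)
    l_i = gamma_laplace(rho, mu_i, cv2_i)
    return rho * mu_i / (l_e * (1.0 - l_i))

if __name__ == "__main__":
    rho, mu_e, mu_i = 0.2, 2.0, 3.0                       # hypothetical values
    print("no latent period, exponential:", r0_seir(rho, 0.0, mu_i))
    print("exponential latent + infectious:", r0_seir(rho, mu_e, mu_i))
    for cv2 in (0.0, 0.25, 1.0, 2.0):
        print(f"cv^2 = {cv2} for both periods:", round(r0_seir(rho, mu_e, mu_i, cv2, cv2), 3))
```

For the same growth rate and the same means, the implied $R_0$ changes with the assumed variability, which is why the distributional assumptions cannot be ignored when converting $\rho$ into $R_0$.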
The Waiting Time Paradox as seen in R0 formulation
$R_0 = \frac{1}{L_g(\rho)}$, where $L_g(\rho)$ = Laplace transform of $g(x) = \frac{k(x)}{R_0}$, satisfying $\int_0^\infty g(x)\,dx = 1$.
Postulate:
$g(x)$ is the p.d.f. of a well defined random variable with epidemiologic meaning.
In the SEIR model, if the latent period and the infectious period are arbitrarily distributed,
$R_0 = \frac{\rho\mu_I}{L_E(\rho)\,[1 - L_I(\rho)]}$, so that
$L_g(\rho) = L_E(\rho)\,\frac{1 - L_I(\rho)}{\rho\mu_I} = \left[\int_0^\infty e^{-\rho x} f_E(x)\,dx\right]\left[\int_0^\infty e^{-\rho x}\,\frac{\bar{F}_I(x)}{\mu_I}\,dx\right]$
$f_E(x)$: p.d.f. of the latent period $T_E$
$\bar{F}_I(x)/\mu_I$: p.d.f. of $W$ in the length-biased infectious period, from a snapshot point of view
$g(x)$ is the p.d.f. of $T_E + W$, with mean value $\mu_E + \tfrac{1}{2}\mu_I(1 + cv^2)$
Call it generation time:
• if $cv = 1$, consistent with that by Anderson and May (1991);
• if $cv = 0$, consistent with that by Gani and Daly (2001): mean latent period + half of the mean infectious period
Fine (2003): the latent period + part of the infectious period ….
• not exactly, need to emphasize the length-biased infectious period
• could be even longer than the “natural” infectious period if $cv > 1$.
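The factorization of $L_g(\rho)$ rests on one integration by parts, and the mean of $W$ on the second moment of the infectious period. A sketch of both steps, assuming $T_I$ has p.d.f. $f_I$, survival function $\bar{F}_I$ and a finite second moment:

```latex
\begin{aligned}
\int_0^\infty e^{-\rho x}\,\bar{F}_I(x)\,dx
  &= \Big[-\tfrac{1}{\rho}\,e^{-\rho x}\,\bar{F}_I(x)\Big]_0^\infty
     - \tfrac{1}{\rho}\int_0^\infty e^{-\rho x} f_I(x)\,dx
   = \frac{1 - L_I(\rho)}{\rho}, \\
E[W] &= \int_0^\infty x\,\frac{\bar{F}_I(x)}{\mu_I}\,dx
   = \frac{E[T_I^2]}{2\mu_I}
   = \tfrac{1}{2}\,\mu_I\,(1 + cv^2).
\end{aligned}
```

Dividing the first line by $\mu_I$ gives the Laplace transform of the p.d.f. of $W$, which is the second factor of $L_g(\rho)$.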
The Waiting Time Paradox as seen in R0 formulation
$R_0 = \frac{1}{L_g(\rho)}$, where $L_g(\rho)$ = Laplace transform of $g(x) = \frac{k(x)}{R_0}$, satisfying $\int_0^\infty g(x)\,dx = 1$.
Postulate:
$g(x)$ is the p.d.f. of a well defined random variable with epidemiologic meaning.
If true, the tasks:
1. assessing the meaning of this random variable;
2. assessing whether it is observable;
3. if observable, collect data and estimate $g(x)$;
4. estimate $\rho$ separately (usually via curve fitting);
5. evaluate the Laplace transform $L_g(\rho)$, analytically or numerically.
For 1. above,
$g(x)$ is the p.d.f. of the generation time, defined as the latent period plus part of the length-biased infectious period, with mean value $\mu_E + \tfrac{1}{2}\mu_I(1 + cv^2)$
In the above definition, the generation time does not involve:
• another individual
• the transmission process
The “snapshot” may be thought of as the time of infection of an infectee in relation to the infectious period of its infector; whereas in theory, it could be a snapshot by any “inspector”.
The Waiting Time Paradox as seen in R0 formulation
$R_0 = \frac{1}{L_g(\rho)}$, where $L_g(\rho)$ = Laplace transform of $g(x) = \frac{k(x)}{R_0}$, satisfying $\int_0^\infty g(x)\,dx = 1$.
$g(x)$ is the p.d.f. of the generation time, defined as the latent period plus part of the length-biased infectious period, with mean value $\mu_E + \tfrac{1}{2}\mu_I(1 + cv^2)$
From Wallinga and Lipsitch (2007):
Svensson (2007) made the distinction:
i. from the infection time of the infector looking forward to the infection time of the infectee;
ii. from the infection time of the infectee looking back to the infection time of the infector.
Seems like, if we assign the “snapshot” as the time of infection of an infectee in relation to
the infectious period of its infector, then the generation interval in Wallinga and Lipsitch
(2007) should be understood in the sense of (ii) in Svensson (2007).
But …. there are strings attached ….
The Waiting Time Paradox as seen in R0 formulation
Svensson (2007):
ii. from the infection time of the infectee looking back to the infection time of the infector.
If we associate $T_E + W$ with the generation interval in Wallinga and Lipsitch (2007) and understand it in the sense of (ii) in Svensson (2007), there are hidden assumptions.
• The infection times of infectees must be exchangeable so that any randomly chosen infectee (if more than one), while looking back, gives the same distribution for $T_E + W$.
• The infectious period contains $\ge 1$ infectee, hence is length-biased, with mean $\mu_I(1 + cv^2)$.
• The infection times of infectees must be uniformly distributed in the (length-biased) infectious period so that $W$ has p.d.f. $\bar{F}_I(x)/\mu_I$ with mean $\tfrac{1}{2}\mu_I(1 + cv^2)$, i.e. symmetry.
• The system is at equilibrium so that infectors arrive at constant rate.
Things that I don’t understand:
$R_0 = \frac{1}{L_g(\rho)}$, $L_g(\rho)$ = Laplace transform of $T_E + W$, with mean $\mu_E + \tfrac{1}{2}\mu_I(1 + cv^2)$
1. valid interpretation at equilibrium
2. $\rho$ is the Malthusian number, far from equilibrium
This puzzle further leads to the observation problem: can we collect data at the
early phase of an outbreak and use the above theory ?
A generalization of the Waiting Time Paradox: left-truncation
Moving away from $R_0 = 1/L_g(\rho)$ to general observation bias without being in equilibrium
The initial event occurs over time $t$ following a random process with intensity $\lambda(t)$.
• Individuals who have experienced the initial event are enrolled at time $t + E$
• Individuals are followed until an endpoint event, taking place at time $t + X^{(B)}$
Previously (the Waiting Time Paradox),
• assumed equilibrium $\lambda(t) = \lambda$;
• called “enrolment” a “snapshot”, assumed uniformly distributed in any fixed time interval;
• the time from initial event to the observed endpoint, $X^{(B)}$, follows the p.d.f. $x f_X(x)/\mu$;
• the time from initial event to enrolment, $E$, follows the distribution with p.d.f. $\bar{F}_X(e)/\mu$.
Generalization
• $\lambda(t)$ is not constant;
• enrolment is random, independent from the random process of the initiating event.
This observation scheme is subject to left-truncation.
The same issue: the observed X(B) is length-biased.
A generalization of the Waiting Time Paradox: left-truncation
The objective: estimating the distribution of the duration X between the two events.
The observed X(B) is length-biased: in favor of longer durations.
Naïve analyses: treating $X^{(B)}$ as if it were $X$ from designed experiments leads to over-estimation.
Not-so-naïve method through conditioning:
$X^{(B)}$ arises from the conditional distribution of $X$ given $X > E$, because the eligibility for enrolment is not having experienced the endpoint event at $t + E$.
Statistical methods are based on the conditional distribution $f_X(x)/\bar{F}_X(e)$ rather than $f_X(x)$, where $\bar{F}_X(e) = \Pr\{X > e\}$.
Such a method provides a length-bias adjusted estimation, but is only able to estimate part of the distribution. Some information is lost in the data, unless $\lambda(t)$ is explicitly modelled.
Call for joint modelling:
a transmission model for how the epidemiology generates data and a statistical model for how data are observed.
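A toy comparison of the naïve and conditional analyses under left-truncation. This is a sketch under strong assumptions that are not in the lecture: durations are exponential (so the conditional likelihood has a closed-form maximiser thanks to memorylessness), enrolment delays are uniform, and all names and numbers are hypothetical.

```python
import random

def left_truncated_sample(mean, n=100_000, max_delay=5.0, seed=3):
    """Simulate (duration x, enrolment delay e) pairs, keeping only those with x > e.

    ASSUMPTIONS: exponential durations; enrolment delay uniform on [0, max_delay],
    independent of the duration.
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = rng.expovariate(1.0 / mean)
        e = rng.uniform(0.0, max_delay)
        if x > e:                       # endpoint not yet experienced at enrolment
            sample.append((x, e))
    return sample

def naive_mean(sample):
    """Treats the observed durations as if they were an ordinary iid sample."""
    return sum(x for x, _ in sample) / len(sample)

def conditional_mean_exponential(sample):
    """Maximiser of the product of f_X(x)/F_bar_X(e) when X is exponential:
    the conditional density is exponential in (x - e)."""
    return sum(x - e for x, e in sample) / len(sample)

if __name__ == "__main__":
    data = left_truncated_sample(mean=2.0)
    print("true mean 2.0, naive:", round(naive_mean(data), 2),
          "conditional:", round(conditional_mean_exponential(data), 2))
```

The parametric assumption is what lets conditioning recover the whole distribution here; without it, as noted above, only part of the distribution is estimable.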
Right-truncation: length-bias in favour of observing short durations
Previously, left-truncation, in favour of observing long durations
Very common in surveillance: the inclusion criterion is the occurrence of the
subsequent event prior to the time of data analysis.
Example:
Initiating event = diagnosis of a disease
Subsequent event = the disease is reported and entered into a registry
Objective: assessment of the reporting delay X.
Bias: the case has to be reported before the time at analysis;
systematically observing data with short delay.
Right-truncation: length-bias in favour of observing short durations
Reporting delay is a very important issue in all disease surveillance
[Figure: Annual AIDS incidence in Canada as seen in 1992 and 1999. Bars: cases as reported by Dec. 31, 1992 and as reported by Dec. 31, 1999; lines: reporting delay adjusted trends. Vertical axis from 0 to 2500.]
The reporting delay adjusted trend based on 1992 data was presented in April 1993 along with the AIDS surveillance report.
The gap between reported (bars) and projected (lines) trends implied a long delay between diagnosis and data entry (into the national registry).
Adjustment of reporting delay
$N(t)$ = # cases diagnosed at time $t$ (to be estimated)
$N(t; C)$ = # cases diagnosed at time $t$ and reported by time $C$ (as a proportion of $N(t)$)
All we need to do is to estimate this proportion, which is $F_X(C - t) = \Pr\{X \le C - t\}$.
Then: $N(t) = \frac{N(t; C)}{F_X(C - t)}$
Naïve analysis always leads to severe under-estimation of reporting delay.
Naïve analysis:
− median reporting delay 1.6 months
− 95% completeness within 14 months
Not-so-naïve analysis:
− median delay approx. 9 months
− 85% completeness within 5 years.
Adequately accounting for right-truncation and other administrative processes, useful tools can be developed to reflect real-time trends and be built into the surveillance.
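Once the delay distribution has been estimated, the adjustment itself is a one-line division. The sketch below assumes the delay c.d.f. is supplied as a function; the exponential-type delay and the counts in the example are hypothetical illustrations, not the Canadian AIDS data.

```python
def adjust_for_reporting_delay(reported, analysis_time, delay_cdf):
    """Estimate N(t) = N(t; C) / F_X(C - t) for each diagnosis time t.

    reported      : dict mapping diagnosis time t to the count N(t; C)
    analysis_time : C, the time at which the data are analysed
    delay_cdf     : function d -> Pr{X <= d} for the reporting delay X (assumed known)
    """
    adjusted = {}
    for t, count in reported.items():
        proportion = delay_cdf(analysis_time - t)
        if proportion <= 0.0:
            raise ValueError(f"no case diagnosed at t={t} can be reported by C yet")
        adjusted[t] = count / proportion
    return adjusted

def exponential_delay_cdf(median):
    """Hypothetical delay distribution: exponential with the given median."""
    return lambda d: (1.0 - 0.5 ** (d / median)) if d > 0 else 0.0

if __name__ == "__main__":
    # Made-up counts by year of diagnosis, analysed at C = 1993.0,
    # with an assumed median delay of 0.75 years (9 months).
    counts = {1990.0: 1000, 1991.0: 900, 1992.0: 400}
    for year, n in adjust_for_reporting_delay(counts, 1993.0, exponential_delay_cdf(0.75)).items():
        print(year, round(n))
```

The most recent diagnosis periods receive the largest inflation factors, which is why they also carry the largest uncertainty in practice.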
Other examples of reporting delay in disease surveillance: did we learn the lesson?
SARS outbreak in Toronto, 2003
Premature declaration that SARS was over
As it turned out:
Recall the strong protest against WHO’s travel advisory on Apr. 23?
H1N1 during the spring of 2009
May 14: Is the worst over?
As it turned out:
Right-truncation: length-bias in favour of observing short durations
Another example:
Initiating event = HIV infection via transfusion
Subsequent event = onset of AIDS illnesses
Objective: estimate the incubation period X.
Data: transfusion-associated AIDS cases, assembled by the U.S. CDC, with transfusion as the only known risk factor and retrospectively ascertained date of infection / transfusion
• naïve estimation: as if data were from a random experiment, iid $F_X(x)$
• not-so-naïve: right-truncation, data from the conditional distribution $F_X(x)/F_X(C - t)$
Based on the above data, at $C$ = June 30, 1988
Brookmeyer and Gail (1994):
• naïve estimation potentially under-estimates the median by 50%, compared with the not-so-naïve analysis (by conditioning)
[Figure: estimated incubation distribution, naïve versus not-so-naïve; the not-so-naïve estimate carries uncertainty subject to a constant of proportionality.]
Kalbfleisch and Lawless (1989):
• with analysis by conditioning, the larger the $C$, the longer are the estimated mean and median
• without knowing the AIDS incidence, there is a loss of information in the data, so that one can only estimate, up to a constant of proportionality, the early part of the incubation period distribution.
Right-truncation: length-bias in favour of observing short durations
Lui, et al. (1986): $C$ = April 30, 1985
• naïve analysis: mean $\mu = 2.6$ years
• conditioning: mean $\mu = 4.5$ years
Kalbfleisch and Lawless (1989):
• with analysis by conditioning, the larger the $C$, the longer are the estimated mean and median
Lagakos, et al. (1988): $C$ = June 30, 1986
• conditioning: median 8.5 years
By the 1990s when large scale multi-center cohort data became available, it turned out
that the median incubation period is approximately 10 years.
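The direction of the right-truncation bias, and what conditioning buys, can be seen in a toy simulation. This is a sketch only: it assumes exponential durations, initiating events uniform over the observation window, and a ternary search for the conditional maximum likelihood estimate; none of this reproduces the cited analyses, and all names and values are hypothetical.

```python
import math
import random

def right_truncated_sample(mean, n=10_000, analysis_time=10.0, seed=4):
    """Simulate (duration x, truncation bound C - t), keeping cases with t + x <= C.

    ASSUMPTIONS: initiating times t uniform on [0, C]; exponential durations.
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        t = rng.uniform(0.0, analysis_time)
        x = rng.expovariate(1.0 / mean)
        if 0.0 < x <= analysis_time - t:        # subsequent event observed by time C
            sample.append((x, analysis_time - t))
    return sample

def conditional_loglik(rate, sample):
    """Log-likelihood based on f_X(x) / F_X(C - t) for exponential X."""
    return sum(math.log(rate) - rate * x - math.log(-math.expm1(-rate * u))
               for x, u in sample)

def conditional_mean_estimate(sample, lo=1e-3, hi=10.0, iterations=40):
    """Maximise the (concave in rate) conditional log-likelihood by ternary search."""
    for _ in range(iterations):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if conditional_loglik(a, sample) < conditional_loglik(b, sample):
            lo = a
        else:
            hi = b
    return 2.0 / (lo + hi)                       # mean = 1 / rate

if __name__ == "__main__":
    data = right_truncated_sample(mean=3.0)
    naive = sum(x for x, _ in data) / len(data)
    print("true mean 3.0, naive:", round(naive, 2),
          "conditional:", round(conditional_mean_estimate(data), 2))
```

As with left-truncation, it is the parametric form that makes the full mean recoverable in this toy case; when the window $C$ is short relative to the true durations, the conditional likelihood becomes very flat, which is the loss of information noted by Kalbfleisch and Lawless.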
Right-truncation: length-bias in favour of observing short durations
Kalbfleisch and Lawless (1989):
•
with analysis by conditioning, the larger the C, the longer are the estimated mean and median
•
without knowing the incidence of the initiating event, there is a loss of information that one
only estimates up to a constant of proportionality the early part of the duration distribution.
The above statements are very important for observational data of an emerging disease: with
respect to retrospectively ascertained durations (incubation period, serial interval, etc.), later
analyses suggest longer distributions than earlier analyses.
The underlying disease trend matters.
Call for jointly modelling the disease process (e.g. transmission model) and the data
generation process.
A general topic related to statistics issues and disease models
Every deterministic compartment model, such as SIR, has a stochastic counterpart.
Example: stochastic versus deterministic models for
Assume $R_0 = \beta/\gamma > 1$:
[Figure: two panels labelled $I_d(t)$, one for $n = 1000$ and one for $n = 10000$, both with $\beta = 1.5$, $\gamma = 1$, so that $R_0 = 1.5$ in these graphs.]
$n$ = population size
Deterministic ↔ what must happen:
• a bell-shaped $I_d(t)$ determined by mathematical law.
Stochastic ↔ what might happen:
• even if $R_0 > 1$, there is a positive probability (1/3 in the above cases) that very few transmissions occur, followed by extinction;
• otherwise, after “simmering” for a short random period of time, it takes off;
− if $n \to \infty$, the path is bell-shaped, resembling $I_d(t)$, but the origin is random.
A general topic related to statistics issues and disease models
Statistical challenges in estimating parameters in transmission models
a. models are built on unobservable events (e.g. time at infection, the passing of
infection from one individual to another, duration of latency (not infectious),
duration of infectiousness, duration of immunity, etc.)
•
data are based on observable events (e.g. clinical onset of illness, stages of
illness, duration of illness, death, physical recovery, etc.)
b. Observational data subject to length-bias, size bias, missing values, etc.
c. some seemingly “large” data (in terms of large population) arise from a single (or a few) realization(s) of a random phenomenon (i.e. an outbreak) … an extremely small “sample size”
[Figure: the two panels labelled $I_d(t)$, for $n = 1000$ and $n = 10000$, both with $\beta = 1.5$, $\gamma = 1$.]
d. large number of parameters
Summary
1. Data in infectious disease studies are often observational, not following the sequence
hypothesis → design of experiment → repetition and randomization
Data in most introductory statistics textbooks are from repetition of random experiment
by design. Naïve adaptation of these models and methods may lead to severe bias.
2. Advanced statistical methods involve statistical modelling. While many
infectious disease models focus on the epidemiology aspects (transmission
process), statistical models focus on the stochastic mechanism from which
data are generated (data generating process).
Although transmission process is part of the data generation mechanism, the observer sees
data through additional filters, such as data management and administrative processes.
3. The two types of models need to be well integrated. For statistical estimation
purposes, even with statistical models (such as conditioning) well intended to
capture the length-bias, important information is still lost without modelling the
underlying transmission process.
The gap is identified. Much work still needs to be done.
4. Conversely, without statistical modelling for the data generation process,
important input parameters in mathematical models may be severely biased
based on naïve (or not-so-naïve) statistical methods.
Ditto.