Scan Statistic

Summary of “A Spatial Scan
Statistic” by M. Kulldorff
Presented by Gauri S. Datta
gauri@stat.uga.edu
Mid-Year Meeting
February 3, 2006
Background
• Scan Statistic
– A tool to detect clusters in a point process
– Naus (1965, JASA) studied it in one dimension
– Tests whether a 1-dim point process is purely random
• Point Process
– Consider a time interval [a,b] and a window
A=[t,t+w] of fixed width w
– µ(A) = # of e-mails that arrived in the time window A
– n(A) ≡ nA = # of junk e-mails = number of “points”
– Arrival times of junk e-mails define a “Point
Process”
Main Idea in Scan Statistic
• Move a window [t,t+w] of size w < b-a over
a time interval [a,b]
• Over all possible values of t, record the
maximum number of points in the window
• Compare this number with cutoff points
under the hypothesis of a purely random
Poisson process
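The fixed-width scan described above can be sketched in a few lines. This is a minimal illustration, not code from the paper; the function name scan_statistic_1d is hypothetical, and the window is taken to be closed, [t, t+w].

```python
from bisect import bisect_right

def scan_statistic_1d(points, w):
    """Maximum number of points in any closed window [t, t + w]."""
    pts = sorted(points)
    best = 0
    # The maximum is attained by some window whose left edge sits on an
    # observed point, so only those values of t need to be tried.
    for i, t in enumerate(pts):
        j = bisect_right(pts, t + w)  # first index with pts[j] > t + w
        best = max(best, j - i)       # pts[i], ..., pts[j-1] lie in the window
    return best
```

The returned count is the quantity that would then be compared with cutoff points under the purely random hypothesis.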
Building block of Scan Test
• Repeated use of tests for equality of two
Binomial or Poisson populations
• Two populations are defined by the
scanning window A and its complement A^c
• As in multiple comparisons, these tests are
dependent as one moves the scanning
window
Spatial Scan Statistic (SSS)
• Kulldorff (1997) used SSS to detect
clusters in a spatial process
• SSS can be used
– For multi-dimensional point processes
– With variable window size
– With an inhomogeneous Poisson process or a
Bernoulli process as the baseline process
SSS (continued)
– Scanning window can be any predefined
shape
– SSS is defined on a geographical space G with a
measure µ
– In a traditional point process, G is a line and µ is a
uniform measure
– In 2-dim, G is a plane and µ is Lebesgue measure
Examples
• Forestry:
– Spatial clustering of trees.
– Want to look for clusters of a specific kind of
tree after adjusting for the uneven spatial
distribution of all trees
– µ(A) = total # of trees in region A
– nA = # of trees in A of the specific kind
Examples (continued)
• Epidemiology
– Interest in detecting geographical clusters of
disease
– Need to adjust for uneven population density
• Rural vs. urban population
– For data aggregated into census districts,
measure is concentrated at the central
coordinates of districts
Examples (continued)
• If interest is in space-time clusters of a
disease, the measure will still be
concentrated in the geographical region as
in the prior example
• Adjusting for uneven population
distribution is not always enough. One should
take confounding factors into account.
E.g., in epidemiology the measure can reflect
the standardized expected incidence rate
SS = LR statistic
• For a fixed size window, scan statistic is
the maximum # of points in the window at
any given time/geographical region
• Test statistic is equivalent to the LR test statistic
for testing H0: p1 = p2 vs. Ha: p1 > p2 (rates inside
and outside the window)
• Generalization to the LR test is important for
a variable window
Generalized SS: Notation/Models
• G= Geographical area / study space
• A = Window ⊂ G
• N(A)= Random # of points in A
– A spatial point process
• Goal: find the most prominent cluster
• Two useful models for point process
– (a) Bernoulli model
– (b) Poisson model
Standard Models for SS
• For the Bernoulli model, the measure µ is such
that µ(A) is an integer for all subsets A of
G
– Two states (disease “point” or no disease) for
each unit
• Locations of the points define a point
process
LR Test: Bernoulli Model
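As a sketch of the Bernoulli likelihood ratio for a single zone Z, using the standard form from Kulldorff (1997): the rates are estimated by nZ/µ(Z) inside Z and (nG − nZ)/(µ(G) − µ(Z)) outside, and the ratio is 1 when the inside rate does not exceed the outside rate. The function names below are hypothetical, and the code assumes 0 < µ(Z) < µ(G).

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def bernoulli_log_lr(n_z, mu_z, n_g, mu_g):
    """Log likelihood ratio for one zone Z under the Bernoulli model.

    n_z, mu_z: number of points and measure inside Z
    n_g, mu_g: number of points and measure in the whole study space G
    """
    p_hat = n_z / mu_z                    # estimated rate inside Z
    q_hat = (n_g - n_z) / (mu_g - mu_z)   # estimated rate outside Z
    if p_hat <= q_hat:
        return 0.0                        # no excess inside Z: LR = 1
    log_l_z = (_xlogy(n_z, p_hat) + _xlogy(mu_z - n_z, 1 - p_hat)
               + _xlogy(n_g - n_z, q_hat)
               + _xlogy((mu_g - mu_z) - (n_g - n_z), 1 - q_hat))
    r = n_g / mu_g                        # common rate under H0
    log_l_0 = _xlogy(n_g, r) + _xlogy(mu_g - n_g, 1 - r)
    return log_l_z - log_l_0
```

The scan statistic is then the maximum of this quantity over all candidate zones.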
Poisson Model
• Under the Poisson model, points are generated by
an inhomogeneous Poisson process. There is exactly one
zone Z ⊂ G s.t. N(A) ~ Po(pµ(A∩Z) +
qµ(A∩Z^c)) for all A.
• Null hypothesis H0: p = q
• Alternative hypothesis H1: p > q, Z ∈ 𝒵 (the collection of candidate zones)
• Under H0, N(A) ~ Po(pµ(A)) for all A.
– The parameter Z disappears under H0
Poisson Model (continued)
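A sketch of the Poisson likelihood ratio for a single zone Z, conditional on the total count nG and using the standard form from Kulldorff (1997): LR(Z) = (nZ/µ(Z))^nZ · ((nG − nZ)/(µ(G) − µ(Z)))^(nG − nZ) / (nG/µ(G))^nG when the rate inside Z exceeds the rate outside, and 1 otherwise. The function name is hypothetical, and the code assumes 0 < µ(Z) < µ(G).

```python
import math

def poisson_log_lr(n_z, mu_z, n_g, mu_g):
    """Log likelihood ratio for one zone Z under the Poisson model."""
    n_out = n_g - n_z
    mu_out = mu_g - mu_z
    if n_z / mu_z <= n_out / mu_out:
        return 0.0  # rate inside Z does not exceed rate outside: LR = 1
    inside = n_z * math.log(n_z / mu_z)
    # Convention 0 * log(0) = 0 when no points fall outside Z.
    outside = n_out * math.log(n_out / mu_out) if n_out > 0 else 0.0
    null = n_g * math.log(n_g / mu_g)
    return inside + outside - null
```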
Choice of Zones
• How is the collection of zones 𝒵 selected? Possibilities:
(1) All circular subsets
(2) All circles centered at any of several foci on
a fixed grid, with a possible upper limit on
size
(3) Same as (2) but with a fixed size
(4) All rectangles of fixed size and shape
(5) If looking for space-time clusters, use
“cylinders” scanning circular geographical
areas over variable time intervals
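Option (2) can be sketched for aggregated data as follows: for each focus, sort the data locations by distance and grow the circle one location at a time until an upper limit on size is reached. This is an illustration under assumptions: the data locations double as the foci, the size limit is expressed as a fraction of the total measure (the hypothetical parameter max_fraction), and ties in distance are not treated specially.

```python
import math

def circular_zones(coords, measure, max_fraction=0.5):
    """Yield (focus_index, member_indices) for circles grown around each focus.

    coords:  list of (x, y) locations, used both as foci and as data locations
    measure: list of measure values, one per location
    """
    total = sum(measure)
    for c, (cx, cy) in enumerate(coords):
        # The content of the circle changes only when its radius reaches
        # the next-nearest location, so sorting by distance is enough.
        order = sorted(
            range(len(coords)),
            key=lambda i: math.hypot(coords[i][0] - cx, coords[i][1] - cy),
        )
        members, mass = [], 0.0
        for i in order:
            mass += measure[i]
            if mass > max_fraction * total:
                break  # upper limit on zone size reached
            members.append(i)
            yield c, tuple(members)
```

Each yielded zone is a candidate for which the likelihood ratio would be evaluated.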
Bernoulli vs. Poisson Model
• Choice between a Bernoulli or Poisson
model does not matter much if
n(G) << µ(G)
• In other cases, use the model most
appropriate for the application
A Useful Result
An important result on the most likely cluster
under these models is given in the
paper. It states that as long as the points
within the zone constituting the most likely
cluster stay where they are, H0
will be rejected irrespective of the locations of the other
points in G. If a cluster is located in
Seattle, the locations of the points on the East
Coast of the U.S. do not matter (Theorem 1)
Computations and MC
• To find the value of the test statistic λ, we need to
calculate the LR maximized over the collection of
zones in H1. This seems like a daunting task,
since the # of zones could be infinite.
• But the # of observed points is finite
• For a fixed # of points, the likelihood
decreases as µ(Z) increases
Computations (cont’d)
• If the circle size increases for a fixed focus, we
need to recalculate the likelihood only when a
new point enters the circle. For a finite number of
points, the # of likelihood recalculations for each
focus is finite.
• The distribution of λ is difficult to derive. MC simulation
is used to generate a histogram of λ: under
H0, replicate the data sets conditional on
nG.
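The Monte Carlo step can be sketched for aggregated data as follows. Conditional on nG, each simulated point is assigned to cell i with probability µi/µ(G), the maximum log LR over the zones is recomputed, and the p-value is taken from the rank of the observed value among the replicates. All names are hypothetical, zones are assumed to be tuples of cell indices with positive measure, and the Poisson form of the LR is used.

```python
import math
import random

def log_lr(n_z, mu_z, n_g, mu_g):
    """Poisson-model log likelihood ratio for one zone (0 if no excess)."""
    n_out, mu_out = n_g - n_z, mu_g - mu_z
    if n_z * mu_out <= n_out * mu_z:   # inside rate <= outside rate
        return 0.0
    out = n_out * math.log(n_out / mu_out) if n_out > 0 else 0.0
    return n_z * math.log(n_z / mu_z) + out - n_g * math.log(n_g / mu_g)

def max_log_lr(counts, measure, zones):
    """Scan statistic: maximum log LR over all candidate zones."""
    n_g, mu_g = sum(counts), sum(measure)
    return max(
        log_lr(sum(counts[i] for i in z), sum(measure[i] for i in z), n_g, mu_g)
        for z in zones
    )

def monte_carlo_p_value(counts, measure, zones, n_sim=999, seed=0):
    """Rank-based Monte Carlo p-value, simulating under H0 given n_G."""
    rng = random.Random(seed)
    observed = max_log_lr(counts, measure, zones)
    n_g = sum(counts)
    cells = range(len(measure))
    exceed = 0
    for _ in range(n_sim):
        sim = [0] * len(measure)
        # Each of the n_G points lands in cell i with probability mu_i / mu(G).
        for i in rng.choices(cells, weights=measure, k=n_g):
            sim[i] += 1
        if max_log_lr(sim, measure, zones) >= observed:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)
```

With n_sim = 999 replicates, the smallest attainable p-value is 0.001.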
Application of SSS to SIDS
• Bernoulli and Poisson models are
illustrated using the SIDS data from NC
• Data: for the 100 counties in NC, total # of live
births and # of SIDS cases for 1974-84.
• Live births range from 567 to 52345
• Locations of county seats are used as the
coordinates. The measure is the # of live births
in a county
Application to SIDS (continued)
• Zones for scanning window are circles centered
at a county coordinate point including at most
half of the total population
• Zones are circular only with respect to the aggregated data.
As circles around a county seat are drawn, each other
county is either entirely inside a zone
or entirely outside it, depending on whether its
county seat is within the circle or not
Bernoulli model for SIDS
• The Bernoulli model is very natural. Each birth
can correspond to at most one SIDS case. Table
1 summarizes the results of the analysis.
• From Figure 1, the most likely cluster, A,
consists of Bladen, Columbus, Hoke,
Robeson, and Scotland counties.
• Using a conservative test, a secondary
cluster, B, consists of Halifax, Hertford
and Northampton counties.
Poisson model for SIDS
• For a rare disease such as SIDS, the Poisson model
gives a close approximation to the Bernoulli model.
Results are reported in Table 1
• Both models detect the same clusters
• P-values for the primary cluster are the same
for both models; p-values for the
secondary cluster are very close
Application to SIDS (continued)
Two significant clusters based on
SSS
SSS adjusted for Race
• For SIDS one useful covariate is race
• Race is related to SIDS through
unobserved covariates such as quality of
housing, access to health care
• Overall incidence of SIDS for white
children is 1.512 per 1000 and for black
children is 2.970 per 1000.
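One way to fold these race-specific rates into the measure is to use the expected number of cases per county in place of the raw number of births. The sketch below only illustrates that idea; the function name and the county birth counts in the usage example are hypothetical, and only the two rates come from the slide.

```python
# Race-specific SIDS incidence per live birth, from the figures quoted above.
RATE_WHITE = 1.512 / 1000
RATE_BLACK = 2.970 / 1000

def expected_cases(white_births, black_births):
    """Race-adjusted expected number of SIDS cases for one county."""
    return RATE_WHITE * white_births + RATE_BLACK * black_births

# Hypothetical county: same total births, different racial composition.
mostly_white = expected_cases(9000, 1000)
mostly_black = expected_cases(1000, 9000)
```

Two counties with the same number of births then carry different measure, so a high raw count in the second county is less surprising than in the first.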
SSS: race-adjusted (continued)
• Racial distribution differs widely among the
counties in NC
• This analysis leads to the same primary
cluster (see Figure 2)
• The previous secondary cluster disappears,
but a new secondary cluster, C, emerges.
Cluster C consists of several counties
in the western part of the state
Application to SIDS (continued)
SSS to SIDS adjusted for race
A Bayesian alternative to SSS
• Scott and Berger (2006): idea of Bayesian multiple testing.
• Observe Xj ~ N(µj, σ²), j=1,…,M.
• To determine which µj are nonzero, we have M
(conditionally) independent tests, each testing
H0j: µj = 0 vs. H1j: µj ≠ 0
• p0 = prior probability that µj is zero
• Crucial point here: let the data estimate p0.
• S&B use the hierarchical model
1. Xj | µj, σ², γj ~ N(γjµj, σ²), independently
2. µj | τ² ~ I.I.D. N(0, τ²), γj | p0 ~ I.I.D. Bern(1-p0)
3. (τ², σ²) ~ π(τ², σ²) = (τ² + σ²)^(-2), p0 ~ π(p0)
Several choices for π(p0): Uniform, Beta(a,1)
S&B computed the posterior probability that γj = 1.
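To see how the hierarchy produces a posterior probability for γj = 1, consider the simplified case in which σ², τ² and p0 are held fixed. This is an assumption made only for illustration; S&B place priors on these quantities and integrate over them. Marginally Xj ~ N(0, σ²) when γj = 0 and Xj ~ N(0, σ² + τ²) when γj = 1, so Bayes' rule gives the probability directly. The function names are hypothetical.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2 * math.pi * var)

def posterior_nonzero_prob(x, sigma2, tau2, p0):
    """P(gamma_j = 1 | X_j = x) for fixed sigma^2, tau^2 and p0."""
    null = p0 * normal_pdf(x, sigma2)               # gamma_j = 0 branch
    alt = (1 - p0) * normal_pdf(x, sigma2 + tau2)   # gamma_j = 1 branch
    return alt / (null + alt)
```

An observation near zero favors γj = 0, while an observation far from zero relative to σ pushes the probability toward 1.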
Modification of S&B Model
• Assume Xj ~ N(µj, σ²), j=1,…,M.
• To determine which µj are positive, we have M
(conditionally) independent tests, each testing
H0j: µj = 0 vs. H1j: µj > 0
• As before
1. Xj | µj, σ², γj ~ N(γjµj, σ²), independently
2. µj | µ(-j), ρ, τ² ~ N(ρ ∑k qjk µk, τ²) [CAR],
γj | pj ~ Ind. Bern(1-pj)
3. (τ², σ², ρ) ~ π(τ², σ², ρ) = (τ² + σ²)^(-2)
4. CAR model on logit(pj)
Compute the posterior probability that µj > 0.