Bayesian Biosurveillance Using Multiple Data Streams

Greg Cooper, Weng-Keen Wong, Denver Dash*, John Levander,
John Dowling, Bill Hogan, Mike Wagner
RODS Laboratory, University of Pittsburgh
* Intel Research, Santa Clara
© 2004 University of Pittsburgh
Outline
1. Introduction
2. Model
3. Inference
4. Conclusions
Over-the-Counter (OTC) Data Being Collected by the
National Retail Data Monitor (NRDM)
• 19,000 stores
• 50% market share nationally
• >70% market share in large cities
ED Chief Complaint Data Being Collected by RODS
Chief Complaint ED Records for Allegheny County

Date / Time Admitted   Age     Gender   Home Zip   Work Zip   Chief Complaint
Nov 1, 2004 3:02       20-30   Male     15213                 Shortness of breath
Nov 1, 2004 3:09       70-80   Female   15132      15213      Fever
:                      :       :        :          :          :
Objective
Using the ED and OTC data streams, detect a
disease outbreak in a given region as quickly
and accurately as possible
Our Approach
Population-wide ANomaly Detection
and Assessment (PANDA)
• A detection algorithm that models each
individual in the population
• Combines ED and OTC data streams
• The current prototype focuses on detecting
an outdoor aerosolized release of an
anthrax-like agent in Allegheny County
PANDA
Uses a causal Bayesian network
[Figure: network over four variables: Location of Anthrax Release, Home Location of Person, Anthrax Infection of Person, Visit of Person to ED]
Bayesian Network: A graphical model representing the
joint probability distribution of a set of random variables
The arrows convey conditional independence relationships among the
variables. They also represent causal relationships.
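To make the definition concrete, the following minimal sketch encodes a toy version of the four-variable network and computes a posterior by enumeration. The edge structure (release location and home location as parents of infection, infection as parent of the ED visit), the two-zone geography, and every probability value are illustrative assumptions, not parameters taken from PANDA.

```python
# Hypothetical two-zone world; every number below is made up for illustration.
P_RELEASE_LOC = {"none": 0.98, "north": 0.01, "south": 0.01}  # prior on release location
P_HOME = {"north": 0.5, "south": 0.5}                          # prior on home location


def p_infected(release_loc, home):
    """P(Anthrax Infection = true | release location, home location)."""
    if release_loc == "none":
        return 0.0
    return 0.30 if release_loc == home else 0.01


def p_ed_visit(infected):
    """P(Visit to ED = true | infection status)."""
    return 0.80 if infected else 0.02


def joint(release_loc, home, infected, visit):
    """Joint probability: product of each node's conditional given its parents."""
    p_inf = p_infected(release_loc, home)
    p_vis = p_ed_visit(infected)
    return (
        P_RELEASE_LOC[release_loc]
        * P_HOME[home]
        * (p_inf if infected else 1.0 - p_inf)
        * (p_vis if visit else 1.0 - p_vis)
    )


def posterior_release(home, visit):
    """P(release location | home location, ED visit), summing out infection."""
    scores = {
        loc: sum(joint(loc, home, inf, visit) for inf in (True, False))
        for loc in P_RELEASE_LOC
    }
    total = sum(scores.values())
    return {loc: score / total for loc, score in scores.items()}


post = posterior_release(home="north", visit=True)
```

With these made-up numbers, an ED visit by a resident of the north zone raises the probability of a release in the north well above its prior, which is the kind of update the full model performs over the whole population.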
A Schematic of the Generic PANDA Model
for Non-Contagious Diseases
[Figure: Population Risk Factors; Population Disease Exposure (PDE); one Person Model per individual in the population; Population-Wide Evidence]
A Special Case of the Generic Model
[Figure: Anthrax Release; Location of Release; Time of Release; one Person Model per individual; OTC Sales for Region]
Each person in the population is represented as a
subnetwork in the overall model
The Person Model
[Figure: person model network. Nodes: Location of Release; Time Of Release; Age Decile; Home Zip; Gender; Anthrax Infection; Other ED Disease; Acute Respiratory Infection; Non-ED Acute Respiratory Infection; ED Acute Respiratory Infection; Respiratory from Anthrax; Respiratory CC From Other; Respiratory CC; Respiratory CC When Admitted; ED Admit from Anthrax; ED Admit from Other; ED Admission; Daily OTC Purchase; Last 3 Days OTC Purchase; OTC Sales for Region]
Why Use a Population-Based Approach?
1. Representational power
• Spatial, temporal, demographic, and symptom
knowledge of potential diseases can be coherently
represented in a single model
• Spatial, temporal, demographic, and symptom
evidence can be combined to derive a posterior
probability of a disease outbreak
2. Representational flexibility
• New types of knowledge and evidence can be readily
incorporated into the model
Hypothesis: A population-based approach will achieve
better detection performance than non-population-based approaches.
The Person Model
[Figure: the person model network, as on the earlier slide]
Equivalence Class Example:

Age Decile   Gender   Home Zip   Respiratory Chief Comp.   Date Admitted
20-30        Male     15213      Yes                       Today
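People who share the same values for these attributes are interchangeable in the model, so they can be counted once per class rather than handled one by one. A minimal sketch of that grouping follows; the `Person` type, its field names, the encoding of the chief-complaint field, and the sample records are illustrative assumptions, not PANDA's data structures.

```python
from collections import Counter
from typing import NamedTuple


class Person(NamedTuple):
    age_decile: str
    gender: str
    home_zip: str
    respiratory_cc: str  # assumed encoding: "Yes" or "No"
    date_admitted: str


# Tiny made-up population for illustration
population = [
    Person("20-30", "Male", "15213", "Yes", "Today"),
    Person("20-30", "Male", "15213", "Yes", "Today"),
    Person("70-80", "Female", "15132", "No", "Today"),
]

# Identical tuples hash to the same key, so each key is one equivalence class
# and each value is the number of people in that class.
class_sizes = Counter(population)
```

Inference can then loop over the keys of `class_sizes` instead of over every individual, which is what makes a population of this size manageable.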
Inference
[Figure: Anthrax Release; Location of Release; Time of Release; one Person Model per individual; OTC Sales for Region]
Derive P ( Anthrax Release = true | OTC Sales Data & ED Data )
Inference
AR = Anthrax Release
PDE = Population Disease Exposure
ED = ED Data
OTC = OTC Counts
Key Term in Deriving P ( AR | OTC, ED ) :
P ( OTC, ED | PDE ) = P ( OTC | ED, PDE ) P ( ED | PDE )
• P ( OTC | ED, PDE ): contribution of OTC counts
• P ( ED | PDE ): contribution of ED data
Details in: Cooper GF, Dash DH, Levander J, Wong W-K, Hogan W, Wagner M.
Bayesian Biosurveillance of Disease Outbreaks. In: Proceedings of the Conference
on Uncertainty in Artificial Intelligence, 2004.
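A sketch of how the key term could enter the posterior, assuming that the data depend on AR only through PDE (as the schematic of the generic model suggests). The prior terms P ( PDE | AR ) and P ( AR ) are placeholders that these slides do not spell out; the cited paper gives the actual derivation.

```latex
P(AR \mid OTC, ED) =
\frac{P(AR) \sum_{PDE} P(OTC, ED \mid PDE)\, P(PDE \mid AR)}
     {\sum_{AR'} P(AR') \sum_{PDE} P(OTC, ED \mid PDE)\, P(PDE \mid AR')}
```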
The remainder of this talk focuses on the OTC term, P ( OTC | ED, PDE ).
Incorporating the Counts of OTC Purchases
[Figure: the per-person OTC counts for Zip1 (Person1 to Person4) roll up into equivalence-class OTC counts (Eq Class1 and Eq Class2), which roll up into the Zip1 OTC count]
Approximate the binomial distribution with a normal distribution
The PANDA OTC Model
\[ P(\text{OTC sales} = X \mid ED, PDE) \approx \mathrm{Normal}\Big(X;\ \sum_{E_i} \mu_{E_i},\ \sum_{E_i} \sigma^2_{E_i}\Big) \]
Recall that:
P ( OTC, ED | PDE ) = P ( OTC | ED, PDE ) P ( ED | PDE )
Example
Equivalence Class 1 ~ Normal(100,100)
Equivalence Class 2 ~ Normal(150,225)

Equivalence Class   Age Decile   Gender   Home Zip   Respiratory Chief Comp.   Date Admitted
1                   50-60        Male     15213      Yes                       Today
2                   50-60        Female   15213      Yes                       Today

If these were the only 2 Equivalence Classes in the County then
County Cough & Cold OTC ~ Normal(100+150,100+225)
[Plot: the normal densities above; horizontal axis from 0 to 350, vertical axis from 0 to 0.045]
Example
Now suppose 260 units are sold in the county
P( OTC Sales = 260 | ED Data, PDE ) =
Normal( 260; 250, 325 ) ≈ 0.0190
(density at 260 of a normal with mean 250 and variance 325)
[Plot: the Normal(250, 325) density with the observed count of 260 marked on the horizontal axis]
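The arithmetic in this example can be checked with a short sketch. It assumes that the second argument of Normal is a variance, as the class parameters suggest, and that the classes are independent so that means and variances add. The function name `normal_pdf` is illustrative.

```python
import math


def normal_pdf(x, mean, variance):
    """Density of a normal distribution parameterized by mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


# (mean, variance) for each equivalence class in the example
classes = [(100.0, 100.0), (150.0, 225.0)]

# Sum of independent normals: the means add and the variances add
county_mean = sum(mean for mean, _ in classes)          # 250
county_var = sum(variance for _, variance in classes)   # 325

likelihood = normal_pdf(260.0, county_mean, county_var)
print(f"{likelihood:.4f}")  # prints 0.0190
```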
Inference Timing
Machine: P4 3 Gigahertz, 2 GB RAM

Model              Initialization Time (seconds)   Each hour of data (seconds)
ED model           55                              5
ED and OTC model   229                             5
A Current Limitation
• Problem: Currently we assume
unrealistically that a person only
makes OTC purchases in his or her
home zip code
• Approach 1: Aggregate OTC counts
(e.g., at the county level)
• Approach 2: For each home zip
code, model the distribution of zip
codes where OTC purchases are
made
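A minimal sketch of Approach 2, assuming a hypothetical table `purchase_zip_given_home` whose entry for a (home zip, store zip) pair is the probability that a purchase by a resident of the home zip is made in the store zip. The zip codes are reused from earlier slides only as labels, and all numbers are made up. Only expected counts are redistributed here; a fuller treatment would also propagate the variances.

```python
# Expected OTC purchases generated by residents of each home zip (illustrative)
expected_by_home = {"15213": 100.0, "15132": 150.0}

# P(store zip | home zip); each row sums to 1 (illustrative values)
purchase_zip_given_home = {
    "15213": {"15213": 0.7, "15132": 0.3},
    "15132": {"15213": 0.1, "15132": 0.9},
}


def expected_by_store(expected_by_home, purchase_zip_given_home):
    """Redistribute expected purchases from home zips to store zips."""
    totals = {}
    for home, mean in expected_by_home.items():
        for store, prob in purchase_zip_given_home[home].items():
            totals[store] = totals.get(store, 0.0) + mean * prob
    return totals


store_means = expected_by_store(expected_by_home, purchase_zip_given_home)
```

Because each row of the table sums to 1, the county-wide expected total is unchanged, which is why Approach 1 (aggregating at the county level) sidesteps the problem entirely at the cost of spatial resolution.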
Challenges in Population-Wide
Modeling Include …
• Obtaining good parameter estimates to use
in modeling (e.g., the probability of an OTC
cough medication purchase given an acute
respiratory illness)
• Modeling time and space in a way that is
both useful and computationally tractable
• Modeling contagious diseases
Conclusions
• PANDA is a multivariate algorithm that can
combine multiple data streams
• Modeling each individual in the population
is computationally feasible (so far)
• An evaluation of the PANDA approach to
modeling multiple data streams is in
progress using semi-synthetic test data
Thank you
Current funding:
National Science Foundation
Department of Homeland Security
Earlier funding:
DARPA
http://www.cbmi.pitt.edu/panda/
gfc@cbmi.pitt.edu
The PANDA OTC Model
Model the OTC purchases for each
Equivalence Class E_i as a binomial
distribution:
\[ E_i \sim \mathrm{Binomial}(N_{E_i}, P_{E_i}) \]
• N_{E_i}: number of people in Equivalence Class E_i
• P_{E_i}: probability of an OTC cough medication purchase during
the previous 3 days by each person in Equivalence Class E_i
Approximate the binomial distribution
as a normal distribution:
\[ E_i \sim \mathrm{Binomial}(N_{E_i}, P_{E_i}) \approx \mathrm{Normal}(\mu_{E_i}, \sigma^2_{E_i}) \]
\[ \mu_{E_i} = N_{E_i} \times P_{E_i} \]
\[ \sigma^2_{E_i} = N_{E_i} \times P_{E_i} \times (1 - P_{E_i}) \]
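To see how close the approximation can be, the following sketch compares an exact binomial probability with the normal density at the same count. The class size and purchase probability are illustrative values, not PANDA parameters, and the helper names are hypothetical.

```python
import math


def normal_params(n_people, p_purchase):
    """Mean and variance of the normal approximation to Binomial(n, p)."""
    mean = n_people * p_purchase
    variance = n_people * p_purchase * (1.0 - p_purchase)
    return mean, variance


def binomial_pmf(k, n, p):
    """Exact probability of k purchases among n people."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def normal_pdf(x, mean, variance):
    """Density of a normal distribution parameterized by mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


n_people, p_purchase = 1000, 0.1  # hypothetical equivalence class
mean, variance = normal_params(n_people, p_purchase)

exact = binomial_pmf(100, n_people, p_purchase)
approx = normal_pdf(100, mean, variance)
```

For a class of this size the two values agree closely. The approximation is weaker for very small classes or very small purchase probabilities, which is worth keeping in mind when classes contain only a few people.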
Computational Cost of a Population-Wide Approach?
~1.4 million people in Allegheny County, Pennsylvania
Equivalence Classes
The ~1.4M people in the modeled population can be partitioned
into approximately 24,240 equivalence classes