What's Strange About Recent Events (WSARE)

Weng-Keen Wong (Carnegie Mellon University)
Andrew Moore (Carnegie Mellon University)
Gregory Cooper (University of Pittsburgh)
Michael Wagner (University of Pittsburgh)
DIMACS Tutorial on Statistical and Other Analytic Health Surveillance Methods
Motivation
Suppose we have access to Emergency Department data from hospitals around a city (with patient confidentiality preserved).
Primary Key | Date   | Time  | Hospital | ICD9 | Prodrome    | Gender | Age | Home Location | Work Location | Many more…
100         | 6/1/03 | 9:12  | 1        | 781  | Fever       | M      | 20s | NE            | ?             | …
101         | 6/1/03 | 10:45 | 1        | 787  | Diarrhea    | F      | 40s | NE            | NE            | …
102         | 6/1/03 | 11:03 | 1        | 786  | Respiratory | F      | 60s | NE            | N             | …
103         | 6/1/03 | 11:07 | 2        | 787  | Diarrhea    | M      | 60s | E             | ?             | …
104         | 6/1/03 | 12:15 | 1        | 717  | Respiratory | M      | 60s | E             | NE            | …
105         | 6/1/03 | 13:01 | 3        | 780  | Viral       | F      | 50s | ?             | NW            | …
106         | 6/1/03 | 13:05 | 3        | 487  | Respiratory | F      | 40s | SW            | SW            | …
107         | 6/1/03 | 13:57 | 2        | 786  | Unmapped    | M      | 50s | SE            | SW            | …
108         | 6/1/03 | 14:22 | 1        | 780  | Viral       | M      | 40s | ?             | ?             | …
:           | :      | :     | :        | :    | :           | :      | :   | :             | :             | :
The Problem
From this data, can we detect if a disease outbreak is happening? How early can we detect it?
We're talking about nonspecific disease detection.
The question we're really asking: in the last n hours, has anything strange happened?
Traditional Approaches
What about using traditional anomaly detection?
• Typically assumes the data is generated by a model
• Finds individual data points that have low probability with respect to this model
• These outliers have rare attributes or combinations of attributes
• But we need to identify anomalous patterns, not isolated data points
Traditional Approaches
What about monitoring aggregate daily counts of certain attributes?

[Chart: Number of ED Visits per Day, plotted against Day Number]

• We've now turned multivariate data into univariate data
• Lots of algorithms have been developed for monitoring univariate data:
– Time series algorithms
– Regression techniques
– Statistical Quality Control methods
• But we need to know a priori which attributes to form daily aggregates for!
Traditional Approaches
What if we don't know what attributes to monitor?
What if we want to exploit the spatial, temporal and/or demographic characteristics of the epidemic to detect the outbreak as early as possible?
Traditional Approaches
We need to build a univariate detector to monitor each interesting combination of attributes:
• Diarrhea cases among children
• Respiratory syndrome cases among females
• Viral syndrome cases involving senior citizens from the eastern part of the city
• Number of children from the downtown hospital
• Number of cases involving people working in the southern part of the city
• Number of cases involving teenage girls living in the western part of the city
• Botulinic syndrome cases
• And so on…
Traditional Approaches
To monitor each interesting combination of attributes this way, you'll need hundreds of univariate detectors!
We would like to identify the groups with the strangest behavior in recent events.
Our Approach
• We use Rule-Based Anomaly Pattern Detection
• Association rules are used to characterize anomalous patterns. For example, a two-component rule would be (see the sketch after this slide):
Gender = Male AND 40 ≤ Age < 50
• Related work:
– Market basket analysis [Agrawal et al., Brin et al.]
– Contrast sets [Bay and Pazzani]
– Spatial Scan Statistic [Kulldorff]
– Association Rules and Data Mining in Hospital Infection Control and Public Health Surveillance [Brossette et al.]
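The slides leave the rule representation implicit. As a minimal sketch (the field names and helpers below are illustrative, not from the paper), a rule can be held as a list of attribute = value conditions over categorical records:

```python
# Minimal sketch (not the paper's code): a WSARE-style rule as a
# conjunction of attribute=value conditions over categorical records.
from typing import Dict, List, Tuple

Rule = List[Tuple[str, str]]  # e.g. [("Gender", "M"), ("AgeDecile", "40s")]

def matches(record: Dict[str, str], rule: Rule) -> bool:
    """True if the record satisfies every condition in the rule."""
    return all(record.get(attr) == val for attr, val in rule)

def count_matches(records: List[Dict[str, str]], rule: Rule) -> int:
    """How many records in a dataset satisfy the rule."""
    return sum(matches(r, rule) for r in records)

# The two-component rule "Gender = Male AND 40 <= Age < 50", with age
# pre-discretized into decades as in the ED table above.
rule = [("Gender", "M"), ("AgeDecile", "40s")]
```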
WSARE v2.0
• Inputs:
1. Multivariate date/time-indexed biosurveillance-relevant data stream ("Emergency Department data", as in the table above)
2. Time window length ("Last 24 hours")
3. Which attributes to use? ("Ignore key")
WSARE v2.0
• Inputs:
1. Multivariate date/time-indexed biosurveillance-relevant data stream
2. Time window length
3. Which attributes to use?
• Outputs:
1. Here are the records that most surprise me
2. Here's why
3. And here's how seriously you should take it
WSARE v2.0 Overview
1. Obtain Recent and Baseline datasets (split All Data into Recent Data and Baseline)
2. Search for the rule with the best score
3. Determine the p-value of the best scoring rule through a randomization test
4. If the p-value is less than a threshold, signal an alert
Step 1: Obtain Recent and Baseline Data
Recent Data: data from the last 24 hours.
Baseline: assumed to capture non-outbreak behavior. We use data from 35, 42, 49 and 56 days prior to the current day.
Step 2: Search for Best Scoring Rule
For each rule, form a 2x2 contingency table, e.g.:

               | Count_Recent | Count_Baseline
Age Decile = 3 |      48      |       45
Age Decile ≠ 3 |      86      |      220

• Perform Fisher's Exact Test to get a p-value for each rule ⇒ call this p-value the "score" (sketched below)
• Take the rule with the lowest score. Call this rule R_BEST.
• This score is not the true p-value of R_BEST, because we perform multiple hypothesis tests each day to find the rule with the best score
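As a concrete sketch of the scoring step, SciPy's implementation of Fisher's Exact Test gives the score for the example table above:

```python
# Sketch: score one rule with Fisher's Exact Test on its 2x2 table of
# counts in the recent vs. baseline data (numbers from the example above).
from scipy.stats import fisher_exact

table = [[48, 45],   # Age Decile = 3:  recent, baseline
         [86, 220]]  # Age Decile != 3: recent, baseline
_, score = fisher_exact(table)  # two-sided p-value, used as the rule's score
print(score)
```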
The Multiple Hypothesis Testing Problem
• Suppose we reject the null hypothesis when score < α, where α = 0.05
• For a single hypothesis test, the probability of making a false discovery = α
• Suppose we do 1000 tests, one for each possible rule
• Probability(false discovery) could be as bad as: 1 − (1 − 0.05)^1000 >> 0.05
Step 3: Randomization Test

[Illustration: cases C2–C15 shown before and after their date fields are shuffled among the records]

• Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DB_Rand
• Find the rule with the best score on DB_Rand
Step 3: Randomization Test
Repeat the procedure on the previous slide for 1000 iterations. Determine how many scores from the 1000 iterations are better than the original score.
If the original score placed in the top 1% of the 1000 scores from the randomization test, we would be impressed and an alert should be raised.
Estimated p-value of the rule = # better scores / # iterations
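A minimal sketch of the whole test, assuming a hypothetical helper best_score(recent, baseline) that runs the Step 2 rule search and returns the lowest Fisher's-exact p-value it finds:

```python
# Sketch of the randomization test. Shuffling the date field amounts to
# randomly re-assigning which records count as recent vs. baseline.
import random

def randomization_test(recent, baseline, best_score, n_iter=1000):
    original = best_score(recent, baseline)  # score of R_BEST on real data
    pool = recent + baseline
    n_recent = len(recent)
    better = 0
    for _ in range(n_iter):
        random.shuffle(pool)  # produce DB_Rand
        if best_score(pool[:n_recent], pool[n_recent:]) <= original:
            better += 1
    return better / n_iter  # estimated p-value: # better scores / # iterations
```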
Two Kinds of Analysis
Day by Day: if we want to run WSARE just for the current day… then we end here.
Historical Analysis: if we want to review all previous days and their p-values for several years, and control for some percentage of false positives… then we'll once again run into overfitting problems… we need to compensate for multiple hypothesis testing, because we perform a hypothesis test on each day in the history.
We only need to do this for historical analysis!
False Discovery Rate [Benjamini and Hochberg]
• Can determine which of these p-values are significant
• Specifically, given an α_FDR, FDR guarantees that

(# false positives) / (# tests in which the null hypothesis was rejected) ≤ α_FDR

• Given an α_FDR, FDR produces a threshold below which any p-values in the history are considered significant, as sketched below
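As a sketch, the Benjamini-Hochberg threshold over the history's daily p-values can be computed like this (the toy p-values are illustrative):

```python
# Sketch of the Benjamini-Hochberg FDR procedure applied to the
# per-day p-values from a historical analysis.
def bh_threshold(p_values, alpha_fdr):
    """Largest p-value threshold that keeps the estimated FDR <= alpha_fdr."""
    m = len(p_values)
    threshold = 0.0
    for k, p in enumerate(sorted(p_values), start=1):
        if p <= alpha_fdr * k / m:  # BH condition: p_(k) <= (k/m) * alpha
            threshold = p
    return threshold  # every day with p-value <= threshold is significant

daily_p = [0.001, 0.2, 0.008, 0.04, 0.9]  # toy example
print(bh_threshold(daily_p, alpha_fdr=0.05))  # -> 0.008
```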
WSARE v3.0
WSARE v2.0 Review
1. Obtain Recent and Baseline datasets (split All Data into Recent Data and Baseline)
2. Search for the rule with the best score
3. Determine the p-value of the best scoring rule through a randomization test
4. If the p-value is less than a threshold, signal an alert
Obtaining the Baseline
Recall that the baseline was assumed to be captured by data from 35, 42, 49, and 56 days prior to the current day.
What if this assumption isn't true? What if data from 7, 14, 21 and 28 days prior is better?
We would like to determine the baseline automatically!
Temporal Trends
• But health care data has many different trends, due to:
– Seasonal effects in temperature and weather
– Day-of-week effects
– Holidays
– Etc.
• Allowing the baseline to be affected by these trends may dramatically alter the detection time and false positive rate of the detection algorithm
Temporal Trends
[Figure from: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp. 5237-5249)]
WSARE v3.0
Generate the baseline…
• "Taking into account recent flu levels…"
• "Taking into account that today is a public holiday…"
• "Taking into account that this is Spring…"
• "Taking into account the recent heatwave…"
• "Taking into account that there's a known natural foodborne outbreak in progress…"
Bonus: more efficient use of historical data
Conditioning on observed environment: well understood for univariate time series

[Chart: a univariate signal plotted over time]

Example signals:
• Number of ED visits today
• Number of ED visits this hour
• Number of respiratory cases today
• School absenteeism today
• Nyquil sales today
An easy case

[Chart: signal over time, with its mean and an upper safe range marked]

Dealt with by Statistical Quality Control: record the mean and standard deviation up to the current time, and signal an alarm if we go outside 3 sigmas.
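A minimal sketch of this detector on a daily count (the counts are illustrative):

```python
# Sketch of the easy case: a 3-sigma Statistical Quality Control chart
# on a univariate daily count.
import statistics

def sqc_alarm(history, today):
    """Alarm if today's count leaves the mean +/- 3-sigma safe range."""
    mean = statistics.mean(history)
    sigma = statistics.stdev(history)
    return abs(today - mean) > 3 * sigma

counts = [31, 28, 35, 30, 33, 29, 32]  # ED visits on previous days
print(sqc_alarm(counts, today=55))     # True: outside the safe range
```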
Conditioning on Seasonal Effects

[Chart: a seasonal signal over time, with a fitted curve and confidence band]

Fit a periodic function (e.g. a sine wave) to previous data. Predict today's signal and its 3-sigma confidence intervals, and signal an alarm if we're off. This reduces false alarms from natural outbreaks: different times of year deserve different thresholds.
Example [Tsui et al.]
[Chart: weekly counts of P&I from week 1/98 to week 48/00]
From: "Value of ICD-9-Coded Chief Complaints for Detection of Epidemics", Fu-Chiang Tsui, Michael M. Wagner, Virginia Dato, Chung-Chou Ho Chang, AMIA 2000
Seasonal Effects with Long-Term Trend
[Chart: weekly counts of IS from week 1/98 to week 48/00]
From: "Value of ICD-9-Coded Chief Complaints for Detection of Epidemics", Fu-Chiang Tsui, Michael M. Wagner, Virginia Dato, Chung-Chou Ho Chang, AMIA 2000
Seasonal Effects with Long-Term Trend
This is called the Serfling Method [Serfling, 1963].
Fit a periodic function (e.g. a sine wave) plus a linear trend:
E[Signal] = a + bt + c sin(d + 2πt/365)
Good if there's a long-term trend in the disease or the population.
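A sketch of such a fit on synthetic daily counts, using scipy.optimize.curve_fit (the data, starting values and today's count are illustrative):

```python
# Sketch of a Serfling-style fit: linear trend plus an annual sine term.
import numpy as np
from scipy.optimize import curve_fit

def serfling(t, a, b, c, d):
    # E[Signal] = a + b*t + c*sin(d + 2*pi*t/365)
    return a + b * t + c * np.sin(d + 2 * np.pi * t / 365.0)

t = np.arange(3 * 365)                     # three years of daily counts
rng = np.random.default_rng(0)
y = serfling(t, 30, 0.01, 8, 1.2) + rng.normal(0, 2, t.size)  # toy data

params, _ = curve_fit(serfling, t, y, p0=[30.0, 0.0, 5.0, 0.0])
sigma = np.std(y - serfling(t, *params))   # residual spread
today = t[-1] + 1
alarm = abs(42 - serfling(today, *params)) > 3 * sigma  # today's count = 42
print(alarm)
```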
Day-of-week effects
[Chart: day-of-week pattern in daily counts]
From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp. 5237-5249)
Day-of-week effects
Another simple form of ANOVA: fit a day-of-week component:
E[Signal] = a + δ_day
e.g. δ_Mon = +5.42, δ_Tue = +2.20, δ_Wed = +3.33, δ_Thu = +3.10, δ_Fri = +4.02, δ_Sat = −12.2, δ_Sun = −23.42
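A minimal sketch of estimating these components, assuming daily counts paired with weekday labels:

```python
# Sketch of the day-of-week component: estimate a as the overall mean
# and delta_day as each weekday's mean deviation from it.
from collections import defaultdict

def fit_day_of_week(counts, weekdays):
    """counts[i] was observed on weekday weekdays[i] (e.g. 'Mon'..'Sun')."""
    a = sum(counts) / len(counts)
    by_day = defaultdict(list)
    for c, d in zip(counts, weekdays):
        by_day[d].append(c)
    delta = {d: sum(v) / len(v) - a for d, v in by_day.items()}
    return a, delta  # E[Signal] = a + delta[day]
```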
Analysis of variance (ANOVA)
• Good news: if you're tracking a daily aggregate (univariate data)… then ANOVA can take care of many of these effects.
• But… what if you're tracking a whole joint distribution of events?
Idea: Bayesian Networks
Bayesian Network: a graphical model representing the joint probability distribution of a set of random variables.
• "Patients from West Park Hospital are less likely to be young"
• "On cold Tuesday mornings, the folks coming in from the north part of the city are more likely to have respiratory problems"
• "On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon"
• "The Viral prodrome is more likely to co-occur with a Rash prodrome than with Botulinic"
WSARE Overview
1. Obtain Recent and Baseline datasets (split All Data into Recent Data and Baseline)
2. Search for the rule with the best score
3. Determine the p-value of the best scoring rule through a randomization test
4. If the p-value is less than a threshold, signal an alert
Obtaining Baseline Data
1. Learn a Bayesian network from all historical data
2. Generate the baseline given today's environment
The baseline is what should be happening today given today's environment.
Step 1: Learning the Bayes Net Structure
Involves searching over DAGs for the structure that maximizes a scoring function. The most common algorithm is hill climbing, starting from an initial structure and repeatedly applying one of 3 possible operations: add an arc, delete an arc, or reverse an arc.
But hill climbing is too slow, and single-link modifications may not find the correct structure (Xiang, Wong and Cercone 1997). We use Optimal Reinsertion (Moore and Wong 2002).
Optimal Reinsertion
1. Select a target node T in the current graph
2. Remove all arcs connected to T
3. Efficiently find the new in/out arcs
4. Choose the best new way to connect T
The Outer Loop
For NumJolts:
• Begin with a randomly corrupted version of the best DAG so far
• Until no change in the current DAG:
– Generate a random ordering of the nodes
– For each node in the ordering, do Optimal Reinsertion
Finally, run conventional hill climbing without the maxParams restriction.
How is Optimal Reinsertion done efficiently?

[Diagram: a candidate parent set {P1, P2, P3} feeding the target node T]

Scoring functions can be decomposed:

DagScore(D) = Σ_{i=1}^{m} NodeScore(PS(i) → i)

where PS(i) is the parent set of node i.
Efficiency Tricks
1. Create an efficient cache of NodeScore(PS → T) values using ADSearch [Moore and Schneider 2002]
2. Restrict PS → T combinations to those whose CPTs have maxParams or fewer parameters
3. Additional branch and bound is used to restrict the space by a further order of magnitude
Environmental Attributes
Divide the data into two types of attributes:
• Environmental attributes: attributes that cause trends in the data, e.g. day of week, season, weather, flu levels
• Response attributes: all other, non-environmental attributes
Environmental Attributes
When learning the Bayesian network structure, do not allow environmental attributes (Season, Day of Week, Weather, Flu Level) to have parents.
Why?
• We are not interested in predicting their distributions
• Instead, we use them to predict the distributions of the response attributes
Side benefit: we can speed up the structure search by avoiding DAGs that assign parents to the environmental attributes.
Step 2: Generate Baseline Given Today's Environment
Suppose we know the following for today: Season = Winter, Day of Week = Monday, Weather = Snow, Flu Level = High.
We fill in these values for the environmental attributes in the learned Bayesian network. We then sample 10000 records from the Bayesian network and make this data set the baseline.
Sampling is easy because the environmental attributes are at the top of the Bayes Net.
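A minimal sketch of this sampling step, assuming the learned network is available as topologically ordered nodes with conditional probability tables cpt[node] (all names here are hypothetical):

```python
# Sketch of baseline generation: ancestral sampling from the learned
# network with today's environmental attributes clamped. The
# environmental nodes have no parents, so we fix them and sample the
# remaining nodes in topological order.
import random

def sample_baseline(order, parents, cpt, environment, n=10000):
    """order: nodes in topological order; environment: clamped values,
    e.g. {"Season": "Winter", "DayOfWeek": "Monday", ...}."""
    baseline = []
    for _ in range(n):
        record = dict(environment)
        for node in order:
            if node in record:
                continue  # environmental attribute: already fixed
            key = tuple(record[p] for p in parents[node])
            values, probs = zip(*cpt[node][key].items())
            record[node] = random.choices(values, weights=probs)[0]
        baseline.append(record)
    return baseline
```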
Why not use inference?
• With sampling, we create the baseline data and then use it to obtain the p-value of the rule for the randomization test
• If we used inference, we would not be able to perform the same randomization test, and we would need to find some other way to correct for the multiple hypothesis testing
• Sampling was chosen for its simplicity
But there may be clever things to do with inference which may help us. File this under future work.
Simulation
A city with 9 regions and a different population in each region:

NW 100 | N 400 | NE 500
W 100  | C 200 | E 300
SW 200 | S 200 | SE 600
For each day, sample the city's environment from the following Bayesian network:

[Diagram: a network over Date, Day of Week, Season, Weather, Flu Level, Region Food Condition and Region Anthrax Concentration, where each of Weather, Flu Level, Region Food Condition and Region Anthrax Concentration also depends on its previous day's value]
Simulation

[Diagram: Bayesian network for an individual, with environment nodes (FLU LEVEL, DAY OF WEEK, AGE, GENDER, DATE, REGION, SEASON, WEATHER, Region Anthrax Concentration, Region Grassiness, Region Food Condition), hidden nodes (Outside Activity, Immune System, Heart Health, Has Flu, Has Allergy, Has Cold, Has Sunburn, Has Food Poisoning, Has Heart Attack, Has Anthrax, Disease, Actual Symptom) and output nodes (ACTION, REPORTED SYMPTOM, DRUG)]

For each person in a region, sample their profile.
Visible Environmental Attributes

[Same diagram: the capitalized nodes (FLU LEVEL, DAY OF WEEK, AGE, GENDER, DATE, REGION, SEASON, WEATHER, ACTION, REPORTED SYMPTOM, DRUG) are the attributes visible in the output data]
Simulation

[Same diagram, highlighting the disease nodes]

Diseases: allergy, cold, sunburn, flu, food poisoning, heart problems, anthrax (in order of precedence).
Simulation

[Same diagram, highlighting the ACTION node]

Actions: None, Purchase Medication, ED Visit, Absent. If the action is not None, output a record to the dataset.
Simulation Plot

[Plot: daily counts over the simulated period; the anthrax release is not the highest peak]
Simulation
• 100 different data sets
• Each data set consisted of a two-year period
• The anthrax release occurred at a random point during the second year
• Algorithms were allowed to train on data from the current day back to the first day in the simulation
• Any alerts before the actual anthrax release are considered false positives
• Detection time is calculated as the first alert after the anthrax release; if no alerts are raised, detection time is capped at 14 days
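As a sketch, the evaluation rule above for a single simulated run reduces to a few lines:

```python
# Sketch of the evaluation: alerts before the release are false
# positives; detection time is the first alert at or after the release,
# capped at 14 days if nothing fires.
def evaluate(alert_days, release_day, cap=14):
    false_positives = sum(1 for d in alert_days if d < release_day)
    hits = [d for d in alert_days if d >= release_day]
    detection_time = min(hits) - release_day if hits else cap
    return false_positives, min(detection_time, cap)
```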
Other Algorithms used in Simulation
1. Standard algorithm: the 3-sigma Statistical Quality Control chart shown earlier (mean and upper safe range on a univariate signal)
2. WSARE 2.0
3. WSARE 2.5: use all past data, but condition on the environmental attributes
Results on Simulation

[Results chart]
Conclusion
• One approach to biosurveillance: one algorithm monitoring millions of signals derived from multivariate data, instead of hundreds of univariate detectors
• WSARE is best used as a general-purpose safety net in combination with other detectors
• Modeling historical data with Bayesian networks allows conditioning on the unique features of today
• Computationally intense unless we use clever algorithms
Conclusion
• WSARE 2.0 was deployed during the past year
• WSARE 3.0 is about to go online
• WSARE is now being extended to additionally exploit over-the-counter medicine sales
For more information
References:
• Wong, W. K., Moore, A. W., Cooper, G., and Wagner, M. (2002). Rule-based Anomaly Pattern Detection for Detecting Disease Outbreaks. Proceedings of AAAI-02 (pp. 217-223). MIT Press.
• Wong, W. K., Moore, A. W., Cooper, G., and Wagner, M. (2003). Bayesian Network Anomaly Pattern Detection for Disease Outbreaks. Proceedings of ICML 2003.
• Moore, A., and Wong, W. K. (2003). Optimal Reinsertion: A New Search Operator for Accelerated and More Accurate Bayesian Network Structure Learning. Proceedings of ICML 2003.
AUTON lab website: http://www.autonlab.org/wsare
Email: wkw@cs.cmu.edu