DM16: Summarization and Deviation Detection

Summarization and Deviation Detection
What is new?
Outline
• Summarization
• KEFIR – Key Findings Reporter
• WSARE – What is Strange About Recent Events
What is New?
[Figure: old data compared with new data]
Summarization
Concisely summarize what is new, different, or unexpected
• with respect to previous values
• with respect to expected values
• …
Focus on what is actionable!
Problem: Healthcare Costs
• Healthcare costs in the US: 1 out of every 7 GDP dollars, and rising
  • potential problems: fraud, misuse, …
  • understanding where the problems are is the first step to fixing them
• GTE – self-insured for medical costs
  • GTE healthcare costs – $X00,000,000
• Task: Analyze employee healthcare data and generate a report that describes the major problems
GTE Key Findings Reporter: KEFIR
• KEFIR Approach:
  • Analyze all possible deviations
  • Select interesting findings
  • Augment key findings with:
    • Explanations of plausible causes
    • Recommendations of appropriate actions
  • Convert findings to a user-friendly report with text and graphics
KEFIR Search Space
Drill-Down Example
What Change Is Important?
Deviation Detection
• Drill down through the search space
• Generate a finding for each measure
  • deviation from previous period
  • deviation from norm
  • deviation projected for next period, if no action is taken
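The three deviation types above can be sketched roughly as follows. This is only an illustration: the function names, the dictionary layout, and in particular the linear projection for the next period are assumptions, not KEFIR's actual method, and the numbers in the usage lines are made up.

```python
def percent_deviation(current, reference):
    """Relative change of `current` with respect to `reference`."""
    return (current - reference) / reference


def make_finding(measure, previous, current, norm):
    """Build one finding for a measure (hypothetical layout).

    ASSUMPTION: the next period is projected by continuing the last
    period-to-period change linearly, i.e. "if no action is taken".
    """
    projected = current + (current - previous)
    return {
        "measure": measure,
        "vs_previous": percent_deviation(current, previous),
        "vs_norm": percent_deviation(current, norm),
        "projected_vs_norm": percent_deviation(projected, norm),
    }


# Illustrative values only (not from the GTE data).
finding = make_finding("admission rate per 1000",
                       previous=50.0, current=60.0, norm=48.0)
print(finding)
```

Drilling down would then amount to calling such a function for every measure in every sector of the search space and keeping the findings that matter most.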
Interestingness of Deviations
Impact: how much the deviation affects the bottom line
Savings Percentage: how much of the deviation from the norm
can be expected to be saved by the action
Recommendations
Hierarchical recommendation rules define appropriate intervention strategies for important measures and study areas.
Example:
If
  measure = admission rate per 1000 &
  study_area = Inpatient admissions &
  percent_change > 0.10
Then
  Utilization review is needed in the area of admission certification.
  Expected Savings: 20%
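A recommendation rule like the example can be encoded as plain data plus a matching function. The sketch below is a flat (non-hierarchical) simplification; the class names, the `impact` field, and the idea of estimating savings as impact times savings percentage are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    measure: str
    study_area: str
    percent_change: float
    impact: float  # hypothetical: dollars attributed to the deviation


@dataclass
class Rule:
    measure: str
    study_area: str
    min_percent_change: float
    recommendation: str
    savings_percentage: float

    def matches(self, finding):
        """True when the finding satisfies every condition of the rule."""
        return (finding.measure == self.measure
                and finding.study_area == self.study_area
                and finding.percent_change > self.min_percent_change)


ADMISSION_RULE = Rule(
    measure="admission rate per 1000",
    study_area="Inpatient admissions",
    min_percent_change=0.10,
    recommendation=("Utilization review is needed in the area of "
                    "admission certification."),
    savings_percentage=0.20,
)


def recommend(finding, rules):
    """Return (recommendation, expected savings) for the first matching rule.

    ASSUMPTION: expected savings = impact * savings percentage.
    """
    for rule in rules:
        if rule.matches(finding):
            return rule.recommendation, finding.impact * rule.savings_percentage
    return None
```

For a finding with a 15% increase and a hypothetical impact of 1,000,000, `recommend` would return the utilization-review text together with an expected saving of about 200,000.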
Explanation
A measure is explained by finding the path of related measures with the highest impact.
Example: "The large increase in m1 in group s1 was caused by an increase in m3, which was caused by a rise in m5, primarily in sector s13."
Report Generation
• Automatic generation of business-user-oriented reports
• Natural language generation with template matching
• Graphics
• Delivered via browser
Sample KEFIR pages
[Screenshots: Overview page, Inpatient admissions page]
Status
• Prototype implemented at GTE in 1995
• KEFIR received GTE’s highest award for technical achievement in 1995
• Key business user left GTE in 1996 and the system was no longer used
• Publication:
  • C. Matheus, G. Piatetsky-Shapiro, and D. McNeill, "Selecting and Reporting What is Interesting: The KEFIR Application to Healthcare Data," in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996
What’s Strange About Recent Events (WSARE)
Weng-Keen Wong (Carnegie Mellon University)
Andrew Moore (Carnegie Mellon University)
Gregory Cooper (University of Pittsburgh)
Michael Wagner (University of Pittsburgh)
http://www.autonlab.org/wsare
Designed to be easily applicable to any date/time-indexed biosurveillance-relevant data stream
Motivation
Suppose we have access to Emergency
Department data from hospitals around a city
(with patient confidentiality preserved)
Primary Key  Date    Time   Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more…
100          6/1/03  9:12   1         781   Fever        M       20s  NE             ?              …
101          6/1/03  10:45  1         787   Diarrhea     F       40s  NE             NE             …
102          6/1/03  11:03  1         786   Respiratory  F       60s  NE             N              …
103          6/1/03  11:07  2         787   Diarrhea     M       60s  E              ?              …
104          6/1/03  12:15  1         717   Respiratory  M       60s  E              NE             …
105          6/1/03  13:01  3         780   Viral        F       50s  ?              NW             …
106          6/1/03  13:05  3         487   Respiratory  F       40s  SW             SW             …
107          6/1/03  13:57  2         786   Unmapped     M       50s  SE             SW             …
108          6/1/03  14:22  1         780   Viral        M       40s  ?              ?              …
:            :       :      :         :     :            :       :    :              :              :
Traditional Approaches
We need to build a univariate detector to monitor each interesting combination of attributes:
• Number of cases involving people working in southern part of the city
• Diarrhea cases among children
• Respiratory syndrome cases among females
• Number of cases involving teenage girls living in the western part of the city
• Viral syndrome cases involving senior citizens from eastern part of city
• Botulinic syndrome cases
• Number of children from downtown hospital
• And so on…
You’ll need hundreds of univariate detectors!
We would like to identify the groups with the strangest behavior in recent events.
WSARE Approach
• Rule-Based Anomaly Pattern Detection
• Association rules are used to characterize anomalous patterns. For example, a two-component rule would be:
  Gender = Male AND 40 ≤ Age < 50
WSARE v2.0 Overview
1. Obtain Recent and Baseline datasets (All Data is split into Recent Data and Baseline)
2. Search for the rule with the best score
3. Determine the p-value of the best-scoring rule through a randomization test
4. If the p-value is less than the threshold, signal an alert
Step 1: Obtain Recent and Baseline Data
Recent Data: data from the last 24 hours.
Baseline: baseline data is assumed to capture non-outbreak behavior. We use data from 35, 42, 49 and 56 days prior to the current day.
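The recent/baseline split described in Step 1 can be sketched as below. The record layout (a dict with a `"date"` key) and the function names are hypothetical, and "last 24 hours" is approximated here by "records dated today".

```python
from datetime import date, timedelta

# Offsets from the slide; all are multiples of 7, so every baseline day
# falls on the same weekday as the current day.
BASELINE_OFFSETS = (35, 42, 49, 56)


def baseline_dates(today, offsets=BASELINE_OFFSETS):
    """Dates whose cases form the baseline for `today`."""
    return [today - timedelta(days=k) for k in offsets]


def split_recent_baseline(records, today):
    """Split records into (recent, baseline) lists.

    ASSUMPTION: each record is a dict with a datetime.date under "date".
    """
    wanted = set(baseline_dates(today))
    recent = [r for r in records if r["date"] == today]
    baseline = [r for r in records if r["date"] in wanted]
    return recent, baseline


print(baseline_dates(date(2002, 7, 31)))
```

For a current day of July 31, 2002, the baseline days come out as June 26, June 19, June 12 and June 5, 2002.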
Example
Sat 12-23-2001
35.8% (48/134) of today's cases have 30 <= age < 40
17.0% (45/265) of other (baseline) cases have 30 <= age < 40
Step 2. Search for Best Rule
For each rule, form a 2x2 contingency table, e.g.

                  Count_Recent  Count_Baseline
Age Decile = 3    48            45
Age Decile ≠ 3    86            220

• Perform Fisher’s Exact Test to get a p-value (score) for each rule (for this data, 0.00005)
• Find the rule R_BEST with the lowest score.
• Caution: This score is not the true p-value of R_BEST because of multiple tests
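Scoring one rule with Fisher's exact test can be sketched in pure Python as follows. This uses the common two-sided convention (sum the probabilities of all tables no more likely than the observed one); whether WSARE uses exactly this variant is an assumption, and the function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: rule matches / does not match.
    Columns: recent count / baseline count.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x matching cases in the recent column.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as likely as the observed one;
    # the small factor guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(low, high + 1))
               if p <= p_observed * (1 + 1e-7))


# The Age Decile = 3 table from the slide.
score = fisher_exact_two_sided(48, 45, 86, 220)
print(f"score = {score:.6f}")
```

The lower the score, the more the rule's share of recent cases differs from its share of baseline cases.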
Step 3: Randomization Test
Case  Original date    Shuffled date
C2    June 4, 2002     June 4, 2002
C3    June 5, 2002     June 12, 2002
C4    June 12, 2002    July 31, 2002
C5    June 19, 2002    June 26, 2002
C6    June 26, 2002    July 31, 2002
C7    June 26, 2002    June 5, 2002
C8    July 2, 2002     July 2, 2002
C9    July 3, 2002     July 3, 2002
C10   July 10, 2002    July 10, 2002
C11   July 17, 2002    July 17, 2002
C12   July 24, 2002    July 24, 2002
C13   July 30, 2002    July 30, 2002
C14   July 31, 2002    June 19, 2002
C15   July 31, 2002    June 26, 2002

• Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DBRand.
• Find the rule with the best score on DBRand.
Step 3: Randomization Test
Repeat the procedure on the
previous slide for 1000
iterations. Determine how
many scores from the 1000
iterations are better than the
original score.
If the original score were here, it would
place in the top 1% of the 1000 scores from
the randomization test. We would be
impressed and an alert should be raised.
Estimated p-value of the rule is:
# better scores / # iterations
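The randomization test can be sketched as below, under simplifying assumptions: only one-component rules of the form attribute = value are searched, records are plain dicts, and shuffling the recent/baseline labels stands in for shuffling the date field. All names are illustrative.

```python
import random
from math import comb


def fisher_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)
    probs = [comb(r1, x) * comb(r2, c1 - x) / denom
             for x in range(max(0, c1 - r2), min(r1, c1) + 1)]
    p_obs = comb(r1, a) * comb(r2, c1 - a) / denom
    return sum(p for p in probs if p <= p_obs * (1 + 1e-7))


def best_rule_score(records, is_recent):
    """Lowest Fisher score over all one-component rules attribute = value."""
    n_recent = sum(is_recent)
    n_base = len(records) - n_recent
    counts = {}  # (attribute, value) -> [recent count, baseline count]
    for rec, recent in zip(records, is_recent):
        for item in rec.items():
            counts.setdefault(item, [0, 0])[0 if recent else 1] += 1
    return min(fisher_p(r, b, n_recent - r, n_base - b)
               for r, b in counts.values())


def randomization_p_value(records, is_recent, iterations=1000, seed=0):
    """Estimate the p-value as (# better scores) / (# iterations)."""
    rng = random.Random(seed)
    original = best_rule_score(records, is_recent)
    labels = list(is_recent)
    better = 0
    for _ in range(iterations):
        rng.shuffle(labels)  # plays the role of DBRand
        if best_rule_score(records, labels) < original:
            better += 1
    return better / iterations
```

Because the best rule is re-searched on every shuffled dataset, the estimate accounts for the multiple tests that make the raw Fisher score too optimistic.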
Results on Actual ED Data from 2001
1. Sat 2001-02-13: SCORE = -0.00000004 PVALUE = 0.00000000
14.80% ( 74/500) of today's cases have Viral Syndrome = True and Encephalitic Prodrome = False
7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
2. Sat 2001-03-13: SCORE = -0.00000464 PVALUE = 0.00000000
12.42% ( 58/467) of today's cases have Respiratory Syndrome = True
6.53% (653/10000) of baseline have Respiratory Syndrome = True
3. Wed 2001-06-30: SCORE = -0.00000013 PVALUE = 0.00000000
1.44% ( 9/625) of today's cases have 100 <= Age < 110
0.08% ( 8/10000) of baseline have 100 <= Age < 110
4. Sun 2001-08-08: SCORE = -0.00000007 PVALUE = 0.00000000
83.80% (481/574) of today's cases have Unknown Syndrome = False
74.29% (7430/10001) of baseline have Unknown Syndrome = False
5. Thu 2001-12-02: SCORE = -0.00000087 PVALUE = 0.00000000
14.71% ( 70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
WSARE 3.0: Improving the Baseline
Recall that the baseline was assumed to be captured
by data that was from 35, 42, 49, and 56 days prior
to the current day.
What if this assumption isn’t true?
What if data from 7, 14, 21 and 28
days prior is better?
We would like to determine the baseline
automatically!
Temporal Trends
From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E.
(2002). Early statistical detection of anthrax outbreaks by tracking
over-the-counter medication sales. Proceedings of the National
Academy of Sciences (pp. 5237-5240)
WSARE v3.0
Generate the baseline…
• “Taking into account recent flu levels…”
• “Taking into account that today is a public holiday…”
• “Taking into account that this is Spring…”
• “Taking into account the recent heatwave…”
• “Taking into account that there’s a known natural foodborne outbreak in progress…”
Bonus: More efficient use of historical data
Idea: Bayesian Networks
Bayesian Network: A graphical model representing the joint probability distribution of a set of random variables
“Patients from West Park Hospital are less likely to be young”
“On cold Tuesday mornings the folks coming in from the north part of the city are more likely to have respiratory problems”
“On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon”
“The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic”
Obtaining Baseline Data
1. Learn a Bayesian network from all historical data.
2. Generate the baseline given today’s environment, i.e. what should be happening today given today’s environment.
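The two steps can be illustrated with a deliberately simplified stand-in for a learned Bayesian network: a fixed structure in which every case attribute depends only on the environment attributes, with probabilities estimated by counting. The structure, function names and record layout are all assumptions; a real system would learn the network structure from the data.

```python
import random
from collections import Counter, defaultdict


def learn_conditionals(history, env_attrs, response_attrs):
    """Step 1 (simplified): estimate P(attribute | environment) by counting."""
    tables = {attr: defaultdict(Counter) for attr in response_attrs}
    for rec in history:
        env = tuple(rec[e] for e in env_attrs)
        for attr in response_attrs:
            tables[attr][env][rec[attr]] += 1
    return tables


def sample_baseline(tables, todays_env, n, seed=0):
    """Step 2: draw n baseline records conditioned on today's environment."""
    rng = random.Random(seed)
    baseline = []
    for _ in range(n):
        rec = {}
        for attr, by_env in tables.items():
            counts = by_env.get(todays_env)
            if not counts:
                # This counting sketch cannot generalize to unseen environments.
                raise ValueError(f"environment {todays_env!r} never observed")
            values = list(counts)
            weights = [counts[v] for v in values]
            rec[attr] = rng.choices(values, weights=weights, k=1)[0]
        baseline.append(rec)
    return baseline
```

The sampled records then take the place of the 35/42/49/56-day baseline when rules are scored against today's cases.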
Simulation
[Figure: Bayesian network used to generate the simulated data. Nodes: FLU LEVEL, DAY OF WEEK, AGE, GENDER, DATE, REGION, SEASON, WEATHER, Outside Activity, Immune System, Heart Health, Region Anthrax Concentration, Region Grassiness, Region Food Condition, Has Flu, Has Allergy, Has Food Poisoning, Has Anthrax, Has Sunburn, Has Cold, Has Heart Attack, Disease, Actual Symptom, REPORTED SYMPTOM, ACTION, DRUG]
Actions: None, Purchase Medication, ED visit, Absent. If Action is not None, output record to dataset.
Simulation
• 100 different data sets
• Each data set consisted of a two-year period
• Anthrax release occurred at a random point during the second year
• Algorithms were allowed to train on data from the current day back to the first day in the simulation
• Any alert before the actual anthrax release is considered a false positive
• Detection time is calculated as the time of the first alert after the anthrax release. If no alerts are raised, detection time is capped at 14 days
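The evaluation protocol above can be sketched as a small scoring function. Days are represented as integer day indices within one simulated data set; treating an alert on the release day itself as a detection, and also capping late alerts at 14 days, are assumptions not stated on the slide.

```python
def evaluate_alerts(alert_days, release_day, cap_days=14):
    """Return (false positive count, detection time in days).

    alert_days: day indices on which the algorithm raised an alert.
    release_day: day index of the simulated anthrax release.
    """
    false_positives = sum(1 for d in alert_days if d < release_day)
    after = [d for d in alert_days if d >= release_day]
    if after:
        detection_time = min(min(after) - release_day, cap_days)
    else:
        detection_time = cap_days  # no alert raised: use the cap
    return false_positives, detection_time


# Illustrative day indices only.
print(evaluate_alerts([100, 400, 405], release_day=402))  # (2, 3)
print(evaluate_alerts([], release_day=402))               # (0, 14)
```

Averaging both numbers over the 100 data sets, for a range of alert thresholds, gives the trade-off between false positives and detection time.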
Simulation Plot
[Figure: simulation plot; annotation: anthrax release (not highest peak)]
Results on Simulation
Summary
• Summarization of what is new and interesting
• Key ideas
  • search many possible findings
  • compare to past data and expected data
  • avoid overfitting
  • focus on actionable changes
• Example systems
  • KEFIR (GTE, 1992-1995)
  • WSARE (CMU/Pitt, 2002-2003)