Summarization and Deviation Detection: What is New?

Outline
- Summarization
- KEFIR: Key Findings Reporter
- WSARE: What's Strange About Recent Events

What is New?
Old data, new data

Summarization
Concisely summarize what is new and different, unexpected
- with respect to previous values
- with respect to expected values
- …
Focus on what is actionable!

Problem: Healthcare Costs
- Healthcare costs in US: 1 out of 7 GDP $ and rising
- Potential problems: fraud, misuse, …
- Understanding where the problems are is the first step to fixing them
- GTE is self-insured for medical costs
- GTE healthcare costs: $X00,000,000
Task: Analyze employee health care data and generate a report that describes the major problems

GTE Key Findings Reporter: KEFIR
KEFIR Approach:
- Analyze all possible deviations
- Select interesting findings
- Augment key findings with:
  - Explanations of plausible causes
  - Recommendations of appropriate actions
- Convert findings to a user-friendly report with text and graphics

KEFIR Search Space

Drill-Down Example

What Change Is Important?

Deviation Detection
- Drill down through the search space
- Generate a finding for each measure:
  - deviation from previous period
  - deviation from norm
  - deviation projected for next period, if no action

Interestingness of Deviations
- Impact: how much the deviation affects the bottom line
- Savings Percentage: how much of the deviation from the norm can be expected to be saved by the action

Recommendations
Hierarchical recommendation rules define appropriate intervention strategies for important measures and study areas.
Example:
  If    measure = admission rate per 1000
        & study_area = Inpatient admissions
        & percent_change > 0.10
  Then  Utilization review is needed in the area of admission certification.
        Expected Savings: 20%

Explanation
A measure is explained by finding the path of related measures with the highest impact.
Example: "The large increase in m1 in group s1 was caused by an increase in m3, which was caused by a rise in m5, primarily in sector s13."
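The example recommendation rule and the two interestingness ingredients (impact and savings percentage) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not KEFIR's implementation: the `Finding` class, the `recommend` and `interestingness` functions, the dollar figure in the usage lines, and the choice to score a finding as impact times savings percentage are all hypothetical. Only the rule conditions and the 20% expected savings come from the slide.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Finding:
    """One deviation found while drilling down (hypothetical structure)."""
    measure: str
    study_area: str
    percent_change: float  # relative change, e.g. 0.15 means +15%
    impact: float          # effect of the deviation on the bottom line, in dollars


def recommend(f: Finding) -> Optional[Tuple[str, float]]:
    """Return (recommendation text, expected savings fraction), or None.

    Encodes only the single example rule shown on the slide.
    """
    if (f.measure == "admission rate per 1000"
            and f.study_area == "Inpatient admissions"
            and f.percent_change > 0.10):
        return ("Utilization review is needed in the area of "
                "admission certification.", 0.20)
    return None


def interestingness(f: Finding) -> float:
    """ASSUMPTION: score a finding as impact * savings percentage.

    Findings with no matching recommendation rule score 0.
    """
    rec = recommend(f)
    if rec is None:
        return 0.0
    _, savings_pct = rec
    return f.impact * savings_pct


# Usage with made-up numbers (not from the GTE data).
finding = Finding("admission rate per 1000", "Inpatient admissions",
                  percent_change=0.15, impact=1_000_000.0)
print(recommend(finding))
print(interestingness(finding))
```

Ranking findings by a single dollar-denominated score is one simple way to keep the report focused on what is actionable; a real system would hold many such rules, organized hierarchically by measure and study area.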
Report Generation
- Automatic generation of business-user-oriented reports
- Natural language generation with template matching
- Graphics delivered via browser

Sample KEFIR pages
- Overview
- Inpatient admissions

Status
- Prototype implemented in GTE in 1995
- KEFIR received GTE's highest award for technical achievement in 1995
- Key business user left GTE in 1996 and the system was no longer used
- Publication: Selecting and Reporting What is Interesting: The KEFIR Application to Healthcare Data, C. Matheus, G. Piatetsky-Shapiro, and D. McNeill, in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996

What's Strange About Recent Events (WSARE)
Weng-Keen Wong (Carnegie Mellon University)
Andrew Moore (Carnegie Mellon University)
Gregory Cooper (University of Pittsburgh)
Michael Wagner (University of Pittsburgh)
http://www.autonlab.org/wsare
Designed to be easily applicable to any date/time-indexed biosurveillance-relevant data stream

Motivation
Suppose we have access to Emergency Department data from hospitals around a city (with patient confidentiality preserved)

Primary Key  Date    Time   Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more…
100          6/1/03  9:12   1         781   Fever        M       20s  NE             ?              …
101          6/1/03  10:45  1         787   Diarrhea     F       40s  NE             NE             …
102          6/1/03  11:03  1         786   Respiratory  F       60s  NE             N              …
103          6/1/03  11:07  2         787   Diarrhea     M       60s  E              ?              …
104          6/1/03  12:15  1         717   Respiratory  M       60s  E              NE             …
105          6/1/03  13:01  3         780   Viral        F       50s  ?              NW             …
106          6/1/03  13:05  3         487   Respiratory  F       40s  SW             SW             …
107          6/1/03  13:57  2         786   Unmapped     M       50s  SE             SW             …
108          6/1/03  14:22  1         780   Viral        M       40s  ?              ?              …
:            :       :      :         :     :            :       :    :              :              :

Traditional Approaches
We need to build a univariate detector to monitor each interesting combination of attributes:
- Number of cases involving people working in southern part of the city
- Diarrhea cases among children
- Respiratory syndrome cases among females
- Number of cases involving teenage girls living in the western part of the city
- Viral syndrome cases involving senior citizens from eastern part of city
- Botulinic syndrome cases
- Number of children from downtown hospital
- And so on…
You'll need hundreds of univariate detectors!
We would like to identify the groups with the strangest behavior in recent events.

WSARE Approach
Rule-Based Anomaly Pattern Detection
Association rules are used to characterize anomalous patterns. For example, a two-component rule would be:
  Gender = Male AND 40 <= Age < 50

WSARE v2.0 Overview
1. Obtain Recent and Baseline datasets from All Data
2. Search for rule with best score
3. Determine p-value of best scoring rule through randomization test
4. If p-value is less than threshold, signal alert

Step 1: Obtain Recent and Baseline Data
- Recent Data: data from the last 24 hours
- Baseline: baseline data is assumed to capture non-outbreak behavior. We use data from 35, 42, 49 and 56 days prior to the current day

Example
Sat 12-23-2001
35.8% (48/134) of today's cases have 30 <= age < 40
17.0% (45/265) of other (baseline) cases have 30 <= age < 40

Step 2: Search for Best Rule
For each rule, form a 2x2 contingency table, e.g.

                 Count_Recent  Count_Baseline
Age Decile = 3   48            45
Age Decile ≠ 3   86            220

- Perform Fisher's Exact Test to get a p-value (score) for each rule (for this data 0.00005)
- Find rule R-best with the lowest score.
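As a concrete illustration of Step 2, here is a minimal pure-Python sketch of scoring a rule with Fisher's Exact Test on its 2x2 table and picking the lowest-scoring rule. It is not the WSARE code: the function names are hypothetical, the two-sided convention (sum the probabilities of all tables no more likely than the observed one) is an assumption since the slides do not say which variant is used, and the second rule's counts are made up purely to give the search something to compare against.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: rule matches / rule does not match.
    Columns: recent count / baseline count.
    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # P(top-left cell == x) with all margins held fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


def best_rule(tables):
    """Return (rule, score) for the rule with the lowest Fisher score."""
    scores = {rule: fisher_exact_two_sided(*t) for rule, t in tables.items()}
    rule = min(scores, key=scores.get)
    return rule, scores[rule]


tables = {
    # Counts from the slide: 48/134 recent vs. 45/265 baseline.
    "Age Decile = 3": (48, 45, 86, 220),
    # MADE-UP counts for illustration only (same column totals).
    "Gender = Male": (70, 135, 64, 130),
}
print(best_rule(tables))
```

The slide reports a score of 0.00005 for the age-decile table; the value printed by this sketch may differ slightly depending on the one-sided or two-sided convention.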
Caution: This score is not the true p-value of R-best because of multiple tests

Step 3: Randomization Test

Case  Original date   Shuffled date
C2    June 4, 2002    June 4, 2002
C3    June 5, 2002    June 12, 2002
C4    June 12, 2002   July 31, 2002
C5    June 19, 2002   June 26, 2002
C6    June 26, 2002   July 31, 2002
C7    June 26, 2002   June 5, 2002
C8    July 2, 2002    July 2, 2002
C9    July 3, 2002    July 3, 2002
C10   July 10, 2002   July 10, 2002
C11   July 17, 2002   July 17, 2002
C12   July 24, 2002   July 24, 2002
C13   July 30, 2002   July 30, 2002
C14   July 31, 2002   June 19, 2002
C15   July 31, 2002   June 26, 2002

- Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DBRand
- Find the rule with the best score on DBRand.

Step 3: Randomization Test (continued)
- Repeat the procedure above for 1000 iterations.
- Determine how many scores from the 1000 iterations are better than the original score.
- If the original score placed in the top 1% of the 1000 scores from the randomization test, we would be impressed and an alert should be raised.
- Estimated p-value of the rule is: # better scores / # iterations

Results on Actual ED Data from 2001
1. Sat 2001-02-13: SCORE = -0.00000004 PVALUE = 0.00000000
   14.80% ( 74/500) of today's cases have Viral Syndrome = True and Encephalitic Prodrome = False
   7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
2. Sat 2001-03-13: SCORE = -0.00000464 PVALUE = 0.00000000
   12.42% ( 58/467) of today's cases have Respiratory Syndrome = True
   6.53% (653/10000) of baseline have Respiratory Syndrome = True
3. Wed 2001-06-30: SCORE = -0.00000013 PVALUE = 0.00000000
   1.44% ( 9/625) of today's cases have 100 <= Age < 110
   0.08% ( 8/10000) of baseline have 100 <= Age < 110
4. Sun 2001-08-08: SCORE = -0.00000007 PVALUE = 0.00000000
   83.80% (481/574) of today's cases have Unknown Syndrome = False
   74.29% (7430/10001) of baseline have Unknown Syndrome = False
5. Thu 2001-12-02: SCORE = -0.00000087 PVALUE = 0.00000000
   14.71% ( 70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
   7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False

WSARE 3.0: Improving the Baseline
- Recall that the baseline was assumed to be captured by data from 35, 42, 49, and 56 days prior to the current day.
- What if this assumption isn't true? What if data from 7, 14, 21 and 28 days prior is better?
- We would like to determine the baseline automatically!

Temporal Trends
From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp. 5237-5249)

WSARE v3.0
Generate the baseline…
- "Taking into account recent flu levels…"
- "Taking into account that today is a public holiday…"
- "Taking into account that this is Spring…"
- "Taking into account recent heatwave…"
- "Taking into account that there's a known natural Foodborne outbreak in progress…"
Bonus: More efficient use of historical data

Idea: Bayesian Networks
Bayesian Network: A graphical model representing the joint probability distribution of a set of random variables
- "Patients from West Park Hospital are less likely to be young"
- "On Cold Tuesday Mornings the folks coming in from the North part of the city are more likely to have respiratory problems"
- "On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon"
- "The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic"

Obtaining Baseline Data
1. Learn Bayesian Network from All Historical Data
2. Generate baseline given today's environment
Baseline: what should be happening today given today's environment

Simulation
[Figure: simulator variables. Nodes: FLU LEVEL, DAY OF WEEK, AGE, GENDER, DATE, REGION, SEASON, WEATHER, Outside Activity, Immune System, Heart Health, Region Anthrax Concentration, Region Grassiness, Region Food Condition, Has Flu, Has Allergy, Has Food Poisoning, Has Anthrax, Has Sunburn, Has Cold, Has Heart Attack, Disease, Actual Symptom, REPORTED SYMPTOM, ACTION, DRUG]
Actions: None, Purchase Medication, ED visit, Absent. If Action is not None, output record to dataset.

Simulation
- 100 different data sets
- Each data set consisted of a two-year period
- Anthrax release occurred at a random point during the second year
- Algorithms allowed to train on data from the current day back to the first day in the simulation
- Any alerts before the actual anthrax release are considered false positives
- Detection time calculated as first alert after anthrax release. If no alerts raised, cap detection time at 14 days

Simulation Plot
[Figure annotation: Anthrax release (not highest peak)]

Results on Simulation

Summary
- Summarization of what is new and interesting
- Key ideas:
  - search many possible findings
  - compare to past data and expected data
  - avoid overfitting
  - focus on actionable changes
- Example systems:
  - KEFIR (GTE, 1992-1995)
  - WSARE (CMU/Pitt, 2002-3)