Developments in the Roles, Features, and Evaluation of Alerting Algorithms for Disease Outbreak Monitoring

Howard S. Burkom, Yevgeniy Elbert, Steven F. Magruder, Amir H. Najmi, William Peter, and Michael W. Thompson

Automated systems for public health surveillance have
evolved over the past several years as national and local
institutions have been learning the most effective ways
to share and apply population health information for outbreak investigation and
tracking. The changes have included developments in algorithmic alerting methodology. This article presents research efforts at The Johns Hopkins University Applied
Physics Laboratory for advancing this methodology. The analytic methods presented
cover outcome variable selection, background estimation, determination of anomalies for alerting, and practical evaluation of detection performance. The methods and
measures are adapted from information theory, signal processing, financial forecasting, and radar engineering for effective use in the biosurveillance data environment.
Examples are restricted to univariate algorithms for daily time series of syndromic
data, with discussion of future generalization and enhancement.
Evolving Role of Alerting Algorithms in Health
Surveillance
Research has been ongoing to adapt and implement
health surveillance algorithms for decades and has
accelerated since the late 1990s because of concern over
the threat of a clandestine bioterrorist attack.1 Methods
have been imported from many disciplines in which
prospective signal detection has been applied, and
some experience has been gained. Successes have been
limited because of the rarity of large-scale outbreaks,
the lack of documentation of community-level ones,
and the lack of consensus over which data effects are
outbreak signals and which should be considered background noise. However, the application continues to
grow more important and more difficult because of
advances in public health informatics and the increased
perception of natural disease threats, such as pandemic
influenza, along with bioterrorism concerns. More and
increasingly complex data streams are available to health
monitors, methodologies for monitoring them need to
keep pace, and automated algorithms are needed to direct
the attention of investigators to potential problems amid
this ocean of information.
Expert knowledge—a combination of local background data experience, medical knowledge, and understanding of population behavior—has not proven sufficient to bridge the gap between statistical and epidemiological significance of algorithmic alerts. The use of
automated biosurveillance systems has thus produced
false-alarm problems,2 and most of the reported benefit
of these systems has been in combination with other surveillance tools, as in corroboration of medical suspicions
and a broadening of situational awareness. Current algorithm research is addressing how to improve detection
performance for greater specificity and broader utility.
Figure 1 sketches the roles of alerting algorithms in a
biosurveillance system, with elements of routine operations on the right and design and evaluation functions
on the left.
The algorithmic challenges are as follows:
• To capture the best datasets for making effects of an
outbreak stand out as much as possible from usual
data behavior.
• To filter the records to produce time series for additional outbreak discrimination, i.e., for maximizing
the signal-to-noise ratio.3
• To develop and apply robust, sensitive algorithms
tuned to these time series to identify the signal as early
as possible. Key components of these algorithms are
prediction and anomaly determination.4 Prediction
is important because unusual data behavior cannot
be recognized without estimates of usual behavior.
Systematic time series effects such as seasonal cycles,
day-of-week usage patterns, and regular clinic closings are common features in syndromic time series.
Various forecasting approaches have been applied to
remove expected trends and outliers so that the control charts could be applied to the forecast residuals to
determine when to alert.
To provide perspective on development of an entire
biosurveillance system, the right half of Fig. 1 shows the
additional challenges of data acquisition, cleaning, and
transfer at the top and the challenge of useful visualization for human interpretation at the bottom. The design
and maintenance of the data chain are especially important; the utility of algorithms, output products, and
interfaces depends on the prompt availability of secure
data at the required level of detail.
The remaining sections of this article discuss our
recent efforts on algorithmic subtasks depicted in Fig.
1 and applied for the ongoing development of the Electronic Surveillance System for the Early Notification
of Community-based Epidemics (ESSENCE). For more
Figure 1. Components of biosurveillance systems, including the routine algorithmic alerting process. Routine surveillance operations include data source evaluation for surveillance utility; data capture, cleaning, and transmission; filtering for outcome variable selection; and visualization for epidemiological investigation. The domain of algorithm research and development covers background estimation, anomaly determination, and detection performance evaluation.
background information on ESSENCE, see the article on
biosurveillance research and policy tradeoffs by Burkom
et al. elsewhere in this issue. These sections present
• The challenge of outcome variable selection, featuring a discussion of mutual information and its application to a surveillance time series and a reference series to determine whether their quotient, such as the division of daily syndromic counts by total daily facility visits to estimate a population-level rate, can improve the signal-to-noise ratio
• An adaptive, recursive least-squares (RLS) algorithm
tailored for univariate, city-level syndromic time
series with applications for multivariate analysis
• Improvements in implemented algorithms previously
reported and a strategy for replacing them with a
more data-adaptive approach
Although detailed material in this article is restricted
to univariate algorithms, multivariate alerting methods
also are conservatively applied in ESSENCE systems,
and their broader application is an active research area
whose niche in surveillance practice is still to be determined. The final section discusses research challenges
of current interest for both univariate and multivariate
detection methods.
Use of Mutual Information for Improved
Signal-to-Noise Ratio
Monitoring in the Absence of a Static Baseline
Most surveillance systems are vulnerable to dramatic
and unpredictable shifts in the monitored health care
data3 resulting from changes in data collection methods, diagnosis coding, participating clinical facilities,
insurance eligibility, and similar factors. Reis et al.4
have noted that anomaly detection methods in most
surveillance systems are not robust to shifts in health
care utilization because they cannot adjust quickly to
changing baselines, so that utilization shifts may trigger false alarms. As a result, the effects of public health
crises and major public events may undermine health
surveillance systems at the very times they are needed
most.4 Baseline changes may trigger irrelevant alarms
and mask signals caused by population health events of
concern. For example, a sudden jump in influenza-like
illness diagnoses in a population might be caused by an
increase in the size or perhaps insurance eligibility of
the monitored population and not by a real disease outbreak. In addition, the total number of medical facility
visits could rise because of the addition of a hospital into
the network, but the hospital might not have a clinic
treating the syndrome of interest.3
Avoiding false alerts in syndromic monitoring by
comparing data to a baseline population is made difficult by the lack of a stable data background. True incidence rates, classically measured by the number of new
cases in a given time period divided by the population
size,5 can rarely be calculated in syndromic surveillance
data because catchment area sizes for the data sources
typically are unavailable. These data sources include
records of deidentified and filtered clinical encounter
records, pharmaceutical prescriptions, or selected over-the-counter remedy purchases. Although it is possible
to estimate the population in a particular geographical catchment area, this estimate may not be useful as
a denominator for proportion monitoring because it
cannot include patient preferences, temporary closures,
or sales promotions in pharmacies or diagnostic laboratories or large transient population inflow or outflow
caused by long holiday weekends or large public events
(political conventions, the Super Bowl, etc.). The number
of people served by a military hospital is particularly
difficult to estimate because of the transitory nature of
military lifestyles.
Syndromic Data Categorization
In biosurveillance, a syndrome grouping is defined
as a filtering of clinical prediagnostic records
designed to clarify the data signal resulting from outbreaks of certain disease types.6 Clinical encounter data
compiled and formatted by systems such as ESSENCE7
are cleaned to remove duplicate, incomplete, or ambiguous entries. Before records are available for analysis, data
fields with personal identifiers are removed or altered to
comply with privacy requirements while enough is made
available for accurate filtering and analysis. The data
streams available for alerting algorithms typically consist of these cleaned records and may come from various
sources (e.g., syndromic hospital visit counts and school
absentee rates).
Challenge of Minimizing False Alerts While Maintaining
Signal-to-Noise Ratio
One approach to reducing false alarms in syndromic
data streams is to approximate true rates by replacing
daily filtered diagnosis counts with the quotient of these
counts by some reference series, such as the total of all
daily visits. The Centers for Disease Control and Prevention (CDC) uses this approach for seasonal influenza surveillance by monitoring weekly percentages of
influenza-like illness submitted by participating sentinel
physicians. The rationale for monitoring ratios instead
of counts is that general health care utilization shifts
and other data features irrelevant to disease surveillance
often affect both numerator and denominator and thus
are reduced or eliminated in the ratio, thereby lowering the false-alarm rate. Thus, changes in the syndromic data stream (the ratio’s numerator, denoted below
as the “target” data stream4) relative to a changing
background (the denominator, denoted the “context”
stream) are drawn out, and these changes are presumably more likely to indicate genuine increases in the
syndrome group of interest.
An example of the signal-to-noise ratio increase
obtained by using proportions instead of counts is illustrated in Fig. 2. Two plots are shown in the upper half of
the figure: the red line is a plot of the daily total number
of visits recorded in a health care facility, and the blue
line shows counts of only those visits classified in the
respiratory syndrome. A small simulated outbreak is
indicated in the count series. The outbreak signal is
much more distinct in the time series of ratios than in
the series of unchanged counts.
The purpose of the work described here is to provide
the epidemiologist or system designer with a mathematical tool to decide whether to use individual counts or
to use proportions, and which proportions, when monitoring daily surveillance data. The approach herein
complements the proportional model networks of Reis
et al.4 and may be used to decide which target/context
ratios should be monitored in a multisource network,
thus reducing the network size and computational
complexity.
Technical Approach for Evaluating Ratios for Signal-to-Noise
Discrimination
The approach described here to solve the “denominator problem” is to use information-theoretic techniques
to determine under what conditions proportions are
preferable to counts and whether ratios formed by using
a particular reference series are more effective than
simple counts for anomaly detection. Given two data
streams X = {x(t)} and Y = {y(t)}, their mutual information I(X; Y) provides a general measure of their interdependency.

Figure 2. Daily counts of respiratory syndrome visits and of all diagnosis visits (upper) and the resulting syndrome proportion with a simulated outbreak signal indicated (lower). When a denominator variable (or context stream) such as total diagnosis count is available, a proportion sometimes can clarify an outbreak signal.

Let X have a probability density function (pdf) given by p(x), and let p(y) be the pdf of Y. If their joint
pdf is denoted by p(x, y), then the mutual information I(X; Y) between X and Y is defined formally as8

I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\!\left(\frac{p(x,y)}{p(x)\,p(y)}\right) .  (1)
Note that the mutual information also can be written as

I(X;Y) = H(X) + H(Y) - H(X,Y) ,  (2)

where H(X) is the usual Shannon entropy of a single variable defined by

H(X) = -\sum_{x} p(x)\,\log p(x) ,  (3)

and the joint entropy H(X, Y) is given by8

H(X,Y) = -\sum_{x}\sum_{y} p(x,y)\,\log p(x,y) .  (4)

Mutual Information Estimation
Mutual information (MI) is a general measure of
dependency in data: a nonzero MI indicates that two
data streams X and Y share some kind of dependence,
linear or not. We use MI to decide which data streams
are more appropriate for use as a context stream to a
given target stream. In essence, we are designing a noise
filter by using a context channel to remove the noise
from the target channel. MI is a nonlinear generalization
of the well-known linear dependency summary statistic,
the Pearson linear correlation. The advantages of using
MI over Pearson correlation as a summary statistic can
be seen by the following example. Let a random variable
X be uniformly distributed on the interval from −1 to 1,
and let Y = X². Then Y is completely determined by X, so
that X and Y are dependent, but their correlation is zero;
they are uncorrelated. The mutual information between
X and Y, however, is significant.
MI not only provides a general measure of dependency between two variables, but it also has the important feature (for biosurveillance applications) of being
robust to missing data values and has been shown to be
advantageous in analyzing datasets in which up to 25%
of the values are missing.9 For these reasons, the mutual
information metric has been used to analyze dependency
in such diverse fields as bioinformatics,9–11 physics,12,13
and ecology.14
Calculating precise values for MI is nontrivial, and the accurate estimation of mutual information has been discussed by Kraskov et al.15 and Steuer et al.16 The most straightforward technique is to use a histogram-based procedure.11 In this method, a bivariate histogram is used to approximate the joint pdf of the variables. The use of histograms requires an appropriate choice of binning strategy to obtain an accurate but economical estimate of the joint pdf. Some popular binning strategies are those of Sturges17 (lower bound on number of bins k ~ 1 + log_2(N), where N is the data size), Law and Kelton18 (upper bound on number of bins k ~ N^{1/2}), and Scott19 (bin width h = 3.5s/N^{1/3}, where s is the sample standard deviation).
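As a concrete illustration of the histogram-based procedure, the following Python sketch estimates MI between two data streams using Sturges binning, Eq. 2, and the arithmetic-mean normalization discussed below. The function name, the base-2 logarithm, and the normalization default are illustrative choices for this sketch, not details of the cited implementations.

import numpy as np

def mutual_information(x, y, normalize=True):
    """Histogram-based MI estimate: a minimal sketch, not the
    implementation used in the study."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    k = int(np.ceil(1 + np.log2(len(x))))      # Sturges bin count
    pxy, _, _ = np.histogram2d(x, y, bins=k)   # bivariate histogram
    pxy /= pxy.sum()                           # joint pdf estimate
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)  # marginal pdfs

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    hx, hy, hxy = entropy(px), entropy(py), entropy(pxy.ravel())
    mi = hx + hy - hxy                         # Eq. 2
    if normalize and hx + hy > 0:              # arithmetic-mean normalization
        mi = 2.0 * mi / (hx + hy)
    return mi

Applied to the X, Y = X² example above, the Pearson correlation is near zero while this MI estimate is substantial.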
Other authors have chosen to circumvent the histogram problem by using different techniques to estimate
the joint pdf. These include adaptive binning,12 k nearest-neighbor statistics,15 kernel density estimators,20 and
B-spline functions.10 We have compared some of the different methods of calculating mutual information and
found little relative difference in the MI rankings for
pairs of variables in a dataset. The absolute magnitudes of
I(x; y) for the same pair of variables, however, did change
with the particular algorithm employed (depending on
the accuracy of the approximation method and on the
normalization used).
As seen from Eq. 2, mutual information is bounded above by the sum H(X) + H(Y), so that if this sum is small, I(X; Y) can be small even if X and Y are completely correlated. To obtain a dependency measure whose magnitude is meaningful, some authors normalize MI so that its maximum value is unity if X and Y are completely correlated, similar to the behavior of the correlation coefficient ρ. Normalization techniques include division of I(X; Y) by the arithmetic mean21 (H(X) + H(Y))/2, by the maximal marginal entropy of the considered dataset,22 or by the geometric mean (H(X)H(Y))^{1/2} of the individual entropies.23
The mutual information here was calculated in two
ways: the first by using a histogram method with a binning procedure following Priness et al.11 and the second
by using the B-spline method outlined by Daub et al.10
When using the binning technique, we found it convenient to use an arithmetic mean normalization.19 The
B-spline calculations were performed by using the source
code in C++ made freely available by Daub et al.10 for
noncommercial use. The choice of mutual information
estimation algorithm did not significantly change the
conclusions of our study, so the MI calculations shown
below will be those calculated with the B-spline algorithm, whose source code is easily obtained.10
Testing Benefits of Mutual Information-Based
Discrimination Using Simulation
Based on the discussion above, a series of Monte
Carlo simulations were conducted on time series of
deidentified, syndromic outpatient visit counts made
available for ESSENCE research. The simulations
were implemented by introducing simulated lognormal
spikes in the daily counts of the health time series to
simulate outbreak data effects. In a representative alerting algorithm used for evaluation in this study, series
z-score values above a given threshold were considered
to be alerts, where a z score was obtained by subtracting
the sliding baseline mean from the current count and
dividing by the baseline standard deviation. This simple
detector was implemented over a large number of simulations with varying thresholds and spike magnitudes to
compare the probability of detection in (i) daily counts in a single-syndrome data stream and (ii) daily values formed as the ratio of this single-syndrome data stream to other syndromic data streams (including the total daily visit count).
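The simple detector can be sketched as follows; the 28-day window and the threshold are illustrative defaults for this sketch, not parameters fixed by the study.

import numpy as np

def z_score_alerts(counts, baseline_len=28, threshold=3.0):
    """Sliding-baseline z-score detector: subtract the baseline mean
    from the current count and divide by the baseline standard
    deviation; flag days exceeding the threshold."""
    y = np.asarray(counts, float)
    alerts = []
    for t in range(baseline_len, len(y)):
        base = y[t - baseline_len : t]
        sd = base.std(ddof=1)
        if sd > 0 and (y[t] - base.mean()) / sd > threshold:
            alerts.append(t)
    return alerts

The same detector applies unchanged to a proportion series, e.g., z_score_alerts(target / context), which enables the count-versus-ratio comparison described here.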
We illustrate typical simulation results using daily counts of rash syndrome visits over a 1402-day period. The time series of these rash syndrome counts is plotted in the upper graph of Fig. 3 and labeled R. A data stream constructed from a Poisson-distributed random time series with mean λ = 50 over the same time period is shown in the lower graph of Fig. 3 and denoted by P. Monte Carlo simulations then were performed on (i) the rash data stream R alone, (ii) the rash data stream R divided by the Poisson context data stream P (target/context average MI = 0.01), (iii) R divided by the context series P + R (target/context average MI = 0.06), (iv) R divided by the context P + 2R (target/context average MI = 0.18), (v) R divided by P + 3R (target/context average MI = 0.32), and (vi) R divided by P + 4R (target/context average MI = 0.44).
The results are summarized by the receiver operating characteristic (ROC) curve shown in Fig. 4. ROC
curves are plots of the probability of detection (PD) (the
fraction of true positives, an empirical sensitivity estimate) versus the probability of false alarm (PFA) (estimated as the fraction of threshold exceedences among
the unspiked background data) as the discrimination
threshold is varied.24 As the context time series C
included more of the original signal R (from C = P to
C = P + 4R), the average mutual information between
the target R and the context changed from 0.01 to 0.44.
From the plotted ROC curves, detection probabilities
increased almost uniformly with the target/context MI.
The improvement in signal-to-noise ratio achieved
by replacing count data with target/context pairs with
positive MI can be expressed directly by the quantity AD
= PD(T + C) – PD(T) at a given PFA value and threshold. The quantity PD(T) is the probability of detection
for the target series T alone, and PD(T + C) is the corresponding detection probability for the data stream using
the target and context in the ratio T/C. The quantity
AD will be called the “detection advantage,” because it
summarizes the improvement in alert detection by using
a proportion with the target/context pair (T, C) over
using the target T alone.
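A minimal sketch of how PD and the detection advantage can be estimated by Monte Carlo follows. The single-day additive spike (the study used lognormal spike shapes), the trial count, and the rescaling of the spike into proportion units by the mean context level are simplifying assumptions of this sketch.

import numpy as np

def empirical_pd(series, spike, threshold, baseline_len=28,
                 n_trials=500, rng=np.random.default_rng(0)):
    """Inject a spike on a random day and test the z-score detector
    on that day; the fraction of alerts estimates PD."""
    y = np.asarray(series, float)
    hits = 0
    for _ in range(n_trials):
        t = int(rng.integers(baseline_len, len(y)))
        base = y[t - baseline_len : t]
        sd = max(base.std(ddof=1), 1e-9)
        hits += (y[t] + spike - base.mean()) / sd > threshold
    return hits / n_trials

# Detection advantage A_D at a common threshold thr:
# pd_counts = empirical_pd(target, spike, thr)
# pd_ratio  = empirical_pd(target / context, spike / context.mean(), thr)
# A_D = pd_ratio - pd_counts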
Figure 5 shows the detection advantage AD for the
series of Fig. 4 at PFA = 0.03 for both the cross-correlation and mutual information measures. The cross-correlation ρxy between successive pairs of datasets X and Y
was calculated from the expression
\rho_{xy} = \frac{E[(X - \mu_x)(Y - \mu_y)]}{\sigma_x \sigma_y} ,  (5)
where E is the expected value operator, and σ_x² and σ_y² are the respective variances of X and Y.

Figure 3. Historical rash counts R from chief-complaint syndrome data (upper) used as the target stream to a context data stream P of Poisson noise with a mean of 50 (lower).

Figure 4. ROC curves constructed from the results of Monte Carlo simulations for a target stream of daily rash syndrome counts when ratios were formed by using context series having increasing mutual information with the target data stream.

Clearly, the
detection probability is enhanced dramatically as the mutual information and correlation between the target and context series increase. For example, the use of the Poisson noise time series (having an MI of only 0.01 with the rash count series) as a context gives lower sensitivity than monitoring the rash counts alone. As the mutual information of the context data stream with the target increases, the detection advantage A_D increases sharply to 0.2, giving a 20% increase in the probability of detection.

Figure 5. Plots of the detection advantage A_D = PD(T + C) − PD(T) as a function of average mutual information (magenta) and correlation (blue) for the ROC curves shown in Fig. 4. The detection probabilities were calculated at PFA = 0.03. The relative detection probability increases dramatically as correlation and mutual information are increased.
Additional sets of simulations were conducted with
35 other syndromic count series from the same dataset of
outpatient clinic visits. In Fig. 6, we show a set of ROC
curves using counts of a botulism-like syndrome as the
target and 10 other syndromic series (including the sum
of all syndromic visits) as contexts. The black curve is
the ROC for the “target-only” series of undivided counts.
Figure 6. ROC curves obtained by varying the detection threshold over a set of Monte Carlo simulations when the syndrome "Bot-Like-2" is used as a target with 10 other health care data streams as contexts (lesion2, gi_2, resp1, rash2, resp2, shk_coma2, Neuro_1, fever1, UGI_1, and the total of all syndromic visits).
Note that the three curves below this target-only curve
correspond to algorithms applied to ratios using the context data series for neurological, shock/coma, and fever
syndrome counts. These three curves indicate lower
detection probabilities than the black curve generated
by the algorithm applied to counts alone because of
their low mutual information (MI ≤ 0.35) relative to the
target count series.
The results of our Monte Carlo simulations on outpatient clinic data can be summarized as follows: (i) using
MI as a metric was slightly better than using correlation,
(ii) using a shorter time window (e.g., 100 or 200 days) for
calculating the mutual information between two time
series is preferable to comparing their average MI over
their entire time history, and (iii) the maximum probability of detection among context series occurred almost
always when the total syndromic visit count series was
used as the context, even if another syndrome data
stream had higher MI with the target series. This observation is related to coherent integration effects25 arising
from the summation of a large number of (noisy) syndromic data series. This aggregation of many sources causes
the average noise amplitude in the total diagnostic series
to be reduced by a factor of √N, where N is the number
of syndromes in the total dataset. This noise reduction
is especially evident for highly multivariate datasets like
that used in our simulations (N = 35).
Practical Implementation of Mutual Information
Discrimination
Our MI criterion is intended to provide the epidemiologist with a mathematical tool to decide whether
to monitor individual counts or to use proportions, and
what kind of proportions, when monitoring a public
health network.
The Monte Carlo simulations discussed in the previous section suggest that it is useful to use ratios instead
of counts to clarify an outbreak if the numerator and
denominator of the ratio have sufficient mutual information. Simulations using an outpatient clinic visit dataset of 35 syndromic time series suggest a minimum MI of 0.5 for monitoring
target/context ratios instead of counts if the method of
Daub et al.10 is used for MI calculation. For large datasets, it can be advantageous to use the total diagnostic
counts as the context because the noise levels of this
data stream are considerably reduced by coherent integration. The advantage of monitoring ratios instead of
counts also may be improved by transforming the data
series, using short-history time windows, or implementing more sophisticated detection algorithms. Finally, it
also is beneficial to use this method over short time windows (e.g., 200 days) to ensure that the measured dependency between target and context series is recent and is
not washed out by multiyear averaging.
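The guidance above can be condensed into a simple selection rule, sketched here. This reuses the mutual_information() sketch given earlier; with a different MI estimator (e.g., the B-spline method), the 0.5 cutoff would need recalibration, and the function and parameter names are illustrative.

def choose_context(target, candidates, window=200, cutoff=0.5):
    """Score each candidate context stream by its MI with the target
    over a recent window; return the best candidate only if its MI
    clears the cutoff, otherwise recommend monitoring counts alone."""
    recent = target[-window:]
    scores = {name: mutual_information(recent, c[-window:])
              for name, c in candidates.items()}
    name, mi = max(scores.items(), key=lambda kv: kv[1])
    return (name, mi) if mi >= cutoff else (None, mi)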
Signal Processing RLS Filters Adapted
for Background Prediction
Linear Filters for Adaptive Modeling
For useful anomaly detection in biosurveillance, it is
necessary to compare recent observed data to an estimate (or predicted value) of what the data ought to be
in the absence of a disease outbreak. Therefore, accurate prediction of the data background is an important
subtask of automated surveillance. The prediction of
syndromic data is complicated by strong nonstationarity
and by multiple issues related to data acquisition. Linear
filters are particularly useful in the background prediction problem because they are easy to implement and
because their adaptive formulation can deal effectively
with nonstationary data.26 The ability to use them in
multistream environments is an additional advantage.
A transient event of short duration superimposed on
a stationary (or quasi-stationary) background cannot
be usefully predicted by using linear prediction filters.27
However, when recent values of a single data stream
are used to predict future values, these predictions may
be used as a set of “background” values with respect to
which a threshold detector (for unusually high counts)
could operate.27 Accurate background estimates may be
expected to yield improved sensitivity at practical alert
rates as measured by ROC curves, as shown in the previous section.
Filter Modifications for Biosurveillance Applications
Linear prediction filters are generally easy to
implement when the optimality criterion is that of a
minimum least-squared error. We have successfully pioneered and used the adaptive RLS linear filters in the
analysis of a "predict and detect" system using over-the-counter medications that were chosen based on
their high correlations with a single clinical dataset.27
The current effort shows that the same RLS prediction
method in a univariate (single data stream) environment
predicts accurate background counts for a large class of
surveillance time series.
A major problem for modeling syndromic data streams
from many health indicator information sources is the
day-of-week effect. When present, this effect results
in a consistent statistical disparity among observed
counts on different weekdays. A systematic and adaptive approach to this problem is a vital ingredient in our
RLS implementation. Our method to equalize the data
distributions relies on an invertible transformation of
the data values. The following section describes the dayof-week issue and our solution and implementation. We
then show the RLS implementation of the prediction
problem and present results and discussion.
Algorithmic Treatment of Weekly Data Patterns
To illustrate the issue, we present military outpatient
clinic data from a large metropolitan area. The time series
in Fig. 7 represents nearly 4 years of typical daily counts
of visits whose diagnoses were classified in the respiratory
syndrome. Figure 8 shows the sorted (by value) data corresponding to different days of the week. Weekend (Saturday and Sunday) values fall on curves that are distinctly
separate (and lower) from those of the weekdays. Our
adaptive solution to this problem is as follows. We fit the
middle part of each curve with a straight line and form
a reference line using the median of the weekday lines.
Other choices for the reference line could work as well.
All data values then are corrected to this reference line by linear transformation. Figure 9 shows the corrected (sorted) data values after this transformation, and the corrected daily values are shown in Fig. 10. Predictions then proceed on these corrected values. The transformations are inverted by using the same straight-line parameters to restore count levels on the original scale.

Figure 7. Daily counts of syndromic visits for the respiratory syndrome of military outpatient visits from a large metropolitan area.

Figure 8. Cumulative frequency distribution (transposed) of daily counts by day of week. The abscissa values are used as indices for day-of-week equalization.

Figure 9. Cumulative frequency distribution (transposed) of daily counts by day of week after equalization.

Figure 10. Daily counts of syndromic visits after equalization transformation.
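The equalization can be sketched as follows, under stated assumptions: the middle quantile range, the least-squares line fit, and the day-of-week convention (0 = Sunday) are illustrative choices, not details of the implemented system.

import numpy as np

def equalize_day_of_week(counts, dow, q_lo=0.25, q_hi=0.75):
    """Fit the middle of each weekday's sorted count distribution with
    a straight line, form a reference line from the median of the five
    weekday lines, and map counts linearly onto that reference.  The
    returned line parameters allow exact inversion of the transform."""
    counts = np.asarray(counts, float)
    dow = np.asarray(dow)
    lines = {}
    for d in range(7):
        vals = np.sort(counts[dow == d])
        lo, hi = int(q_lo * len(vals)), int(q_hi * len(vals))
        slope, intercept = np.polyfit(np.arange(lo, hi), vals[lo:hi], 1)
        lines[d] = (slope, intercept)
    weekday = [lines[d] for d in range(1, 6)]          # Monday-Friday
    ref_s = np.median([s for s, _ in weekday])
    ref_b = np.median([b for _, b in weekday])
    out = counts.copy()
    for d, (s, b) in lines.items():
        mask = dow == d
        out[mask] = ref_s * (counts[mask] - b) / s + ref_b
    return out, lines, (ref_s, ref_b)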
Background Prediction and the RLS Algorithm
We consider the raw time series of clinical visit counts as both the primary and the reference data channel in order to predict future counts in the following manner. Today's and several past days' counts are combined to make a future clinical data prediction, which then is compared to the actual value of that day's clinical data when that value becomes available, and the error is used
to update the filter coefficients in such a way as to minimize the square of the error.
Denoting the daily counts by y_n, a linear prediction, P days ahead, is given by

\hat{y}_{n+P} = \sum_{m=0}^{M-1} h_m\, y_{n-m} ,  (6)

where we have assumed a set of linear filter coefficients h = [h_0, h_1, . . ., h_{M-1}]^T of length M. In vector notation, this equation is

\hat{y}_{n+P} = \mathbf{h}^T \mathbf{y}_n ,  (7)

where \mathbf{y}_n = [y_n, y_{n-1}, . . ., y_{n-(M-1)}]^T and T denotes matrix transposition. The prediction error is given by

e_k = y_k - \hat{y}_k ,  (8)
and the performance index at day n is

\varepsilon_n = \sum_{k=0}^{n} \lambda^{n-k}\, |e_k|^2 .

The "forgetting factor" λ is introduced to deal with nonstationary behavior so that more recent data are given more emphasis.27 The relationship between the forgetting factor and the effective memory of the filter is found from

\bar{n}_{\lambda} = \frac{\sum_{n=0}^{\infty} n\,\lambda^{n}}{\sum_{n=0}^{\infty} \lambda^{n}} = \frac{\lambda}{1-\lambda} .

This equation can be solved to give the forgetting factor in terms of the effective memory of the filter, \lambda = \bar{n}_{\lambda}/(1 + \bar{n}_{\lambda}), which can be chosen according to observations made on the underlying dataset. We have generally found an effective memory value of 4 weeks (28 days) to give a useful value λ = 0.9655. The least-squares analogue of the ordinary Wiener filter equations then is

\sum_{l=0}^{M-1} \left( \sum_{k=0}^{n} \lambda^{n-k}\, y_{k-m}\, y_{k-l} \right) h_l^{(n)} = \sum_{k=0}^{n} \lambda^{n-k}\, x_k\, y_{k-m} ,  (9)

where we have introduced the variable x_n \equiv \hat{y}_{n+P}. Details of the RLS adaptive algorithm are provided in Box 1.
Note that our implementation uses multiple predictions for the same day, as described above and shown in Fig. 11, in which the data between the left-hand arrows are used to predict each data point (red circle). Thus, given today's and the past available data—i.e., indices n − (M − 1), . . ., n—we make a prediction for day index n + P, using the present filters. In addition, we use the same filters to make a prediction for day n + P − 1, using the available data indices [n − 1 − (M − 1), . . ., n − 1], and similarly we continue to make a prediction for every prior day up to and including the index n + 1 {the corresponding data indices for the latter prediction are [n − (M − 1) − (P − 1), . . ., n − (P − 1)]}. In other words, every day in the future will have P predictions that were made when that point was P days ahead of the index n, P − 1 days ahead, and so on, until it was only 1 day ahead. So when the index n (i.e., today) turns to n + 1, i.e., the data for tomorrow become available, we have P possible error terms, only one of which can be fed back into the recursive update equations. In the present implementation, we have found that choosing the error term with the smallest magnitude produces the most effective coefficient updating in terms of prediction accuracy. We experimented with feeding back the most recent estimate as well as the one with the smallest magnitude and found the latter to provide slightly better results. In many cases, of course, the most recent estimate also was the one with the smallest magnitude.
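A compact sketch of the predictor follows, using the standard RLS recursion (equivalent to the Box 1 updates); λ = 0.9655 corresponds to the 28-day effective memory quoted above. The filter length M, the initialization constant delta, and the feedback of the most recent error (rather than the smallest-magnitude error among the P available ones) are simplifying assumptions of this sketch.

import numpy as np

class RLSPredictor:
    """Adaptive RLS background predictor: a minimal sketch of
    Eqs. 6-9, not the implemented system."""

    def __init__(self, M=7, P=1, lam=0.9655, delta=100.0):
        self.M, self.P, self.lam = M, P, lam
        self.h = np.zeros(M)               # filter coefficients (Eq. 6)
        self.Rinv = np.eye(M) * delta      # inverse correlation matrix
        self.hist = []                     # observed counts

    def step(self, y_new):
        """Ingest today's (equalized) count; return the P-day-ahead
        prediction, or None until enough data are available."""
        self.hist.append(float(y_new))
        if len(self.hist) < self.M + self.P:
            return None
        # Regressor that was used P days ago to predict today's value:
        past = np.array(self.hist[-self.M - self.P : -self.P][::-1])
        e = self.hist[-1] - self.h @ past            # prediction error
        Py = self.Rinv @ past
        g = Py / (self.lam + past @ Py)              # gain vector
        self.h += g * e                              # coefficient update
        self.Rinv = (self.Rinv - np.outer(g, Py)) / self.lam
        # Prediction for P days ahead from today's regressor (Eq. 7):
        y_vec = np.array(self.hist[-self.M:][::-1])
        return float(self.h @ y_vec)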
Comparative Performance of Adaptive RLS Filter on
Syndromic Data
Table 1 lists the daily syndromic time series used to
test this method. We used a training period of 1 year
to compute reliable estimates of the linear transformation coefficients required in the day-of-week equalization procedure prior to prediction. The last two columns
of Table 1 show the median fractional absolute difference for RLS predictions of the listed nine syndromes
compared with predictions obtained with a method
Box 1. The RLS Adaptive Algorithm
The RLS algorithm26 is an adaptive method to relate the solution to the least-squares problem at step n to the solution of the least-squares problem at step n + 1. In the parlance of Kalman filtering theory (sequential estimation theory), the quantities at step n (present) are referred to as the a priori values, whereas those at step n + 1 (next step) are a posteriori quantities. The quantities

R_{ml}^{(n)} \equiv \sum_{k=0}^{n} \lambda^{n-k}\, y_{k-m}\, y_{k-l}

and

r_m^{(n)} \equiv \sum_{k=0}^{n} \lambda^{n-k}\, x_k\, y_{k-m} ,

which play the roles of the correlation matrix and the cross-correlation vector, satisfy rank-1 update equations

R^{(n)} = \lambda R^{(n-1)} + \mathbf{y}^{(n)} \mathbf{y}^{(n)T}
r^{(n)} = \lambda r^{(n-1)} + x_n\, \mathbf{y}^{(n)} ,

where the superscript n denotes the present day number (with 0 indicating the first day). The least-squares solution given a pair {R^{(n)}, r^{(n)}} is clearly

\mathbf{h}^{(n)} = [R^{(n)}]^{-1} r^{(n)} ,

and the corresponding prediction value (filter output) is

\hat{x}^{(n)} = \mathbf{h}^{(n)T} \mathbf{y}^{(n)} .

An important step in the RLS algorithm is the computation of the inverse correlation matrix, which by using the matrix inversion lemma can be written as follows:

[R^{(n+1)}]^{-1} = [R^{(n)}]^{-1} - \mu^{(n)} K^{(n)} K^{(n)T} ,

in terms of the Kalman gain vectors

K^{(j)} \equiv [R^{(j)}]^{-1} \mathbf{y}^{(n)}, \quad j = n, n + 1 ,

and the likelihood variable \mu^{(n)} = (1 + \mathbf{y}^{(n)T} K^{(n)})^{-1}. The filter update equations are obtained by applying R^{(n+1)} (i.e., the correlation function of the least-squares problem at step n + 1) to the left of \mathbf{h}^{(n)} (i.e., the solution to the least-squares problem at step n):

\mathbf{h}^{(n)} = \mathbf{h}^{(n-1)} + e(n|n-1)\, \mu^{(n)} K(n|n-1) ,

while the prediction errors are related to each other by

e(n) = \mu^{(n)}\, e(n|n-1) .

The hybrid quantities are defined as follows:

\hat{x}(n|n-1) = \mathbf{h}^{(n-1)T} \mathbf{y}^{(n)}
e(n|n-1) = x_n - \hat{x}(n|n-1)
K(n|n-1) = \lambda^{-1} [R^{(n-1)}]^{-1} \mathbf{y}^{(n)}
\mu^{(n)} = (1 + K(n|n-1)^T \mathbf{y}^{(n)})^{-1} = 1 - K^{(n)T} \mathbf{y}^{(n)} .

Figure 11. Schematic illustration of the prediction algorithm (e.g., values on days s1–s4 are used to predict p3).
Table 1. Median fractional absolute forecast error for nine syndromic time series comparing the CDC W2 algorithm to adaptive RLS.

Syndrome                    Mean count    CDC W2 (%)    RLS (%)
Respiratory 1               335           13            9
Respiratory 2               162           16            10
Fever 1                     79            16            14
Rash 2                      78            18            17
Gastrointestinal (GI) 2     62            16            15
Lower GI 2                  55            22            16
GI 1                        53            16            15
Lower GI 1                  38            18            15
Neurological 2              35            19            18
currently used in the CDC BioSense
system28 for monitoring public
health on the national level. This
reference method is an enhancement of the Early Aberration
Reporting System (EARS) C2
algorithm,29 denoted W2, in which
separate C2 implementations are
applied to normal weekdays and
to weekend/holidays. For the syndromic visit count series used, note
that the RLS fractional prediction errors are consistently below
those of the reference W2 method.
Importantly, for day-to-day monitoring experience, this advantage
holds uniformly when comparisons
are stratified by day of week and by
month of year.
The modified RLS filter-based
predictor presented above yielded
useful predictions for a variety of
city-level syndromic data series.
More recent comparisons suggest
that it may be useful on smaller scales as well, and efforts
are ongoing to establish the class of time series for which
it is useful and to select this predictor based on a modest
set of historic data. A logical next step is to investigate
the advantage in signal detection capability that may be
obtained by using residuals calculated with this specialized RLS adaptation. Both simple threshold detectors
and more complex control charts are to be applied for
this purpose.
Practical Adaptation of Temporal Alerting
Methods for Routine Operations
Basic challenges and common approaches used in
univariate temporal alerting algorithms for biosurveillance have been discussed previously in this journal.1
This section discusses advances and adaptations that
have resulted from experience working with epidemiologists, from the changing data environment, and from
recent research.
Adaptive Choice of Detection Algorithms
The use of data models and control charts for disease
surveillance was discussed previously in this journal.1
Although the adaptive regression approach discussed
in that article has proved applicable to many facilitylevel and county-level data streams, this modeling
has little explanatory value for sparser time series that
occur when data records are filtered more finely because
of geographic restrictions, concerns for particular
age groups, or subsyndrome classifications. Moreover,
ESSENCE has evolved to allow users to choose ad hoc
combinations of data filters, and rapid anomaly detection of the resulting time series is required. The first
and currently implemented solution to the dilemma of
whether and how to model was an automated choice
between adaptive regression and an exponentially
weighted moving average (EWMA) chart, discussed
next. A more recent, unified solution uses generalized
exponential smoothing, discussed in the section on
Holt–Winters-based control charts.
The current automated algorithm selection is determined by a goodness-of-fit test to determine whether
the adaptive regression model has explanatory value.
For each day’s baseline, regression coefficients are refit
by using standard predictor variables. For a goodness-of-fit measure, the adjusted R² coefficient estimates the portion of the sum of squared baseline modeling errors that is explained by the predictor variables. If this measure exceeds 0.6, the system uses the regression model
to make a current-day forecast and decide whether the
day’s observed count is anomalous. Time series of record
counts using the more inclusive respiratory and gastrointestinal syndromes usually pass this test in medium
to large regions where the median daily count is well
above 10. However, if the adjusted R² does not exceed
0.6, an adaptive EWMA control chart is applied to test
for anomaly.
The experience of recent years of ESSENCE use has
led to modifications in the implementations of both the regression model and EWMA algorithms, and we summarize
these modifications below.
Modifications to Adaptive Regression Algorithm
• Predictor variables: The various data types used in
ESSENCE show several different day-of-week patterns, with many data sources showing characteristic weekend drop-offs ranging from 75% to 100%.
Weekly patterns in hospital emergency room data
vary widely according to the type of hospital and level
of weekend staffing. To accommodate the resulting
range of time series behaviors, six day-of-week indicator variables now are included in the model. The
other predictor variables are a linear trend term and
indicators for holiday and post-holiday effects. By
contrast with some research efforts,30,31 long-term
predictors such as harmonic seasonal terms are not
included in ESSENCE because the data history often
is not sufficient for the use of these terms, and even
when it is sufficient, year-to-year changes in information systems, diagnosis coding, population behavior, and even weather can degrade their predictive
value.
• Outlier removal: Alerting problems have occurred
because of anomalous baseline data values even when
the goodness-of-fit test selects the regression model.
These outlier values may reflect true health events
but often result from issues in the data chain, and,
regardless of their cause, they should not be used to
infer regression coefficients for forecasting. To avoid
training on these outliers, observations outside the
alerting confidence bounds are replaced by those
bounds for model-fitting.
• Probability-scale representation of algorithm output:
The unscaled outputs of the regression algorithm are
the forecast errors divided by the standard error of
regression. For alerting purposes, values representing degree of anomaly are derived from Student’s t
test distribution lookups using these outputs with
the number of degrees of freedom set to the baseline
length minus the number of predictor variables. Note
that the resulting p values are not outbreak probabilities but only statistical anomaly measures on a 0–1
scale. Use of this scale allows comparison of outputs
among algorithms and direct combination of these
outputs by multiple univariate methods.
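A minimal sketch of the refit-each-day adaptive regression described above follows. The helper names, indicator coding, and matrix layout are illustrative assumptions; the holiday and post-holiday flags must be supplied by the caller, and the outlier replacement described above is omitted for brevity.

import numpy as np

def design_matrix(dow, holiday, post_holiday):
    # Intercept, linear trend, six day-of-week indicators (one weekday
    # serves as the reference), and holiday/post-holiday indicators.
    n = len(dow)
    cols = [np.ones(n), np.arange(n, dtype=float)]
    cols += [(np.asarray(dow) == d).astype(float) for d in range(6)]
    cols += [np.asarray(holiday, float), np.asarray(post_holiday, float)]
    return np.column_stack(cols)

def regression_forecast(y, dow, holiday, post_holiday):
    """Fit on the baseline (all days but the last), forecast the last
    day, and return (forecast, adjusted R^2); per the selection rule
    above, the regression forecast is used only if the adjusted R^2
    exceeds 0.6, and the EWMA chart is applied otherwise."""
    X = design_matrix(dow, holiday, post_holiday)
    Xb, yb = X[:-1], np.asarray(y[:-1], float)
    beta, *_ = np.linalg.lstsq(Xb, yb, rcond=None)
    resid = yb - Xb @ beta
    n, p = Xb.shape
    ss_res = float(resid @ resid)
    ss_tot = float(((yb - yb.mean()) ** 2).sum())
    r2_adj = 1.0 - (ss_res / (n - p)) / (ss_tot / (n - 1))
    return float(X[-1] @ beta), r2_adj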
Modifications to the EWMA Control Chart Algorithm
When the regression goodness-of-fit test fails in
ESSENCE data analysis, an adaptive version of the
EWMA control chart of statistical process control (SPC) is applied.32 The SPC formulation assumes that the input data stream X_t is Gaussian with mean μ and variance σ². The ESSENCE test statistic is (Z_t − \bar{x}_t)/(k s_t), where Z_t is the current weighted moving average

Z_t = \omega X_t + (1 - \omega) Z_{t-1} ,  (10)

and ω is a smoothing constant between 0 and 1. The sliding baseline mean \bar{x}_t and standard deviation s_t give estimates of μ and σ. The denominator k s_t approximates the standard deviation of Z_t, where the constant k is given by

k^2 = \left(\frac{\omega}{2-\omega}\right)\left(1 - (1-\omega)^{2t}\right) .  (11)

The ESSENCE implementation uses a 28-day sliding baseline and a 2-day guard band, or buffer, separating the baseline from the current day.
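Before the adaptations listed next, the uncorrected test can be sketched as follows; seeding the recursion with the first observation is an illustrative choice of this sketch.

import numpy as np

def ewma_test_statistic(series, t, omega=0.4, B=28, g=2):
    """Z_t via Eq. 10, a sliding-baseline mean and standard deviation
    over B days ending a guard band of g days before day t, and the
    constant k from Eq. 11."""
    y = np.asarray(series, float)
    z = y[0]
    for x in y[1 : t + 1]:
        z = omega * x + (1.0 - omega) * z          # Eq. 10
    base = y[t - g - B : t - g]                    # sliding baseline
    k = np.sqrt(omega / (2.0 - omega) * (1.0 - (1.0 - omega) ** (2 * t)))
    return (z - base.mean()) / (k * base.std(ddof=1))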
Adaptations to EWMA Chart Implementation
The following modifications have been adopted to
improve the accuracy and robustness of statistical alerting in response to concerns of epidemiologist users,
typical data behavior, and the most common problems
in the data acquisition chain.
(i) Sensitivity to both sudden and gradual signals: Data
signals expected from infectious disease outbreaks are
not the lasting step increases that one would expect of
a mean shift. The signals of interest are transient data
effects of epidemic curves of attributable cases lasting
from a few days to a month or more. Even for a given
disease such as influenza, the outbreak signal may be
sudden and explosive or more gradual. In a published
hospital-based application, cumulative sum (CUSUM)-Shewhart and EWMA-Shewhart charts were applied to
detect both types of signal in the monitoring of hospital infections.33 Those authors found an EWMA-Shewhart chart preferable for autocorrelated background
data. Emulating this strategy, the ESSENCE EWMA
algorithm is applied for smoothing coefficients of both
0.4 and 0.9, with the smaller coefficient for sensitivity
to gradual signals and the larger one to approximate a
Shewhart chart for sensitivity to spikes.
(ii) Correction for the adaptive baseline: The conventional EWMA statistic was developed for stationary
Gaussian data, and the use of the sliding window for
daily parameter adjustment changes the variance of the
weighted moving average from the simple, fixed-parameter situation. Taking the variance of (Z_t − \bar{x}_t) and expanding, the total adjusted variance at time step j is s_t² times the factor
\left(\frac{\omega}{2-\omega}\right)\left(1 - (1-\omega)^{2j}\right) + \frac{1}{B} - \frac{2\,(1-\omega)^{g+1}\left(1 - (1-\omega)^{B}\right)}{B}  (12)
for baseline length B and guard band length g. Therefore, the sample standard deviation st is multiplied by the
square root of this factor in the test statistic.
(iii) Probability-scale representation of algorithm output:
As in the regression model implementation, this modification was partly a cultural one to bridge the gap
between classical epidemiologist practice and SPC control chart usage. Given the modification ii for the adaptive baseline, probability values representing the degree
of anomaly are derived from Student’s t test distribution
lookups with the number of degrees of freedom set to
the baseline length – 1, assuming that the input data are
Gaussian. Again, the resulting p values should not be
interpreted as outbreak probabilities.
(iv) Bounding the baseline variance: This modification was needed because of the many varied time series
monitored for both general syndromic and more diagnosis-specific surveillance. Because of the multiple strata
used as data filters, many of the monitored time series
are sparse, with median daily counts of zero. Thus, for a
4- to 8-week sliding baseline, the sample standard deviation s_t of these series often is near zero, and for detection statistics analogous to (x − μ)/s_t, near-zero variances
cause instability and unwanted alerts. In general, single
isolated cases are not signals. Therefore, health monitors
were asked which small temporal clusters should cause
alerts and which should not. Minimum variances were
established from this guidance by algebraic manipulation of the EWMA test statistic.
(v) Small-count corrections for monitoring non-Gaussian time series: Despite the adjustments above, for many
time series the algorithms were producing too many
alerts because series values were not Gaussian-distributed. In biostatistics applications, count data often are
assumed to obey a Poisson distribution, for which the
variance equals the mean value. The distribution of
syndromic count data is usually closer to Poisson than
to Gaussian in that the dependence of the variance on
the mean is evident (though overdispersion evidenced
by exaggerated variances often is seen because the
counts are not from homogeneous populations). For
this reason, false alarms were reduced by adding a factor
c/st to each Gaussian-derived alerting threshold, i.e., by
adding a fixed, empirically derived constant c divided
by the data standard deviation. This approach produces
adjustments that are significant for small-count time
series but negligible for series with larger counts. For
Poisson data streams, this approach yielded expected
probabilities of threshold exceedence for mean count
Johns Hopkins APL Technical Digest, Volume 27, Number 4 (2008)
ALERTING ALGORITHMS
values at least as small as 0.1 per day. The threshold
adjustment is not directly applicable to the EWMA
strategy because the EWMA Zt of a Poisson variable
is not still Poisson. However, the correction applied
to Zt/ω again produced the expected probabilities for
smoothing constants ω = 0.4 and 0.9 when the correction constant c was empirically computed as a function
of both ω and the desired p value α. Optimizing this
function over a range of mean values from 0.1 to 20
gave the following formula for c:
c(ω, α) = 0.1304 – (0.2409 – 0.1804
× (1 – ω)4) × log(10 × α) .
(13)
Figure 12 illustrates the effect of this correction.
Combining ii through v, the corrected adaptive EWMA alerting method for smoothing constants ω = 0.4 and 0.9 is as follows:
• At step t, compute the test statistic Z_t = (E_t − \bar{x}_t)/max(s_t*, s_min), where E_t is the current EWMA, \bar{x}_t is the current baseline mean, s_t* is the baseline standard deviation adjusted as in ii, and s_min is the empirical minimum in iv.
• Form the adjusted test statistic Z_t* by subtracting the non-Gaussian correction term:

Z_t^* = Z_t - c(\omega, \alpha) \times \omega / \max(s_t^*, s_{min}) .  (14)

• Alert if Z_t* exceeds the critical t distribution value for B − 1 degrees of freedom at p value α, where B is the baseline length.
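To make the combined procedure concrete, the following sketch implements the corrected test. The steady-state form of Eq. 12 and the value of s_min (an empirically derived constant in the implemented system) are assumptions for illustration, and the natural logarithm in Eq. 13 is inferred from the 1.07 factor quoted in the Fig. 12 caption.

import numpy as np
from scipy import stats

def c_correction(omega, alpha):
    # Eq. 13; the natural logarithm reproduces the 1.07 factor quoted
    # in the Fig. 12 caption for a 3-sigma (alpha ~ 0.00135) threshold.
    return 0.1304 - (0.2409 - 0.1804 * (1 - omega) ** 4) * np.log(10 * alpha)

def corrected_ewma_alert(ewma_value, baseline, omega=0.4, alpha=0.01,
                         g=2, s_min=0.5):
    """Corrected adaptive EWMA test combining modifications ii-v: a
    sketch, not the ESSENCE implementation."""
    B = len(baseline)
    xbar, s = baseline.mean(), baseline.std(ddof=1)
    var_factor = (omega / (2 - omega) + 1.0 / B
                  - 2 * (1 - omega) ** (g + 1) * (1 - (1 - omega) ** B) / B)
    s_adj = max(s * np.sqrt(var_factor), s_min)               # ii and iv
    z = (ewma_value - xbar) / s_adj
    z_star = z - c_correction(omega, alpha) * omega / s_adj   # v, Eq. 14
    return z_star > stats.t.ppf(1 - alpha, df=B - 1)          # alert test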
Figure 12. Monte Carlo estimates of false-alert probabilities for an EWMA parameter ω = 0.4. The red line is the desired alert probability, the blue curve is the result if the EWMA input data are Poisson-distributed, and the green curve is the result for the same inputs if the threshold is adjusted to 3 + 1.07ω/s.
(vi) Automated management of apparent data drop-outs:
Some algorithm problems reported by ESSENCE users
have resulted from data acquisition or reporting problems, and these were frequently caused by data drop-outs.
Typically, a clinic’s information system goes down for a
week or more, and data from the affected facilities are
interrupted until transmission to ESSENCE is restored.
Sometimes corrected previous-day values become available, but often the system retains zero values for those
days. Statistical detection and adjustment for these
problems is difficult because full and partial drop-outs
may cause a variety of “errors” in the sense of irrelevant
alarms. The simplest resulting problem is that the sliding
baseline mean and standard deviation assume unrealistic values during the drop-out, and meaningless alerts
are seen when good data start up again. This situation
arises from a common informatics limitation that missing data often are impossible to distinguish from zero
values because of the large number of data suppliers and
confidentiality restrictions in biosurveillance systems.
Therefore, the following statistical solution has been adopted. Suppose that the baseline contains a string of M zeros and that B_1 is the set of values outside this string. The M zeros are treated as missing data, and the baseline is restarted if

\left(\frac{NZ(B_1)}{\mathrm{length}(B_1)}\right)^{M} < \alpha ,  (15)

where NZ(B_1) is the number of zeros in B_1 and α is a threshold probability, with α = 0.01 adopted from experience. Thus, the string of zeros is more likely to be treated as missing data if few zeros appear in the rest of the counts. The implemented logic is more complex to account for multiple zero strings, but this basic idea avoids nuisance alarms and quickens algorithm recovery time.

Limitations of Automated Regression/Control-Chart Approach
The simple algorithm selection criterion combined
with improvements to the regression and EWMA algorithms has helped reduce data/algorithm mismatch and
false-alarm rates in the environment of increasingly
complex data and surveillance objectives. However,
limitations of this approach cause occasional anecdotal
problems, and the sheer number of data streams to monitor calls for continued improvements and system-level
algorithm combinations.
The most frequent problems result from violations of
the underlying data assumptions. For example, regression residuals in syndromic time series often are autocorrelated, variances are nonhomogeneous, and means are
not constant as a function of time. These realistic data
issues can cause the regression and EWMA methods to
scale anomalies differently. More advanced modeling
325­­­­
H. S. BURKOM et al.
approaches like weighted and robust regression, autoregressive integrated
moving average (ARIMA) modeling, and transformation of the dependent
variables can reduce these problems,31,34 but often they introduce more
assumptions and more coefficients to be estimated. It is usually suggested to have at least five observations for robust estimation of every parameter in a regression model. Thus, the predictors in the current ESSENCE model call for at least 8 weeks of data in the baseline, and much longer characteristic data baselines are available from relatively few data sources. Furthermore,
the advantages of more complex modeling may come at the cost of resourceintensive data analysis,35 and such analysis often is impractical for numerous,
ad hoc data streams.
Anecdotal problems have been caused by artifacts of the fixed-length
moving baseline and the exact goodness-of-fit criterion for choosing regression modeling. One approach that is more flexible than regression but can
accommodate seasonality, short-term trends, and data peculiarities is generalized exponential smoothing, which is widely used for financial forecasting
and introduced in the next section.
Data Forecasting Using Generalized Exponential Smoothing
The idea of exponential smoothing has been extended to the modeling of
local changes in trend36 and then in periodic behavior.37 In the well-known
Holt–Winters implementation, updating equations analogous to Eq. 10 are
combined to obtain forecasts accounting for changes in process mean, seasonal components, and trend behavior. Specifically, let α, β, and γ denote
smoothing coefficients (where α has the role of ω in simple EWMA) for
updating terms corresponding to the level, trend, and seasonality, and let s
be the cycle length, in data time steps, of seasonal or periodic behavior in
the data series. Forecast components mt, bt, and ct then are updated with the
following equations.
Level: Trend: Seasonality: mt = 
yt
ct − s
+ (1 − )(m t − 1 + bt − 1), 0 <  < 1 bt = b(mt 2 mt 2 1) 1 (1 2 b)bt 2 1, 0 < b < 1
ct = �
yt
m
t − 1*
+ (1 − �)c t − s , 0 < � < 1 (16)
(17)
(18)
Eq. 18 contains a recent, robust improvement with the use of mt – 1.38 In the
multiplicative Holt–Winters procedure, these components are combined to
yield a k step-ahead forecast:
yˆ n + k|n = (m n + kb n )(c n − s + k ) . (19)
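A minimal multiplicative Holt–Winters forecaster following Eqs. 16–19 with cycle length s = 7 can be sketched as follows. The default coefficients are illustrative (they match the "average" category chosen in Table 2 below), and the initializations follow the recommendations discussed later in this section: overall mean level, zero trend, and stratified day-of-week averages for the seasonal multipliers.

import numpy as np

def holt_winters(y, alpha=0.15, beta=0.0, gamma=0.05, s=7, k=1):
    """Return the k-step-ahead Holt-Winters forecasts for daily counts
    y; a sketch, not the implemented detector."""
    y = np.asarray(y, float)
    m = y.mean()                               # initial level
    b = 0.0                                    # initial trend
    c = np.array([y[d::s].mean() for d in range(s)]) / m  # seasonal init
    forecasts = np.full(len(y), np.nan)
    for t in range(len(y)):
        if t + k < len(y):
            forecasts[t + k] = (m + k * b) * c[(t + k) % s]     # Eq. 19
        m_prev = m
        m = alpha * y[t] / c[t % s] + (1 - alpha) * (m + b)     # Eq. 16
        b = beta * (m - m_prev) + (1 - beta) * b                # Eq. 17
        c[t % s] = gamma * y[t] / m_prev + (1 - gamma) * c[t % s]  # Eq. 18
    return forecasts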
Holt–Winters Forecasting for Biosurveillance
Forecasting by the Holt–Winters method has been applied39 to the prediction of daily biosurveillance data using 16 authentic data streams on the
scale of Fig. 7. In this study, the cycle length s was set at 7 to account for
the frequent but data-source-dependent weekly patterns reported above
for daily syndromic series, and Holt–Winters forecasts gave forecast errors
that consistently were lower than
those obtained with an adaptive
regression model. We have applied
Holt–Winters forecasts to build an
anomaly detector that is robust with
respect to data scale and to common
data features such as holiday effects.
We summarize the modifications
required for robust forecasting and
then present the adaptive Holt–
Winters control chart.
(i) Updating adjustments for predictable outliers: Updating is applied
with the usual cyclic multiplier
replaced by a special factor for holidays or other calendar events.
(ii) Updating adjustments for unexpected outliers: Updating of the component terms is suspended if the
current data are anomalous according to an absolute fractional error
criterion. This condition has been
refined across data scales to avoid
spurious training.
(iii) Adjustment for sparse time series: Data with sequential zeros may cause an imbalance in component updating that leads to artificial trending. To avoid such artifacts, a small data-dependent constant is added to all input values and then subtracted from the composite forecast (see the sketch following Table 2).
(iv) Choice of initial values for level,
trend, and cyclic components: Initial
value effects are negligible in simple
EWMA but are known to be critical
in more generalized Holt–Winters
forecasting.35 Experience with syndromic series forecasting has confirmed this finding and shown that
grossly misspecified initial values
may degrade algorithm performance
for several months using daily data
and also may confound the choice
of smoothing coefficients α, β, and
γ. From experience with numerous
syndromic series on multiple scales,
the recommended starting values
for level, trend, and seasonal multipliers are the overall baseline mean,
an initial zero slope, and stratified
averages for the weekly multipliers,
respectively. If no historic data are
available, initialization should be
based on the best information or
experience at hand. After 4–6 weeks of data have been
collected, early guesses for the initial values should be
replaced by the recommended computed values.
(v) Choice of smoothing coefficients: The selection of
Holt–Winters smoothing coefficients is critical for reliable forecasting. Based on forecast results using a variety
of data streams and data-grouping criteria, we categorized input series by median value using categories of
sparse (median < 1), low (1 ≤ median < 10), average (10 ≤ median < 100), or high (median ≥ 100). Following previous studies,35,39 we applied a coarse, multidimensional
grid search to determine an effective set of smoothing
coefficients for each median category. Figure 13 illustrates the results of this search for combinations of α
and γ, for β = 0. Each color-coded cell represents the
goodness-of-fit-based rank of a particular [α, γ] coefficient pair. Blue represents the best pairs as indicated by
the highest rank. The sets of coefficients shown in Table
2 were chosen for each group.
Figure 13. Results of grid searches for optimal sets of smoothing coefficients in each data category. Each panel ranks the [α, γ] pairs (with β = 0) by goodness of fit for the median = 0, median < 10, median < 100, and median ≥ 100 groups.

Table 2. Coefficient sets.

Name      Median       Chosen [α, β, γ] coefficient set
Sparse    0            [0.05, 0, 0.10]
Low       [1, 10)      [0.05, 0, 0.05]
Average   [10, 100)    [0.15, 0, 0.05]
High      [100, ∞)     [0.30, 0, 0.05]
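To make modifications (ii)–(v) concrete, the sketch below combines the Table 2 coefficient lookup, the recommended initialization of item (iv), and a guarded update implementing the outlier suspension of item (ii) and the sparse-series offset of item (iii). The offset eps, the fractional-error threshold, and all names are illustrative assumptions rather than the ESSENCE implementation.

```python
import numpy as np

# Table 2 coefficient sets [alpha, beta, gamma], keyed by median category (item v).
COEFFS = {"sparse": (0.05, 0.0, 0.10), "low": (0.05, 0.0, 0.05),
          "average": (0.15, 0.0, 0.05), "high": (0.30, 0.0, 0.05)}

def category(median):
    """Assign the median-based category used to select smoothing coefficients."""
    if median < 1:
        return "sparse"
    if median < 10:
        return "low"
    return "average" if median < 100 else "high"

def initialize_state(baseline, s=7):
    """Item (iv): overall baseline mean as level, zero slope as trend, and
    normalized day-of-week averages as the initial seasonal multipliers."""
    y = np.asarray(baseline, dtype=float)
    m0, b0 = y.mean(), 0.0
    c0 = np.array([y[d::s].mean() for d in range(s)]) / max(y.mean(), 1e-9)
    return m0, b0, c0

def guarded_step(y_t, m, b, c, t, s, coeffs, eps=0.5, max_frac_err=3.0):
    """One robust update wrapping Eqs. 16-18 with items (ii) and (iii)."""
    alpha, beta, gamma = coeffs
    y_adj = y_t + eps                                # (iii) offset for sparse series
    y_hat = (m + b) * c[t % s] + eps                 # 1-step forecast on offset scale
    if abs(y_adj - y_hat) <= max_frac_err * y_hat:   # (ii) suspend training on outliers
        c_old = c[t % s]
        m_new = alpha * y_adj / c_old + (1 - alpha) * (m + b)
        b_new = beta * (m_new - m) + (1 - beta) * b
        c[t % s] = gamma * y_adj / m + (1 - gamma) * c_old
        m, b = m_new, b_new
    return m, b, c, y_hat - eps                      # (iii) offset removed from forecast
```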
Forming an Alerting Algorithm from the Residuals

Even for reliable data forecasts, the forecast residuals, or differences between observed and expected values, are not sufficient for anomaly detection. Estimates of residual variance also are needed to calculate a detection statistic in order to reduce day-to-day changes in sensitivity and specificity. For example, in syndromic data streams, a day-of-week effect is present not only in sample data means but also, more subtly, in residual variances, despite forecast efforts to remove this effect. We tested and adopted a method suggested by Chatfield40 and nearly identically by Koehler38 in which variance is a function of the level and cyclic parameters.21 The Koehler estimate provided the detection statistic

$(y_{t+k} - \hat{y}_t(k)) / \sqrt{\operatorname{Var}(e_t(k))}$

with the closest fit to a standard Gaussian distribution.
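A hedged sketch of the resulting normalized detector follows: the forecast residual is divided by a variance estimate that scales with the forecast level and the cyclic multiplier, in the spirit of the Chatfield/Koehler prediction-variance results. The relative-variance parameter sigma2_rel is an assumed stand-in for the published formula.

```python
import math

def detection_statistic(y_obs, m, b, c_k, k, sigma2_rel):
    """(y_{t+k} - yhat_t(k)) / sqrt(Var(e_t(k))), with the error variance tied
    to the forecast level and day-of-week multiplier c_k (an assumption here)."""
    y_hat = (m + k * b) * c_k                  # Eq. 19 forecast for lead time k
    var_e = sigma2_rel * y_hat ** 2            # variance grows with level and multiplier
    return (y_obs - y_hat) / math.sqrt(max(var_e, 1e-12))
```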
Holt–Winters-Based Algorithm Performance
A study has been completed to compare the detection performance of the above statistic to the traditional
regression-based detector. Simple 1-day spikes were
added to authentic background data streams previously
collected in ESSENCE. Probabilities of detecting the
injected spikes were tabulated for the new normalized
Holt–Winters statistic and for the adaptive regression
method, respectively. Average detection probabilities,
equivalent to the scaled area under the relevant part of
the ROC curve, were calculated for background alert
levels ranging from 1 per 28 days to 1 per 56 days. Table
3 summarizes the results of this study and shows a consistent detection performance advantage for the new
Holt–Winters method. Although more sophisticated modeling could probably be shown to improve the regression detector for any given time series, such efforts must weigh the variety of data types, limited on-site analysis capability, and other practical requirements of biosurveillance systems.
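A minimal sketch of this spike-injection evaluation follows. The detector interface, the quantile-based thresholding, and the test window are assumptions for illustration; the spike height of twice the series standard deviation matches the study description (see the Table 3 note).

```python
import numpy as np

def detection_probability(series, stats_fn, alerts_per_days=28.0):
    """Estimate the probability of detecting injected 1-day spikes.
    stats_fn maps a series to its daily detector statistics (assumed interface)."""
    stats_bg = stats_fn(series)
    # Threshold set empirically so background alerts occur at the target rate.
    thresh = np.quantile(stats_bg, 1.0 - 1.0 / alerts_per_days)
    spike = 2.0 * np.std(series)               # signal level from the Table 3 note
    test_days = range(len(series) // 2, len(series))
    hits = 0
    for t in test_days:
        injected = np.array(series, dtype=float)
        injected[t] += spike                   # inject a simple 1-day spike
        hits += stats_fn(injected)[t] > thresh # detected on the injection day?
    return hits / len(test_days)
```

Averaging this probability over thresholds corresponding to background rates between 1 per 28 days and 1 per 56 days gives the scaled area under the relevant part of the ROC curve reported in Table 3.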
Building on these preliminary results, work is in progress to evaluate Holt–Winters detection performance against signals of varying, realistic shape and duration for ESSENCE implementation.
Summary and Research Directions
This article shows how challenges of developing automated disease surveillance systems are being approached
by statistical means. The discussion has been limited
to univariate alerting methods, but current ESSENCE
methods also include multivariate statistical alerting for
simple data fusion41 and scan statistics for detection of
localized case clusters.1
A considerable body of research has addressed syndromic classification of clinical patient records in order to
clarify potential data signals for early warning of disease
outbreaks.6 Subsequent to these classification decisions
are questions of how to form the most useful outcome
variables and how to monitor them. Research issues
include how to best combine and transform the derived
time series. The MI criterion presented above gives a
means for evaluating target/context quotients, typically
a series of daily syndromic facility visit counts divided by
the daily facility visit total, for their capacity to increase
the signal-to-noise ratio for the syndrome of interest.
The MI simulation results suggest that appropriately
conditioning and combining syndromic data streams
can be important in achieving the signal-to-noise gains
necessary for improved detection performance in biosurveillance. Further research is continuing on applying
optimal filtering to the target/context problem and on
implementing shorter time windows for more discriminating alert detection. Information-theoretic methods,
little used in disease surveillance up to now, may have
additional potential utility, such as application to larger
sets of data streams to select advantageous combinations
for monitoring by adaptive multivariate SPC.
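As one concrete possibility, the MI between a candidate quotient series and a reference series can be estimated from a binned joint histogram, as sketched below. The bin count and this simple plug-in estimator are illustrative choices, not necessarily the article's exact procedure.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Plug-in MI estimate (in nats) from a 2-D histogram of the paired series."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                      # joint probability estimate
    px = pxy.sum(axis=1, keepdims=True)        # marginal of x (column vector)
    py = pxy.sum(axis=0, keepdims=True)        # marginal of y (row vector)
    nz = pxy > 0                               # avoid log(0) on empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```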
The requirement for adaptive data background estimation in biosurveillance is driven by the need for increasingly early signal recognition. Before the late 1990s,
most disease surveillance was done on weekly, monthly,
or longer time scales.
Table 3. Comparison of empirical detection probabilities obtained from Holt–Winters and adaptive regression methods.

Syndromic       Holt–Winters          Adaptive regression    Median daily
data series     detection prob.       detection prob.        data count
Resp_1          0.988                 0.984                  310.5
Resp_2          0.994                 0.983                  184
Fever_1         0.921                 0.879                  90
Rash_2          0.994                 0.985                  75
GI_2            0.990                 0.985                  72
Lesion_2        0.979                 0.960                  62
GI_1            0.989                 0.945                  55
Lesion_1        0.923                 0.943                  43
Neuro_2         0.986                 0.969                  39
LGI_2           0.920                 0.870                  23
UGI_1           0.910                 0.822                  15
Hemr_ill_2      0.944                 0.863                  15
Bot_Like_2      0.957                 0.882                  15
UGI_2           0.888                 0.832                  8
Lymph_1         0.828                 0.724                  7
Shk_Coma_2      0.579                 0.517                  5
Bot_Like_1      0.645                 0.509                  3
Rash_1          0.537                 0.484                  3

Average probabilities are shown for practical background alert rates for a signal level of twice the data standard deviation of each data series.
More recently, daily monitoring of population data has become common
departments, and advances in hospital informatics systems are pushing analysts to develop near real-time alerting capability. We have shown how linear RLS filters may
be modified for improved prediction of syndromic time
series, and we obtained consistent improvements in prediction for a variety of city-level data series.
Efforts are currently in progress to specify the class
of time series for which these filters are most useful and
to calculate filter coefficients from limited data history.
Another important extension is the use of RLS filters
for multivariate background prediction, an effort that
is promising because of the potential to adapt quickly
to changes in the multistream covariance matrix. Such
changes present a major obstacle for other modeling
approaches.
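For orientation, a generic recursive least squares (RLS) update of the textbook form is sketched below; the forgetting factor lam and the regressor layout are illustrative, and the modified APL filters referenced above differ in detail.

```python
import numpy as np

def rls_update(w, P, x, y, lam=0.98):
    """One RLS step: w is the weight vector, P the inverse correlation matrix,
    x the regressors (e.g., the p most recent values), y the observed target."""
    Px = P @ x
    k = Px / (lam + x @ Px)           # gain vector
    e = y - w @ x                     # a priori one-step prediction error
    w = w + k * e                     # weight update toward reducing the error
    P = (P - np.outer(k, Px)) / lam   # inverse-correlation update with forgetting
    return w, P, e
```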
The final section introduced anomaly detection
methods based on forecasts using generalized exponential smoothing. Engineering modifications were presented to get robust detection performance from Holt–
Winters forecasts across a large set of disparate syndromic time series.
We summarize the advantages of the Holt–Winters-based adaptive control charts:
• The normalized Holt–Winters detector outperforms the traditional regression-based method on most syndromic data streams.
• As in the application of simple EWMA charts, robust
detection performance does not depend on assumptions of normality, stationarity, and constant variance.
• EWMA is a special case of Holt–Winters smoothing,
so the specific adjustments described above for sparse
data streams can be included in a more general Holt–
Winters framework.
• Only a limited amount of data is required to initialize
the algorithm, and the entire data stream can be used
for detection analysis.
• With the choice of a single, adaptive method for each time series category, automated switching among algorithms is eliminated, avoiding the occasional problems of inconsistent anomaly scaling and incorrect switching.
• The residual variance estimation presented reflects
natural properties of syndromic data. No significant
autocorrelation is left in the forecast residuals.
Following the success of these preliminary results,
further testing of adaptive Holt–Winters control charts
is underway for detection of signals of varying realistic
shape and duration. The varying objectives, data environments, and geographic scales of modern biosurveillance systems pose numerous obstacles that can be overcome only by a combination of epidemiological and
informatics advances along with analytical ones like
those of the current study.
ACKNOWLEDGMENTS: This work was supported by Grant
1-R01-PH-24-1 from the CDC. The contents of this article
are solely the responsibility of the authors and do not
necessarily represent the official views of the CDC.
References
1. Burkom, H. S., “Development, Adaptation, and Assessment of Alerting Algorithms for Biosurveillance,” Johns Hopkins APL Technical Digest 24(4), 335–342 (2003).
2. Hurt-Mullen, K. J., and Coberly, J., “Syndromic Surveillance on the Epidemiologist’s Desktop: Making Sense of Much Data,” MMWR Morb. Mortal. Wkly. Rep. 54(Suppl.), 141–146 (2005).
3. Burkom, H., “Alerting Algorithms for Biosurveillance,” Chap. 6, in Disease Surveillance: A Public Health Informatics Approach, J. S. Lombardo and D. L. Buckeridge (eds.), John Wiley & Sons, Inc., Hoboken, NJ, pp. 143–192 (2007).
4. Reis, B. Y., Kohane, I. S., and Mandl, K. D., “An Epidemiological Network Model for Disease Outbreak Detection,” PLoS Medicine 4(6), e210 (2007).
5. Gordis, L., Epidemiology, Second Ed., W. B. Saunders, Philadelphia, PA (2000).
6. Babin, S., Magruder, S., Hakre, S., Coberly, J., and Lombardo, J. S., “Understanding the Data: Health Indicators in Disease Surveillance,” Chap. 2, in Disease Surveillance: A Public Health Informatics Approach, J. S. Lombardo and D. L. Buckeridge (eds.), John Wiley & Sons, Inc., Hoboken, NJ, pp. 43–90 (2007).
7. Wojcik, R., Hauenstein, L., Sniegoski, C., and Holtry, R., “Obtaining the Data,” in Disease Surveillance: A Public Health Informatics Approach, J. S. Lombardo and D. L. Buckeridge (eds.), John Wiley & Sons, Inc., Hoboken, NJ, pp. 91–142 (2007).
8. Cover, T. M., and Thomas, J. A., Elements of Information Theory, John Wiley & Sons, Inc., Hoboken, NJ (2006).
9. Herwig, R., Poustka, A. J., Mueller, C., Lehrach, H., and O’Brien, J., “Large-Scale Clustering of cDNA-Fingerprinting Data,” Genome Res. 9, 1093–1105 (1999).
10. Daub, C. O., Steuer, R., Selbig, J., and Kloska, S., “Estimating Mutual Information Using B-Spline Functions—An Improved Similarity Measure for Analysing Gene Expression Data,” BMC Bioinformatics 5, 118 (2004).
11. Priness, I., Maimon, O., and Ben-Gal, I., “Evaluation of Gene-Expression Clustering via Mutual Information Distance Measure,” BMC Bioinformatics 8, 111 (2007).
12. Fraser, A. M., and Swinney, H. L., “Independent Coordinates for Strange Attractors From Mutual Information,” Phys. Rev. A 33(2), 1134–1140 (1986).
13. Sagar, R. P., and Guevara, N. L., “Mutual Information and Electron Correlation in Momentum Space,” J. Chem. Phys. 124, 134101 (2006).
14. Nichols, J. M., Moniz, L., Nichols, J. D., Pecora, L. M., and Cooch, E., “Assessing Spatial Coupling in Complex Population Dynamics Using Mutual Prediction and Continuity Statistics,” Theor. Popul. Biol. 67(1), 9–21 (2005).
15. Kraskov, A., Stogbauer, H., and Grassberger, P., “Estimating Mutual Information,” Phys. Rev. E 69(6), 066138 (2004).
16. Steuer, R., Kurths, J., Daub, C. O., Weise, J., and Selbig, J., “The Mutual Information: Detecting and Evaluating Dependencies Between Variables,” Bioinformatics 18(Suppl. 2), S231–S240 (2002).
17. Sturges, H. A., “The Choice of a Class Interval,” J. Am. Stat. Assoc. 21, 65–66 (1926).
18. Law, A. M., and Kelton, W. D., Simulation Modeling and Analysis, McGraw–Hill, New York (1997).
19. Scott, D. W., “On Optimal and Data-Based Histograms,” Biometrika 66(3), 605–610 (1979).
20. Moon, Y.-I., Rajagopalan, B., and Lall, U., “Estimation of Mutual Information Using Kernel Density Estimators,” Phys. Rev. E 52(3), 2318–2321 (1995).
21. Witten, I. H., and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers Inc., San Francisco, CA (2000).
22. Michaels, G., Carr, D., Askenazi, M., Furman, S., Wen, X., and Somogyi, R., “Cluster Analysis and Data Visualization of Large-Scale Gene Expression Data,” Pac. Symp. Biocomput. 3, 42–53 (1998).
23. Strehl, A., and Ghosh, J., “Cluster Ensembles—A Knowledge Reuse Framework for Combining Multiple Partitions,” J. Machine Learn. Res. 3(3), 583–617 (2002).
24. McNicol, D., A Primer of Signal Detection Theory, George Allen & Unwin, London (1972).
25. Mahafza, B., Radar Systems Analysis and Design Using MATLAB, Chapman & Hall/CRC, Boca Raton, FL (2000).
26. Moon, T. K., and Stirling, W. C., Mathematical Methods and Algorithms for Signal Processing, Prentice Hall, Upper Saddle River, NJ (2000).
27. Najmi, A. H., and Magruder, S. F., “An Adaptive Prediction and Detection Algorithm for Multistream Syndromic Surveillance,” BMC Med. Inform. Decis. Mak. 5, 33 (2005).
28. Bradley, C. A., Rolka, H., Walker, D., and Loonsk, J., “BioSense: Implementation of a National Early Event Detection and Situational Awareness System,” MMWR Morb. Mortal. Wkly. Rep. 54(Suppl.), 11–19 (2005).
29. Hutwagner, L., Thompson, W., Seeman, G. M., and Treadwell, T., “The Bioterrorism Preparedness and Response Early Aberration Reporting System (EARS),” J. Urban Health 80, i89–i96 (2003).
30. Brillman, J. C., Burr, T., Forslund, D., Joyce, E., Picard, R., and Umland, E., “Modeling Emergency Department Visit Patterns for Infectious Disease Complaints: Results and Application to Disease Surveillance,” BMC Med. Inform. Decis. Mak. 5(4), 1–14 (2005).
31. Craigmile, P. F., Kim, N., Fernandez, S., and Bonsu, B., “Modeling and Detection of Respiratory-Related Outbreak Signatures,” BMC Med. Inform. Decis. Mak. 7, 28 (2007).
32. Ryan, T. P., Statistical Methods for Quality Improvement, John Wiley & Sons, Inc., New York (1989).
33. Morton, A. P., Whitby, M., McLaws, M.-L., Dobson, A., McElwain, S., et al., “The Application of Statistical Process Control Charts to the Detection and Monitoring of Hospital-Acquired Infections,” J. Qual. Clin. Prac. 21, 112–117 (2001).
34. Shmueli, G., and Fienberg, S. E., “Current and Potential Statistical Methods for Monitoring Multiple Data Streams for Biosurveillance,” in Statistical Methods in Counterterrorism: Game Theory, Modeling, Syndromic Surveillance, and Biometric Authentication, A. G. Wilson, G. D. Wilson, and D. H. Olwell (eds.), Springer, New York (2006).
35. Chatfield, C., and Yar, M., “Holt–Winters Forecasting: Some Practical Issues,” The Statistician 37(2), 129–140 (1988).
36. Holt, C. C., “Forecasting Trends and Seasonals by Exponentially Weighted Averages,” Office of Naval Research Memorandum No. 52, Carnegie Institute of Technology, Pittsburgh, PA (1957).
37. Winters, P. R., “Forecasting Sales by Exponentially Weighted Moving Averages,” Management Sci. 6, 324–342 (1960).
38. Koehler, A. B., Snyder, R. D., and Ord, J. K., “Forecasting Models and Prediction Intervals for the Multiplicative Holt–Winters Method,” Int. J. Forecast. 17, 269–286 (2001).
39. Burkom, H., Murphy, S. P., and Shmueli, G., “Automated Time Series Forecasting for Biosurveillance,” Stat. Med. 26(22), 4202–4218 (2007).
40. Chatfield, C., and Yar, M., “Prediction Intervals for Multiplicative Holt–Winters,” Int. J. Forecast. 7, 31–37 (1991).
41. Magruder, S. F., Lewis, S. H., Najmi, A., and Florio, E., “Progress in Understanding and Using Over-the-Counter Pharmaceuticals for Syndromic Surveillance,” MMWR Morb. Mortal. Wkly. Rep. 53(Suppl.), 117–122 (2004).
Johns Hopkins APL Technical Digest, Volume 27, Number 4 (2008)
The Authors
Howard S. Burkom received a B.S. degree from Lehigh University and M.S. and Ph.D. degrees in mathematics from the
University of Illinois at Urbana−Champaign. He has 7 years of teaching experience at the university and community college
levels. Since 1979, he has worked at APL developing detection algorithms for underwater acoustics, tactical oceanography,
and public health surveillance. Dr. Burkom has worked exclusively in the field of biosurveillance since 2000, primarily
adapting analytic methods from epidemiology, biostatistics, signal processing, statistical process control, data mining, and
other fields of applied science. He is an elected member of the Board of Directors of the International Society for Disease
Surveillance. Yevgeniy Elbert received his bachelor’s degree in mathematics from Kiev University in the Ukraine. He
continued his education at the University of Maryland, Baltimore County, receiving a master's degree in mathematical statistics in 1999. He worked as a statistician at the Walter Reed Army Institute of Research from 2001 until joining APL in 2006. The
majority of Mr. Elbert’s work is devoted to the development of the Electronic Surveillance System for the Early Notification
of Community-based Epidemics (ESSENCE) as part of the APL research team. Steven F. Magruder is a member of the APL
Principal Professional Staff. He received a Ph.D. in physics in 1978 from the University of Illinois at Urbana−Champaign.
He joined APL that same year and has spent most of the past 30 years developing physics models and detection/signal processing algorithms for a variety of submarine sensing technologies. He also has contributed to the development of tools for
the early recognition of a biological attack, focusing on the evaluation of data sources and algorithms. Dr. Magruder’s current assignment is as program manager for acoustics projects within the Ship Submersible Ballistic Nuclear (SSBN) Security
Technology Program. Amir H. Najmi has been employed at APL since 1990. His main areas of research and expertise
are spectral estimation and adaptive signal processing, wave propagation, and quantum theory of fields and particles. He has published in the
most prestigious physics and geophysics journals on subjects as diverse
as cosmological applications of quantum fields and seismic inversion
for oil exploration. Since 2004, he has worked on algorithm developments for syndromic surveillance. He has published two papers on the
applications of adaptive signal processing to public health surveillance.
Dr. Najmi received a degree in mathematics (the Mathematical Tripos)
from Cambridge University, and a doctorate (D.Phil.) in theoretical
physics from Oxford University. William Peter received his B.S. in
physics from the University of Southern California and his Ph.D. in
physics from the University of California, Irvine. He has worked at Los
Alamos National Laboratory in nuclear physics and high-performance
computing and was one of the founding scientists of a medical imaging company. He has recently developed algorithms to speed up Monte
Carlo calculations, improve clustering, and use information theory for
data fusion. Dr. Peter came to APL in 2006 and now is working within
the disease surveillance initiative on alert detection and data fusion. He
also serves as the Editor-in-Chief of the International Journal of Plasma
Science and Engineering. Michael W. Thompson earned a B.A. in music
and a B.S. in applied physics from Brigham Young University in 1998.
He subsequently earned an M.S. in physics from Brigham Young in 2000
doing computer simulations of nonlinear sound production in trombones. He continued his education at the Pennsylvania State University, where he earned a Ph.D. in acoustics in 2004 doing experimental
studies of nonlinear acoustic streaming in thermoacoustic devices. In
2004, he joined the Acoustics and Electromagnetics Group (STX) in
the National Security Technology Department at APL. He has since
been working in the areas of disease surveillance and underwater acoustics. Dr. Thompson is a member of the Acoustical Society of America.
For further information on the work reported here, contact Dr. Burkom.
His e-mail address is howard.burkom@jhuapl.edu.
For further information on the work reported here, contact Dr. Burkom. His e-mail address is howard.burkom@jhuapl.edu.