Detecting Disease Outbreaks using Markov Random Fields
CS 15-780 Final Project
Jon Derrberry (jonderry)
Matt Streeter (matts)

Summary: We address the problem of detecting disease outbreaks in real time based on available data. Our approach involves a type of intelligent smoothing of noisy observations based on Markov random fields. The advantage of Markov random fields over simpler methods such as kernel smoothing is confirmed experimentally.

1. Introduction

Our project addresses the problem of real-time detection of disease outbreaks based on available data, for example the records of emergency departments. At any time it is almost certain that we will see symptoms of a given disease somewhere, and if we look at enough data we will probably find a geographical region in which those symptoms appear, by chance, at an above-normal rate. What we would like to do is identify, based on a time series of data running up to the present moment, any geographical region whose data is such that the symptoms could not reasonably have occurred by chance. Our approach to this problem is based on Markov random fields.

Our problem, goals, and motivations are broadly the same as those of Neill and Moore (2003). One difference is that we focus exclusively on classification accuracy, without rigorously assessing the statistical significance of each prediction. This is a significant drawback, because statistical significance is very important to an epidemiologist. On the other hand, our problem formulation is a bit more general than that of Neill and Moore, and the answers produced by our algorithms are less constrained.

The following section provides a formal statement of the problem we address. Section 3 describes Markov random fields, which are central to our approach. Section 4 presents the algorithms we use to generate random instances of the problem for testing. Section 5 describes our MRF algorithm, as well as five simpler algorithms we compare against. Section 6 describes a Java application we created to perform experiments and visualize the results. Section 7 is the experimental comparison, and section 8 is the conclusion.

2. Problem statement

The problem we consider involves labeling the vertices of a graph. Each vertex in the graph represents a locale (e.g., a town), and an edge between vertices u and v represents the ability of an infectious disease to spread (rapidly) from u to v. The number of people living at each locale (the population) is known, as is the number of people showing symptoms of a certain disease (the disease count) from time t = 0 up until the present. Our task is to determine, based on the time series of disease counts and the populations, whether something out of the ordinary is happening. Specifically, we will be looking for a contiguous set of vertices whose disease counts are unlikely to have occurred by chance. The output of our algorithm will be a labeling of each vertex as either "infected" or "not infected".

This problem formulation is a generalization of the one studied by Neill and Moore (2003) in that we consider arbitrary graphs (not just grids), in that any contiguous subset of the vertices may be infected (rather than only rectangular regions), and in that we present our algorithms with a time series of disease counts, rather than just the disease counts at the current time. In regard to this last point, however, it should be noted that the algorithms we will study make use of disease counts prior to the current time in only a very limited way.
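For concreteness, the sketch below shows one way a problem instance and the desired output could be represented. It is illustrative only; the names are hypothetical and are not taken from our implementation.

```java
import java.util.List;

// Minimal sketch of a problem instance (hypothetical names).
// diseaseCounts[t][v] holds disease-count(v, t); population[v] holds population(v).
class OutbreakInstance {
    List<int[]> edges;      // undirected edges {u, v} of the locale graph G
    int[] population;       // population of each locale
    int[][] diseaseCounts;  // one row per time step 0..tCurrent, one column per locale
}

// A detection algorithm labels each locale as infected (true) or not infected (false)
// at the current time, given the instance observed so far.
interface OutbreakDetector {
    boolean[] predict(OutbreakInstance instance, int currentTime);
}
```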
Note that in practice the "disease counts" do not need to be actual counts of people who have a certain disease. Instead they can be indirect indicators of disease, such as sales of cough syrup, and the techniques developed here will still apply. Nonetheless we will refer to "disease counts" as if they were actual counts of people having some disease. Table 1 shows the various elements that comprise an instance of this problem.

Table 1. Elements that comprise an instance of the outbreak-detection problem.

  Element               Meaning
  G = <V, E>            Graph of locales. V is a set of locales and E is a set of edges connecting pairs of locales between which disease can be transmitted rapidly.
  population(v)         The population of a locale v ∈ V.
  disease-count(v, t)   The disease count of a locale v ∈ V at time t ∈ {0, 1, ..., t_current}.

3. Markov random fields

In a (first-order) Markov chain, the future of the system depends on the past only by way of the most recent state. Letting Q = {Q_0, Q_1, ..., Q_N} be a set of random variables that take on values in a state space S, and letting q = {q_0, q_1, ..., q_N} be their realizations, this is expressed by the equation:

  Pr[q_{n+1} | q_0, q_1, ..., q_n] = Pr[q_{n+1} | q_n].

Markov random fields are a generalization of Markov chains which allow q_{n+1} to be conditionally dependent on an arbitrary (but fixed) subset of q, rather than just q_n. When talking about Markov random fields, it is useful to think of the elements of q as arranged in a (directed) graph in space, rather than as a sequence in time. An edge in this graph represents conditional dependence. Letting neighbors(q_i) denote the elements of q that have edges pointing into q_i, this is expressed by the equation:

  Pr[q_i | q - {q_i}] = Pr[q_i | neighbors(q_i)].

Given a graph and an observation for each of its vertices, what we would like to do is determine the maximum-likelihood set q of underlying states for the vertices. If the graph is a Markov chain, this can be done efficiently using the Viterbi algorithm; in the general case, however, the problem is NP-complete. Fortunately, the special case of the problem that we need to solve is tractable, and can be reduced to a minimum-cut problem solvable by standard algorithms. The algorithm we use is due to Boykov et al. (1998), and is described further in section 5.6. For approximation algorithms that have been found to work well in the general case, see Boykov et al. (1998) and also Hochbaum (2001).

The mapping from the problem described in the previous section to the problem of Markov random field inference is straightforward. The relevant observations are of course the disease counts. The graph that defines the Markov random field has two directed edges (one in each direction) for every undirected edge in the original graph G that defines the problem. Further details are given in section 5.6. A good online introduction to Markov random fields is provided by Li (1995).

4. Instance generators

An algorithm for generating a random instance of the problem described in section 2 consists of two components: an algorithm for creating the graph of locales with their associated populations, and an algorithm for filling in the time series of disease counts. Our graph-generation algorithms are described in the following section, and the algorithm for generating the time series of disease counts is described in section 4.2.

4.1. Graph-generation algorithms

We use three different graph-generation algorithms: Grid, Random, and Planar.
Each of these algorithms takes as a parameter a value N_locales, which specifies the number of vertices in the graph.

Grid. The Grid algorithm arranges the vertices in a square grid with sides of length √N_locales. Each vertex is connected to its north, south, east, and west neighbors.

Random. The Random algorithm assigns random (x, y) coordinates from the interval [0, 1] to each of the N_locales vertices. Each vertex is then connected to (approximately) its six nearest neighbors. The way this is done is that the vertices are visited in a round-robin fashion for six rounds, and each time a vertex is visited it is connected to its nearest neighbor among the vertices that are not already its neighbors. Each vertex will be of degree at least six, but the maximum vertex degree will generally be higher.

Planar. The Planar algorithm is just like the Random algorithm except that when adding edges we first check to see whether the edge would cross any existing edge. If so, the edge is not added.

In all cases, the populations of locales are drawn from a Gaussian distribution with mean 10,000 and standard deviation 1,000. Negative populations are rounded up to 0.

4.2. Filling in time-series data

At each time step, the disease count for each uninfected locale u is drawn from a Poisson distribution with parameter r_baseline · population(u), and the disease count for each infected locale v is drawn from a Poisson distribution with parameter r_infected · population(v). It remains to describe how we determine which locales are infected at each time step. Initially (at time 0) all locales are uninfected. At time t_onset, a random locale becomes infected. Once a locale becomes infected, it stays infected for all remaining time steps. Additionally, at each time step after t_onset, each infected locale has a probability p_transmission of infecting each of its neighbors; neighbors become infected independently. Table 2 summarizes the various parameters of our instance-generation algorithm.

Table 2. Parameters used in generating an instance of the outbreak-detection problem.

  Parameter        Meaning
  Graph-type       Grid, Random, or Planar
  N_locales        number of vertices in the graph
  T                total number of time steps
  t_onset          time at which the infection begins
  r_baseline       disease rate for non-infected locales
  r_infected       disease rate for infected locales
  p_transmission   probability that an infected locale will infect a particular one of its neighbors on the next time step

5. Algorithms

This section describes the algorithms we have developed to solve the problem of detecting disease outbreaks.

5.1. Naive algorithm

At each time t, the naive algorithm examines, for each vertex v, the likelihood that disease-count(v, t) was generated from an infected locale and compares it to the likelihood that disease-count(v, t) was generated from an uninfected locale. The former is the probability of disease-count(v, t) under a Poisson distribution with parameter r_infected · population(v), while the latter is the probability of disease-count(v, t) under a Poisson distribution with parameter r_baseline · population(v). If the likelihood that disease-count(v, t) was generated from an infected locale is the greater of the two, v is predicted to be infected at time t; otherwise v is predicted to be uninfected at time t.
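The per-vertex likelihood comparison can be sketched as follows. The names are hypothetical; the rates r_infected and r_baseline are assumed known to the algorithm, as they are in our experiments.

```java
// Sketch of the naive algorithm: label each locale by comparing the Poisson
// likelihood of its current disease count under the infected and baseline rates.
class NaiveDetector {
    // log probability of observing count k under a Poisson distribution with
    // mean lambda (assumes lambda > 0): log(e^(-lambda) * lambda^k / k!)
    static double logPoisson(int k, double lambda) {
        double logFactorial = 0.0;
        for (int i = 2; i <= k; i++) logFactorial += Math.log(i);
        return -lambda + k * Math.log(lambda) - logFactorial;
    }

    static boolean[] predict(int[] population, int[] diseaseCount,
                             double rInfected, double rBaseline) {
        boolean[] infected = new boolean[population.length];
        for (int v = 0; v < population.length; v++) {
            double logLikInfected = logPoisson(diseaseCount[v], rInfected * population[v]);
            double logLikBaseline = logPoisson(diseaseCount[v], rBaseline * population[v]);
            infected[v] = logLikInfected > logLikBaseline;
        }
        return infected;
    }
}
```

Since both rates are fixed, this comparison reduces to a per-locale threshold on the disease count, but the explicit form above carries over directly to the smoothed variants described next.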
5.2. Nearest neighbor algorithm

The nearest neighbor algorithm is identical to the naive algorithm, except that before applying the naive algorithm it performs smoothing based on the nearest neighbors of each vertex. Specifically, the population and disease count of each vertex are replaced by the summed populations and disease counts, respectively, of the vertex and all of its immediate neighbors. As will be shown, this algorithm has a significantly lower false-positive rate than the naive algorithm.

5.3. Kernel smoothing

Like the nearest neighbor algorithm, this algorithm employs the naive algorithm after using smoothing as a preprocessing step. In this case we use kernel smoothing: the population and disease count of each vertex v are replaced by a distance-weighted average of the populations and disease counts, respectively, of all vertices in the graph. Each distance is measured as the shortest-path distance to v, and the weights are given by a Gaussian kernel function applied to the distance. We use a Gaussian kernel with mean 0 and variance 1.

5.4. Greedy growth

In this approach, we conduct a greedy search to find a contiguous subset of vertices that maximizes the ratio of the likelihood that the region is infected to the likelihood that it is not infected. Specifically, for each vertex v ∈ V we do the following:

1. Let R_v = {v}.
2. Let R′ = R_v.
3. For each vertex w ∉ R_v that is a neighbor of a vertex in R_v:
   3.1. If likelihood-ratio(R′ ∪ {w}) exceeds likelihood-ratio(R′), let R′ = R′ ∪ {w}.
4. If R′ = R_v, return R_v. Otherwise let R_v = R′ and go to step 2.

The function likelihood-ratio(R) is defined as:

  likelihood-ratio(R) = [ ∏_{v ∈ R} Poisson(disease-count(v, t), r_infected · population(v)) × ∏_{v ∈ V - R} Poisson(disease-count(v, t), r_baseline · population(v)) ] / [ ∏_{v ∈ V} Poisson(disease-count(v, t), r_baseline · population(v)) ]

where t is the time of the current prediction and Poisson(x, q) denotes the probability of x under a Poisson distribution with parameter q.

The approach just described is similar to Kulldorff's spatial scan statistic (Kulldorff 1997). The main difference is that here we use a greedy algorithm to find a region which maximizes the likelihood ratio, whereas Kulldorff uses an exhaustive search among all regions from a predefined set.

5.5. Temporal smoothing

The temporal smoothing algorithm can act as a wrapper for any of the previous four algorithms. Specifically, this algorithm takes the predictions made by an underlying algorithm and modifies them according to the following rule: if a vertex v is predicted as infected at time t but not at time t-1, predict v as uninfected instead; otherwise go with the original prediction. This simple form of temporal smoothing eliminates many spurious false positives and considerably increases the accuracy of each of the previous four algorithms.
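As a sketch, the wrapper only needs the underlying algorithm's predictions at the current and previous time steps. The names are hypothetical, and we assume here that the time t-1 predictions are the underlying algorithm's raw predictions rather than already-smoothed ones.

```java
// Sketch of the temporal-smoothing wrapper (hypothetical names): a locale is
// reported as infected at time t only if the underlying algorithm predicted it
// infected at both t and t-1.
class TemporalSmoother {
    static boolean[] smooth(boolean[] predictedNow, boolean[] predictedPrevious) {
        boolean[] smoothed = new boolean[predictedNow.length];
        for (int v = 0; v < predictedNow.length; v++) {
            smoothed[v] = predictedNow[v] && predictedPrevious[v];
        }
        return smoothed;
    }
}
```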
5.6. Markov random field approach

The main focus of our efforts has been detecting disease outbreaks using Markov random fields. The mapping between the problem of detecting infected locales and the problem of Markov random field inference was briefly discussed at the end of section 3; here we give further details. The graph G directly defines the neighborhood function of the corresponding MRF (i.e., u is conditionally dependent on v in the MRF if and only if u is connected to v in G). The analog of a "sensor model" (i.e., the probability of each possible observation given the underlying state) is simply given by the Poisson distribution. It remains only to describe how we solve the problem of MRF inference.

Boykov et al. (1998) show how to transform the problem of Markov random field inference (finding a maximum-likelihood set of underlying states for the vertices) into a multiway minimum-cut problem on a graph. The basic idea is to turn the original graph G (i.e., the graph that defines the Markov random field) into a weighted graph G′ with special "label" vertices, one for each state in the state space S. Each of the edges in the original graph G is assigned a uniform weight in G′ (we use a weight of 0.5). Each of the label vertices in G′ (one per state) is connected to each of the non-label vertices with an edge of large weight (described below). We then find a multiway minimum cut that divides G′ into |S| components, with each label vertex residing in a separate component. The state assigned to each vertex is then determined by the label vertex that lies in the same connected component. The "large weights" just mentioned are chosen so that (a) we are guaranteed that no two label vertices will end up in the same connected component and (b) the multiway minimum cut will respect the sensor model (i.e., a vertex is more likely to end up in the same connected component as a label vertex that has high probability given the observation at that vertex). See Boykov et al. for further details.

Boykov et al. provide a local search algorithm for solving this multiway cut problem that is guaranteed to find a cut with locally minimal cost. However, it turns out that our problem is a simpler special case that does not require the use of local search. In our problem, the state space contains just two states, "infected" and "uninfected". The multiway cut problem with two terminals is thus just the standard s-t minimum cut problem, which can be solved efficiently using the Ford-Fulkerson algorithm. The time complexity of the Ford-Fulkerson algorithm is O(|E| · w_cut), where E is the set of edges and w_cut is the sum of the weights of the edges that are cut. We know w_cut is O(|E|), so our algorithm runs in worst-case time O(|E|²). In practice, the actual running time of the Ford-Fulkerson algorithm for this application is closer to linear (Boykov et al. 1998). Nevertheless, if better worst-case time complexity is desired, a randomized algorithm is available that finds minimum cuts in arbitrary graphs in time O(|E| · log³|V|) (Karger 2000). The only reason we do not use it is that Karger's algorithm is quite complex and difficult to code, whereas coding the Ford-Fulkerson algorithm is straightforward. Thus, although our algorithm as coded is quadratic time in the worst case, it could be made near-linear time with additional programming effort. In short, the approach we are advocating is near-linear time, even if our implementation of it is not.
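To make the two-terminal construction concrete, the sketch below builds the weighted graph and labels each locale according to the side of the minimum s-t cut on which it falls. The terminal-edge weights here are negative log Poisson likelihoods, which is one standard way to encode the sensor model in this kind of construction; it is intended as an illustration rather than a reproduction of the exact weights used in Boykov et al. (1998) or in our implementation, and all names are hypothetical. Max flow is computed with Edmonds-Karp, the breadth-first variant of Ford-Fulkerson.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

class MrfMinCutDetector {

    static boolean[] predict(boolean[][] adjacent, int[] population, int[] diseaseCount,
                             double rInfected, double rBaseline, double smoothingWeight) {
        int n = population.length;
        int source = n, sink = n + 1;             // terminal vertices: "infected", "uninfected"
        double[][] cap = new double[n + 2][n + 2];
        for (int v = 0; v < n; v++) {
            // cap[v][sink] is paid when v lands on the source ("infected") side;
            // cap[source][v] is paid when v lands on the sink ("uninfected") side.
            cap[v][sink] = -logPoisson(diseaseCount[v], rInfected * population[v]);
            cap[source][v] = -logPoisson(diseaseCount[v], rBaseline * population[v]);
            for (int u = 0; u < n; u++) {
                if (adjacent[v][u]) cap[v][u] = smoothingWeight;   // uniform weight, e.g. 0.5
            }
        }
        maxFlow(cap, source, sink);                // cap now holds residual capacities
        boolean[] onSourceSide = reachable(cap, source);
        return Arrays.copyOf(onSourceSide, n);     // infected = source side of the min cut
    }

    // log probability of count k under a Poisson distribution with mean lambda > 0
    static double logPoisson(int k, double lambda) {
        double logFactorial = 0.0;
        for (int i = 2; i <= k; i++) logFactorial += Math.log(i);
        return -lambda + k * Math.log(lambda) - logFactorial;
    }

    // Edmonds-Karp max flow on a dense capacity matrix; cap is updated in place
    // and holds residual capacities when the method returns.
    static double maxFlow(double[][] cap, int s, int t) {
        double total = 0.0;
        while (true) {
            int[] parent = new int[cap.length];
            Arrays.fill(parent, -1);
            parent[s] = s;
            Queue<Integer> queue = new ArrayDeque<>();
            queue.add(s);
            while (!queue.isEmpty() && parent[t] == -1) {
                int u = queue.remove();
                for (int v = 0; v < cap.length; v++) {
                    if (parent[v] == -1 && cap[u][v] > 1e-12) { parent[v] = u; queue.add(v); }
                }
            }
            if (parent[t] == -1) return total;     // no augmenting path remains
            double bottleneck = Double.MAX_VALUE;
            for (int v = t; v != s; v = parent[v]) bottleneck = Math.min(bottleneck, cap[parent[v]][v]);
            for (int v = t; v != s; v = parent[v]) {
                cap[parent[v]][v] -= bottleneck;
                cap[v][parent[v]] += bottleneck;
            }
            total += bottleneck;
        }
    }

    // Breadth-first search in the residual graph.
    static boolean[] reachable(double[][] residual, int s) {
        boolean[] seen = new boolean[residual.length];
        Queue<Integer> queue = new ArrayDeque<>();
        seen[s] = true;
        queue.add(s);
        while (!queue.isEmpty()) {
            int u = queue.remove();
            for (int v = 0; v < residual.length; v++) {
                if (!seen[v] && residual[u][v] > 1e-12) { seen[v] = true; queue.add(v); }
            }
        }
        return seen;
    }
}
```

With only two labels the source and sink are separated by definition, so requirement (a) above is automatic, and the terminal edges are left to encode the sensor model.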
5.7. Intertemporal MRF

We also consider an "intertemporal" variant of our MRF algorithm. This algorithm is just like the algorithm described in section 5.6, except that we alter the MRF so that locales have neighbors in time as well as in space. Specifically, when making a prediction at time t, the graph that defines the MRF is the union of the graphs created by the algorithm of section 5.6 for times t, t-1, ..., t-WS+1 (by the union of a set of graphs, we mean a graph <V_U, E_U> where V_U is the union of the vertex sets and E_U is the union of the edge sets). In addition to the edges in E_U, the MRF graph has edges of weight 0.1 between copies of the same vertex at adjacent times. For example, there will be an edge of weight 0.1 from the copy of vertex v in time frame t-1 to the copy of vertex v in time frame t-2. Here WS is a "window size" parameter that indicates how much of the past is considered when making a prediction about the present.

The high-level idea is that these extra temporal edges encourage the algorithm to make the same predictions about a given vertex across time, where possible. This is intended to be a more principled alternative to the temporal smoothing procedure described in section 5.5, though one that is applicable only to the MRF algorithm. Note that because we have only increased the number of edges and vertices in the graph by a constant factor, the time-complexity results described in the previous section still hold.

6. Visualization

We have created a Java application for interactive instance generation, application of inference algorithms, and visualization of results. Figure 1 shows a screen shot of our tool. The left panel shows the world at the time step indicated by the position of the slider. Each circle indicates a locale, and the area of each circle is proportional to the population of the corresponding locale. The color of the circle indicates the disease count of the locale: the red component of the color (on a scale from 0 to 1) is proportional to the fraction of diseased people in the locale, and the green component is equal to the fraction of non-diseased people. If a locale is actually infected (independent of its current disease count), there is a white rim around its circle. The right panel shows the predictions made by the currently selected inference algorithm. Here, green indicates a prediction of "uninfected" while red indicates a prediction of "infected". In this screen shot, all but two locales are predicted correctly.

The user interface is intuitive. In the left panel the user selects various parameters that determine how the world is generated, then clicks "Generate World" to generate an instance of the problem. In the right panel, the user can select an inference algorithm and click the "Estimate" button to display its predictions. VCR-like controls are available in the bottom panel to control animation of the time series of disease counts and predictions.

Figure 1. Visualization tool.

7. Experimental evaluation

To evaluate the performance of the algorithms described in section 5, we ran each algorithm for two sets of 100 trials on problem instances generated with the parameter settings (see Table 2) Graph-type = Random, T = 100, N_locales = 100, t_onset = 10, p_transmission = 0.01, and r_baseline = 0.001. For the first set of 100 trials, we set r_infected = 0.002; these values of r_baseline and r_infected are the same as those used by Neill and Moore (2003). For the second set of 100 trials, we made the problem considerably more difficult by setting r_infected = 0.0012. Each of the first four algorithms discussed in section 5 (Naive, Nearest neighbor, Kernel smoothing, and Greedy) was run both with and without temporal smoothing, and we ran the MRF algorithm as well as the "intertemporal" variant of the MRF algorithm, for a total of 10 algorithms. Because execution of the "intertemporal" version of the MRF algorithm was quite computationally intensive, we performed only 5 trials for this algorithm; the other 9 algorithms went through the full 100 trials. The results of these experiments are presented in Tables 3 and 4. Further discussion of the results follows.
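As a concrete illustration of this setup, the sketch below generates the disease-count time series for a single trial following the procedure of section 4.2, under the first parameter setting. Names are hypothetical, and details the text leaves unspecified (such as the exact ordering of disease spread and observation within a time step) are fixed arbitrarily here.

```java
import java.util.List;
import java.util.Random;

// Sketch of a single trial's data generation (section 4.2); hypothetical names.
class TrialGenerator {
    static int[][] generate(List<int[]> edges, int[] population, int totalSteps,
                            int onsetTime, double rBaseline, double rInfected,
                            double pTransmission, Random rng) {
        int n = population.length;
        boolean[] infected = new boolean[n];
        int[][] counts = new int[totalSteps][n];
        for (int t = 0; t < totalSteps; t++) {
            if (t == onsetTime) infected[rng.nextInt(n)] = true;   // outbreak begins at a random locale
            if (t > onsetTime) {
                boolean[] next = infected.clone();
                for (int[] e : edges) {
                    // each infected endpoint independently infects its neighbor with prob. pTransmission
                    if (infected[e[0]] && rng.nextDouble() < pTransmission) next[e[1]] = true;
                    if (infected[e[1]] && rng.nextDouble() < pTransmission) next[e[0]] = true;
                }
                infected = next;
            }
            for (int v = 0; v < n; v++) {
                double rate = infected[v] ? rInfected : rBaseline;   // e.g., 0.002 vs. 0.001
                counts[t][v] = samplePoisson(rng, rate * population[v]);
            }
        }
        return counts;
    }

    // Knuth's Poisson sampler; adequate for the small means (about 10-20) used here.
    static int samplePoisson(Random rng, double lambda) {
        double threshold = Math.exp(-lambda);
        int k = 0;
        double p = 1.0;
        do { k++; p *= rng.nextDouble(); } while (p > threshold);
        return k - 1;
    }
}
```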
Table 3. Performance of various algorithms with r_infected = 0.002.

  Algorithm                       Accuracy   Avg. false positives   Avg. false negatives
  Naive                           91.1%      8.27                   0.59
  Nearest neighbor                97.7%      1.79                   0.55
  Kernel smoothing                97.1%      1.41                   1.45
  Greedy                          97.9%      1.09                   1.01
  MRF                             98.4%      0.79                   0.85
  Naive (temp. sm.)               98.0%      0.75                   1.26
  Nearest neighbor (temp. sm.)    98.6%      0.21                   1.14
  Kernel smoothing (temp. sm.)    97.3%      0.69                   1.96
  Greedy (temp. sm.)              98.2%      0.08                   1.75
  MRF (intertemporal)             98.3%      0.67                   1.00

Table 4. Performance of various algorithms with r_infected = 0.0012.

  Algorithm                       Accuracy   Avg. false positives   Avg. false negatives
  Naive                           63.5%      34.3                   2.16
  Nearest neighbor                76.2%      21.3                   2.53
  Kernel smoothing                80.1%      17.3                   2.58
  Greedy                          90.2%      5.18                   4.58
  MRF                             93.6%      0.98                   5.40
  Naive (temp. sm.)               83.6%      12.3                   4.10
  Nearest neighbor (temp. sm.)    91.2%      5.05                   3.71
  Kernel smoothing (temp. sm.)    93.0%      3.35                   3.68
  Greedy (temp. sm.)              93.7%      0.42                   5.88
  MRF (intertemporal)             94.8%      0.04                   5.20

Considering only the first four algorithms in Table 3, the greedy algorithm outperforms the remaining three, followed by nearest neighbor, kernel smoothing, and the naive algorithm. In Table 4, the rankings of kernel smoothing and nearest neighbor are reversed. In both tables we see that the large number of false positives produced by the naive algorithm is reduced by spatial smoothing (either nearest-neighbor smoothing or kernel smoothing) and is reduced even further by temporal smoothing, though at the cost of an increase in false negatives. The effect of temporal smoothing was qualitatively the same on all four algorithms: it decreased the false-positive rate and increased the false-negative rate. Both of these results were expected. The effect of temporal smoothing on overall accuracy was positive in all four cases.

In both sets of experiments the algorithms based on Markov random fields did very well. In the first set of experiments, the two MRF algorithms were outperformed only by the nearest neighbor algorithm with temporal smoothing, which outperformed the MRF algorithm only by a small margin (0.2%). In the second set of experiments, the two MRF algorithms were the clear winners. We believe that if the problems were made still more difficult, the gap between the MRF algorithms and the others would increase even further.

Figure 2 illustrates one frame of a time series generated on a grid world, along with the predictions made by five of the algorithms given in Table 3. The high false-positive rate of the naive algorithm, and the elimination of these false positives by various forms of smoothing, are apparent in the figure.

Figure 2. (a) an example world; (b) predictions of the naive algorithm; (c) predictions of the nearest-neighbor algorithm; (d) predictions of kernel smoothing; (e) predictions of the greedy algorithm; (f) predictions of the greedy algorithm with temporal smoothing.

8. Conclusions

We have developed a total of 10 algorithms for detection of disease outbreaks and have compared their prediction accuracy under two parameter settings.
The algorithms based on Markov random fields consistently outperform the other algorithms, including algorithms based on simpler forms of smoothing. Furthermore, we are able to obtain high classification accuracy even when the disease rate of infected locales is only 1.2 times the disease rate of uninfected locales, a level of performance that appears to be quite beyond that of the human visual system. On the harder of the two parameter settings, the advantage of Markov random fields is especially apparent. Although we have not derived rigorous measures of statistical significance for the predictions made by our algorithms, and though statistical significance is clearly important to epidemiologists, we are pleased with the high classification accuracy we have obtained. Further work might consider how our approach could be combined with the techniques of Neill and Moore (2003), perhaps as a pre- or post-processing step.

References

1. Boykov, Y., Veksler, O., and Zabih, R. 1998. Markov random fields with efficient approximations. IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA.
2. Karger, D. R. 2000. Minimum cuts in near-linear time. Journal of the ACM 47(1): 46-76.
3. Kulldorff, M. 1997. A spatial scan statistic. Communications in Statistics: Theory and Methods 26(6): 1481-1496.
4. Hochbaum, D. S. 2001. An efficient algorithm for image segmentation, Markov random fields and related problems. Journal of the ACM 48(4): 686-701.
5. Li, S. Z. 1995. Markov Random Field Modeling in Computer Vision. http://research.microsoft.com/~szli/MRF_Book/Chapter_1/node11.html
6. Neill, D. B. and Moore, A. W. 2003. A fast multi-resolution method for detection of significant spatial disease clusters. Carnegie Mellon University Dept. of Computer Science, Technical Report CMU-CS-03-154.