Extending Continuous Time Bayesian Networks

Karthik Gopalratnam and Henry Kautz and Daniel S. Weld
Department of Computer Science
University of Washington
Seattle, WA-98195-2350
{karthikg,kautz,weld}@cs.washington.edu

Abstract

Continuous-time Bayesian networks (CTBNs) (Nodelman, Shelton, & Koller 2002; 2003) are an elegant modeling language for structured stochastic processes that evolve over continuous time. The CTBN framework is based on homogeneous Markov processes, and defines two distributions with respect to each local variable in the system, given its parents: an exponential distribution over when the variable transitions, and a multinomial over what the next value is. In this paper, we present two extensions to the framework that make it more useful in modeling practical applications. The first extension models arbitrary transition time distributions using Erlang-Coxian approximations, while maintaining tractable learning. We show how the censored data problem arises in learning the distribution, and present a solution based on expectation-maximization initialized by the Kaplan-Meier estimate. The second extension is a general method for reasoning about negative evidence, by introducing updates that assert no observable events occur over an interval of time. Such updates were not defined in the original CTBN framework, and we show that their inclusion can significantly improve the accuracy of filtering and prediction. We illustrate and evaluate these extensions in two real-world domains, email use and GPS traces of a person traveling about a city.

Introduction

Many problems in artificial intelligence involve reasoning about complex stochastic systems that evolve over time. Dynamic Bayesian networks (DBNs) (Dean & Kanazawa 1989) are a popular approach, which provides a factored representation of discrete-time processes.
Unfortunately, DBNs use a discrete temporal representation, which is awkward and computationally expensive when the absolute time of events is important and observations occur irregularly. In such a case the granularity of each DBN time slice must correspond to the smallest possible interval between observations. For example, if observations are usually separated by hours but occasionally by a second, then a DBN update must be performed every second, unless the absolute time of the observations is to be ignored.

Continuous-time Markov chains provide a way of describing discrete-state systems that evolve according to exponential time distributions, and are widely used in operations research, astronomy, and biology. The most common form, called a homogeneous Markov process, is represented by a matrix that is quadratic in the size of the state space. A homogeneous Markov process can be projected to any real-valued future time point with a single matrix exponential operation, so updates by projection and conditioning on observations are similarly efficient.

Copyright © 2005, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Nodelman et al. [2002] introduced continuous time Bayesian networks (CTBNs), which are a factored representation of continuous-time Markov chains. As with other graphical representations, CTBNs can provide a compact representation of a domain by making explicit many of the conditional independence relationships between variables. Also like other factored representations such as Bayesian networks or DBNs, the small size of the representation is no guarantee that exact inference is tractable. Indeed, the only known exact inference method for CTBNs is compilation to an explicit homogeneous Markov process. Nodelman et al. present an approximate inference method based on the clique tree inference algorithm. Space precludes description of our own particle filtering approximation algorithm.
Beyond issues of worst-case complexity (which affect all probabilistic representations), CTBNs have two significant limitations. First, CTBNs are limited to modeling processes with exponential time distributions, but other distributions frequently occur in practice. Second, the only update method in the original formulation of CTBNs is based on the value of a variable at a single point in time. However, in many systems sensors operate continuously, reporting only when a change occurs. For example, a motion detector is silent until it senses activity and a phone rings only when an incoming call is present. When modeling these systems it is essential to reason about evidence of the form “variable $X$ stayed at value $x$ for the entire interval $[t_1, t_2]$.” Because of the continuous nature of time, such interval updates are not equivalent to a finite series of point-based updates. This second limitation is an example of the general negative evidence problem that arises in modeling any dynamic system: that is, how to efficiently incorporate observations of the myriad possible events that did not occur.

In this paper, we resolve these limitations, presenting two extensions to the CTBN framework.

• We show how to model arbitrary transition time distributions using Erlang-Coxian approximations, while maintaining tractable learning. We show how the censored data problem arises in learning time distributions, and present a solution based on expectation-maximization initialized by the Kaplan-Meier estimate.

• We present a general method for reasoning about negative evidence, by introducing updates that assert no observable events occur over an interval of time. We demonstrate that interval evidence updates can significantly improve the accuracy of filtering and prediction.

AAAI-05 / 981

We illustrate and evaluate these extensions in two real-world domains.
The first is a model of a person’s use of email, which can be used to predict when the person will reply to an incoming message. The second models the pattern of a person’s movements through a city using various modes of transportation. The model is trained on a log of GPS (global positioning system) data, and can be used to predict the time it takes for a user to reach locations of interest, etc.

Continuous-Time Bayesian Nets (CTBNs)

In this section, we summarize certain key notions about continuous time Bayesian networks (CTBNs) presented in (Nodelman, Shelton, & Koller 2002). A CTBN represents a stochastic process over a structured state space consisting of assignments to a set of local variables. CTBNs describe the dynamics of the temporal evolution of this structured state space in terms of the dependencies among the evolution of the local variables as follows.

Homogeneous Markov Process

Let $X$ be a state variable with finite domain $Val(X) = \{x_1, \ldots, x_n\}$ whose value changes continuously over time. We may define the dynamics of this system in terms of a homogeneous Markov process, $X(t)$, by defining its intensity matrix (IM) as follows:

\[
Q_X = \begin{bmatrix}
-q_1^x & q_{12}^x & \cdots & q_{1n}^x \\
q_{21}^x & -q_2^x & \cdots & q_{2n}^x \\
\vdots & \vdots & \ddots & \vdots \\
q_{n1}^x & q_{n2}^x & \cdots & -q_n^x
\end{bmatrix}
\]

where $q_i^x = \sum_{j \neq i} q_{ij}^x$. Intuitively, the diagonal elements specify how long the system will remain in a state, and the other elements specify the probability of transitioning between states. Specifically, the probability density function for the time spent in a state $x_i$ is $f(t) = q_i^x e^{-q_i^x t}$, with corresponding distribution function $F(t) = 1 - e^{-q_i^x t}$. When $X$ leaves state $x_i$, it will transition to $x_j$ with probability $q_{ij}^x / q_i^x$. We can project an initial probability distribution described by a vector $P_X^0$ to a future point $t$ by computing $P_X(t) = P_X^0 \,\mathrm{expm}(Q_X t)$. The function expm(·)
refers to matrix exponentiation, and is the mathematical operation that considers all possible trajectories of $X$ through its state space up to time $t$.

Subsystems of Markov Processes

A subsystem $S$ describes the behavior of the process over a subset of the full space, i.e. $Val(S) \subset Val(X)$. The intensity matrix of $S$, $Q_S$, can be formed by considering only those entries from $Q_X$ that correspond to states in $S$. Since in general a subsystem is not closed, we can form queries about when the system will enter or exit the subsystem, and the amount of time spent there. This last quantity, called the holding time, has the distribution $F(t) = 1 - P_S^0 \,\mathrm{expm}(Q_S t)\,\vec{e}$, where $P_S^0$ is the entrance distribution and $\vec{e}$ is a column vector of ones.

Continuous Time Bayesian Networks

The joint dynamics of several local variables is captured in the notion of the conditional Markov process, which forms the basic building block of the CTBN framework. A conditional Markov process $Y$ is an inhomogeneous Markov process whose IM varies as a function of the values of a set of conditioning variables $\vec{V}$, referred to as the parents of $Y$. For each instantiation of values $\vec{v}$ to $\vec{V}$, the variable $Y$ is governed by an intensity matrix $Q_{Y|\vec{v}}$. The full set of matrices for $Y$ is called its conditional intensity matrix (CIM) $Q_{Y|\vec{V}}$. A CTBN is then formed by composing these local conditional processes. A graphical structure encoding the dependencies between the variables in the system, together with an initial distribution over this state space and the local CIMs for each variable, completely specifies the CTBN.

The semantics of CTBNs can be understood in two ways. The first is based on viewing the entire system as a composite IM describing a homogeneous Markov process over the entire joint state space via an amalgamation operation which combines all the local CIMs to produce the Joint Intensity Matrix.
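As a rough illustration of amalgamation and projection, the sketch below builds a joint intensity matrix for a hypothetical two-variable CTBN (a parentless Boolean Y and a Boolean X whose parent is Y) and projects an initial distribution with a matrix exponential. The rates, the state ordering, and the helper names `expm`, `amalgamate`, and `project` are illustrative assumptions rather than anything specified in this paper. The amalgamation rule assumed here is the usual CTBN one: each joint transition changes exactly one variable, at the rate given by that variable's CIM under the current parent values.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(q, t, squarings=10, terms=20):
    """Approximate exp(q * t) via a scaled Taylor series and repeated squaring."""
    n = len(q)
    a = [[x * t / 2 ** squarings for x in row] for row in q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[r + s for r, s in zip(rr, ss)] for rr, ss in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def amalgamate(q_y, q_x_given_y):
    """Joint intensity matrix over joint states (y, x), indexed as 2 * y + x."""
    joint = [[0.0] * 4 for _ in range(4)]
    for y in range(2):
        for x in range(2):
            s = 2 * y + x
            joint[s][2 * (1 - y) + x] = q_y[y][1 - y]             # only Y changes
            joint[s][2 * y + (1 - x)] = q_x_given_y[y][x][1 - x]  # only X changes
            joint[s][s] = -sum(joint[s])  # diagonal makes the row sum to zero
    return joint


def project(p0, q, t):
    """Row vector p0 times exp(q * t): the distribution at time t."""
    e = expm(q, t)
    return [sum(p0[i] * e[i][j] for i in range(len(p0))) for j in range(len(p0))]


# Hypothetical rates: Q_Y for the parentless variable, one matrix for X per value of Y.
Q_Y = [[-1, 1], [2, -2]]
Q_X_GIVEN_Y = [[[-3, 3], [4, -4]], [[-0.5, 0.5], [0.1, -0.1]]]

joint = amalgamate(Q_Y, Q_X_GIVEN_Y)
p_half = project([1.0, 0.0, 0.0, 0.0], joint, 0.5)  # start in (y=0, x=0)
```

Entries for joint transitions that would change both variables at once stay zero, which is what keeps the local CIMs recoverable from the joint matrix. The explicit matrix is exponential in the number of variables, so this is only practical for very small networks.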
The evolution of the CTBN is then completely specified by this Joint Intensity Matrix. The other is based on a generative perspective where the CTBN is viewed as defining a generative model over a set of events which correspond to variables in the system taking on certain values at specific times.

Erlang-Coxian CTBNs (EC-CTBNs)

Because CTBNs rely on well-behaved exponential distributions for modeling temporal distributions, they submit to easy analytic treatment. Unfortunately, many real-world distributions cannot be modeled accurately with exponentials, so we seek a representation which combines the benefits of CTBNs with greater expressive power. Our solution is to use phase-type (PH) distributions (Neuts 1981), which also obey the Markov property and hence maintain analytical tractability. Phase-type distributions have been widely used to closely approximate general distributions and hence provide great expressiveness. Specifically, we employ the Erlang-Coxian (EC) distribution, which has the attractive property that it is described by a small number of free parameters which are easily computed in closed form.

Phase-Type Distributions

An $n$th-order PH distribution is specified in terms of the absorption time of a corresponding Markov chain consisting of $n$ states (phases), where the $i$-th phase has an exponentially distributed holding time with rate $\lambda_i$. To fully specify this process one must dictate the entrance distribution, $\vec{\tau} = [p_{01} \ldots p_{0n}]$, where $p_{0i}$ denotes the probability that the chain starts in phase $i$, and an $n \times n$ infinitesimal generator matrix, $T$, whose entries, $p_{ij}$, encode the probability of transitioning from phase $i$ to phase $j$. The cumulative distribution function is $F(t) = 1 - \vec{\tau}\,\mathrm{expm}(T t)\,\vec{e}$. The i-th
The i-th AAAI-05 / 982 Figure 1: Markov chain for an Erlang-Coxian distribution ~ ·~ component of the chain’s exit distribution, u ~ = T e, is the probability that the final absorbing state is reached by transitioning from the chain’s i-th phase. For many years, researchers have devised methods for approximating an arbitrary distribution, G, with a phase-type distribution P H. We make the common assumption that P H is a good match to G when its first k moments equal ~ such those of G. Thus we wish to find values for ~ τ and T that the first k moments match. Note that this search space is very large, since we must find values for Θ(n2 ) parameters, when considering PH-distributions of order n. Recently, (Osogami & Harchol-Balter 2003b; 2003a) published an efficient and elegant approximation method. Their approach is efficient because it restricts the search to the space of Erlang-Coxian (EC) distributions — distributions whose Markov chain consists of n − 2 phase Erlang distribution followed by a two-phase Coxian+ distribution as specified in Figure 1. Note that the rates λ are the same for all Erlang phases, and that the only way to exit the chain is via the first phase of the Erlang distribution or one of the two Coxian phases. The EC distribution is therefore completely described by just six free parameters, (n, p, λ, λC1 , λC2 , pC ), which can be calculated in closed form given the first three moments of any distribution G. Generalized Markov Processes Let X be a random variable governed by a Markov process whose transition intensities may vary over time (i.e., nonhomogenous). We can model X(t) with a generalized Markov process, X̂, which is a homogeneous Markov process over an extended state space. We construct the generalized process by approximating each holding-time distribution of the original process with an EC-distribution. Each atomic transition of the original process X(t) is then represented by a subsystem of X̂. 
As a first step, let $G_{x_i}$ denote the distribution over holding times for each value $x_i \in Val(X)$ in the original process. We now approximate each $G_{x_i}$ with an EC distribution represented by a chain with phases $S_{x_i} = \{x_{i1}, \cdots, x_{iN_i}\}$, generator matrix $T_{x_i}$ and exit distribution $\vec{u}_{x_i}$. The (homogeneous) generalized Markov process for $X$ is defined as a Markov process over $\hat{X}$, where $Val(\hat{X}) = \bigcup_{i=1}^{n} S_{x_i}$. The number of states in this new process is $N_{\hat{X}} = |Val(\hat{X})| = \sum_{i}^{n} N_i$. The dynamics of the generalized process are governed by $Q_{\hat{X}}$, an $N_{\hat{X}} \times N_{\hat{X}}$ matrix, called the generalized intensity matrix (GIM), which is computed by amalgamating the Markov chains $T_{x_i}$ and weighting the transitions between these subsystems by the exit distributions from the respective chains.

In other words, $Q_{\hat{X}}$ contains $n$ subsystems, one for each possible value $x_i \in Val(X)$. We call these original values of $X$ the base values of the generalized Markov process $\hat{X}$. By construction, the GIM has semantics such that $(\hat{X} = x_{ik}) \Longrightarrow (X = x_i)$, where $x_{ik}$ is the $k$-th state in the underlying Markov chain for $X = x_i$.

The GIM can now be queried in the same way that a simple intensity matrix can, e.g., in order to determine the distribution over the states in time. Given an initial distribution $\hat{P}_0$, the new distribution over the states of $\hat{X}$ is given by $\hat{P}_t = \hat{P}_0 \,\mathrm{expm}(Q_{\hat{X}} t)$. Note that this posterior distribution is over all the states of the extended process. The probability of a particular base value is $P_t(X = x_i) = \sum_{j=1}^{N_i} \hat{P}_t[x_{ij}]$.

The EC-CTBN Model

We can compose generalized Markov processes in the same manner as we did for conditional Markov processes above. Let $Y(t)$ be a generalized Markov process whose dynamics are conditioned on a set of other variables $\mathbf{V}$ that themselves evolve as generalized Markov processes.
Then a generalized conditional intensity matrix (GCIM) for $Y$ is a set of generalized intensity matrices, one for each instantiation of base values to the conditioning variables $V_i \in \mathbf{V}$. GCIMs enable us to model the local dependence of one variable on a set of other variables, in the same way that CIMs did for CTBNs.

We now define EC-CTBNs. Let $\mathcal{X} = \{X_1, \ldots, X_k\}$ be a set of discrete random variables. An Erlang-Coxian continuous-time Bayesian network (EC-CTBN) is a triple $\langle P_{\mathcal{X}}^0, \mathcal{G}, \mathcal{Q} \rangle$. $P_{\mathcal{X}}^0$ is a factored probability distribution over the initial values of the $X_i$. $\mathcal{G}$ is a directed graph whose nodes are $\mathcal{X}$, and $Par(X_i)$ denotes the parents of $X_i$ in $\mathcal{G}$. $\mathcal{Q}$ is a set of generalized conditional intensity matrices containing $Q_{X_i|Par(X_i)}$ for each $X_i \in \mathcal{X}$.

The global semantics of the EC-CTBN are understood in terms of the amalgamation operation over the GCIMs, which define the EC-CTBN. The Joint Intensity Matrix defined by this operation represents a homogeneous Markov process over the extended state space that includes the states corresponding to the phases of the various EC chains. However, the evolution of the probability distributions over the extended state space, when projected onto the corresponding base values, reflects the different distributions over the holding times of the variables in their base states.

Learning EC-CTBNs

Maximum likelihood estimation (MLE) of CTBN parameters is efficient due to the special nature of the exponential distribution. As described in (Nodelman, Shelton, & Koller 2003), the parameters for a variable $X$ can be calculated from two sets of simple statistics: $T[x|\vec{v}]$, the total amount of time $X$ spent in state $x$, conditioned on its parent variables taking on the value $\vec{v}$; and $M[x, x'|\vec{v}]$, the total number of times $X$ transitioned from $x$ to $x'$ for each pair of states. Then the MLE parameters of the model are given by

\[
\hat{q}_{x|\vec{v}} = \frac{M[x|\vec{v}]}{T[x|\vec{v}]} \quad \text{and} \quad \hat{q}_{xx'|\vec{v}} = \frac{M[x, x'|\vec{v}]}{T[x|\vec{v}]},
\]

where $M[x|\vec{v}] = \sum_{x'} M[x, x'|\vec{v}]$.

Extending the expressive power of CTBNs to handle general non-exponential time distributions using hidden state is challenging due to the infinitely large space of possible trajectories of the hidden variable. In the special case of EC distributions, these trajectories are captured by the Markov chain described in closed form by the six parameters described above. Thus, at first glance, the problem of learning EC-CTBNs might appear to be solved, because the six parameters can be estimated from the first three moments of the data.

However, a problem arises with this simple approach, because the time that a variable spends in a state may be governed by more than one instantiation of its parents. Consider the example of two Boolean variables $X$ and $Y$, where $Y = Par(X)$. Suppose $X$ switches to the value 1 while $Y$ is 0. After a while $Y$ switches to 1 and $X$ remains unchanged, as shown in Figure 2. The problem is how to account for the measurement of the interval (ii). It cannot be directly counted in $T[X = 1 \mid Y = 0]$, because it ends when $Y$ switched, not because $X$ switched. Nor can the union of intervals (ii) and (iii) be directly counted toward $T[X = 1 \mid Y = 0]$. Furthermore, discarding measurements of this form could lead to an arbitrarily poor learned model, since the observation does tell us that our estimate of $\hat{q}_{X=1|Y=0}$ should be constrained to account for the fact that in this case $X$ did not switch before $Y$. Intuitively, we should adjust the length of (iii) to be what it would have been if $Y$ had not switched first (see footnote 1).

Figure 2: Interval (i) counts toward $T[X = 0 \mid Y = 0]$; (ii) is a censored interval associated with $T[X = 1 \mid Y = 0]$; (iii) counts toward $T[X = 1 \mid Y = 1]$.

The problem of handling such truncated, or censored, data is studied in the field of failure analysis (Crowder et al. 1994).
While classic failure analysis has been concerned with fitting censored data to simple distributions (such as the normal, exponential, Weibull, etc.), there has been some work on fitting phase-type distributions of fixed order to censored data using expectation-maximization (EM) (Asmussen, Nerman, & Olsson 1996). We generalize this approach to fit EC distributions of arbitrary order as follows. We record the training data $D_X$ for each variable $X$ as a set of tuples $\langle x, \vec{v}, t, c \rangle$, where $x$ is a value of $X$, $\vec{v}$ is an instantiation of the parents of $X$, $t$ is the length of the interval, and $c = 0$ if the interval ends with a transition of $X$, or $c = 1$ (censored) if it ends with a transition of a parent of $X$. We then assign the weight of each censored element in $D_X$ evenly among all the longer non-censored elements (a technique known as the Kaplan-Meier estimate), and calculate an initial estimate of the parameters of the best EC approximation of the weighted, non-censored data using Osogami & Harchol-Balter’s algorithm. Then, in standard EM fashion, the current EC approximation is used to re-estimate the values of censored data points, and the process repeats until convergence. The final result is used to define the generalized intensity matrix $Q_{\hat{X}}$.

Footnote 1: This problem doesn’t affect CTBNs, because of mathematical properties specific to the exponential distribution. However, this immunity fails when the variables can have arbitrary distributions over holding times. See (Nodelman, Shelton, & Koller 2003).

In order to avoid having to re-design the Osogami & Harchol-Balter procedure to accept as input probability distributions over possible values for the censored data, one can employ a Monte-Carlo version of EM, where the expectation step draws a number of samples for each censored point.
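The initialization step described above can be sketched as follows. This is a literal reading of the sentence about sharing each censored element's weight among the longer non-censored elements, not the authors' implementation. The function names are hypothetical, the data are durations for a single (state, parent instantiation) pair, and a censored duration with no longer non-censored duration is simply dropped, which is an assumption. The three weighted moments are the inputs that a moment-matching fit would then consume.

```python
def redistribute_censored(durations):
    """durations: list of (t, c) pairs with c = 1 if censored, else 0.

    Returns (t, weight) pairs for the non-censored durations, after each
    censored duration has shared its unit weight evenly among the
    non-censored durations that are strictly longer.
    """
    observed = [t for t, c in durations if c == 0]
    weights = [1.0] * len(observed)
    for t_cens, c in durations:
        if c != 1:
            continue
        longer = [i for i, t in enumerate(observed) if t > t_cens]
        for i in longer:                  # empty when nothing is longer
            weights[i] += 1.0 / len(longer)
    return list(zip(observed, weights))


def weighted_moments(weighted, k=3):
    """First k raw moments of the weighted, non-censored durations."""
    total = sum(w for _, w in weighted)
    return [sum(w * t ** j for t, w in weighted) / total for j in range(1, k + 1)]


# Made-up durations in seconds; the 3.0 s interval was cut short by a parent change.
data = [(2.0, 0), (3.0, 1), (5.0, 0), (7.0, 0)]
weighted = redistribute_censored(data)
moments = weighted_moments(weighted)
```

In this toy data set the censored 3.0 s interval contributes half a unit of weight each to the 5.0 s and 7.0 s intervals, which pulls the estimated mean above the naive average of the three completed intervals.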
In fact, in the experiments described below we found that EM converged to the same values even if each censored point was simply replaced by its analytically derived expected value, which led to a very efficient implementation.

Handling Evidence over Time Intervals

The second major drawback of CTBNs is their inability to model interval evidence, i.e. evidence of the form “Variable $X$ stayed at value $x_i$ during time interval $[t_1, t_2]$.” Interval evidence arises frequently in a variety of practical situations. For example, a motion detector might report that no one entered a room between 8:30 and 10:00 AM. This inability stems from the semantics of homogeneous Markov process evolution, which is described in terms of the exploration of the trajectories of variables during a given interval. The posterior distribution $P_t = P_0 \,\mathrm{expm}(Q t)$ for an intensity matrix $Q$ includes the effects of an arbitrary number of transitions at an arbitrary number of time points in the interval $[0, t]$. Since this quantification over all trajectories (including those where $X$ temporarily shifted away from $x_i$) is implicit in the use of matrix exponentiation for projection, there appears to be no way to benefit from such evidence.

However, transforming the state space once again solves the problem. Indeed, our solution works on both CTBN and EC-CTBN models. The crux of the issue is the need to measure the probability mass of the subset of trajectories consistent with the evidence, or equivalently, to prevent probability mass from accumulating in trajectories which violate the interval evidence. The key insight is that we can enforce the constraint $X = x_i, \forall t \in [t_1, t_2]$ by creating and conditioning on a new variable which records whether the observed variable, e.g., $X$, changed state during the interval. Specifically, let $\mathcal{X}$ be a set of discrete random variables containing $X$.
Let $\zeta$ be a new Boolean variable; intuitively, $\zeta = 0$ will signify that $X$ did not change value anywhere in a prediction interval, while $\zeta = 1$ indicates that there was at least one transition. In other words, $\zeta$ partitions the space of all trajectories over $\mathcal{X}$, based on whether or not $X$ transitioned.

We create a new (augmented) intensity matrix, $Q_\zeta$, over $\mathcal{X} \cup \{\zeta\}$ as follows. Let $Q_{\mathcal{X}}$ be an $n \times n$ intensity matrix for the amalgamated system, and let $X$ be a state variable for which we expect interval evidence. We define $Q_{\mathcal{X}}$’s violation matrix, $V$, to be an $n \times n$ matrix whose entries, $v_{ij}$, are defined as follows. If $X$ has different values in states $i$ and $j$ of the amalgamated Markov process, then $v_{ij}$ equals the $ij$-th element of $Q_{\mathcal{X}}$; otherwise $v_{ij} = 0$. Intuitively, $V$ records the intensity of all transitions where $X$ might change value, violating the evidence. We now define the $2n \times 2n$ augmented intensity matrix, $Q_\zeta$:

\[
Q_\zeta = \begin{bmatrix} Q_{\mathcal{X}} - V & V \\ 0 & Q_{\mathcal{X}} \end{bmatrix}
\]

$Q_\zeta$’s four quadrants correspond to the transitions between possible values of $\zeta$. In the upper left, $\zeta$ remains zero, so transitions which change $X$ are disallowed. The upper right denotes transitions where $X$ changes value for the first time; here $\zeta$ also changes from zero to one. The bottom left is all zeros because $\zeta$ cannot transition from one to zero. The bottom right is the original intensity matrix, since all transitions are allowed once $\zeta = 1$.

For example, consider a simple CTBN with two Boolean nodes, $\mathcal{X} = \{Y, X\}$, where $Y$ is the parent of $X$, and $X$ is the variable about which we receive interval observations. Let the joint intensity matrix be:

\[
Q_{XY} = \begin{bmatrix}
-3 & 2 & 1 & 0 \\
2 & -4 & 0 & 2 \\
1 & 0 & -2 & 1 \\
0 & 1 & 3 & -4
\end{bmatrix}
\]

Then the augmented intensity matrix is

\[
Q_\zeta = \begin{bmatrix}
-3 & 0 & 1 & 0 & 0 & 2 & 0 & 0 \\
0 & -4 & 0 & 2 & 2 & 0 & 0 & 0 \\
1 & 0 & -2 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & -4 & 0 & 0 & 3 & 0 \\
0 & 0 & 0 & 0 & -3 & 2 & 1 & 0 \\
0 & 0 & 0 & 0 & 2 & -4 & 0 & 2 \\
0 & 0 & 0 & 0 & 1 & 0 & -2 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 3 & -4
\end{bmatrix}
\]

We can use $Q_\zeta$ to propagate an initial distribution $P_{t_1}$ which has $\zeta = 0$. Given the posterior $P_{t_2}$, we discard any probability mass assigned to $\zeta = 1$ and re-normalize.
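The construction can be sketched in a few lines. The code below builds the violation matrix and the augmented matrix for the 4 × 4 example, propagates a distribution that starts with $\zeta = 0$, and conditions on $\zeta = 0$. The numeric entries are our reading of the example matrix, and the state ordering (y0,x0), (y0,x1), (y1,x0), (y1,x1) is an assumption inferred from its zero pattern. The function names and the small Taylor-series `expm` are illustrative, not the authors' code.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(q, t, squarings=10, terms=20):
    """Approximate exp(q * t) via a scaled Taylor series and repeated squaring."""
    n = len(q)
    a = [[x * t / 2 ** squarings for x in row] for row in q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[r + s for r, s in zip(rr, ss)] for rr, ss in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def violation_matrix(q, x_of_state):
    """v[i][j] = q[i][j] when X differs between joint states i and j, else 0."""
    n = len(q)
    return [[q[i][j] if x_of_state[i] != x_of_state[j] else 0.0 for j in range(n)]
            for i in range(n)]


def augment(q, v):
    """Block matrix [[Q - V, V], [0, Q]] over the joint state plus zeta."""
    n = len(q)
    top = [[q[i][j] - v[i][j] for j in range(n)] + v[i][:] for i in range(n)]
    bottom = [[0.0] * n + q[i][:] for i in range(n)]
    return top + bottom


def interval_update(p_t1, q, x_of_state, duration):
    """Posterior at t2 given that X did not change during [t1, t2]."""
    n = len(q)
    q_zeta = augment(q, violation_matrix(q, x_of_state))
    start = list(p_t1) + [0.0] * n             # all mass starts with zeta = 0
    e = expm(q_zeta, duration)
    post = [sum(start[i] * e[i][j] for i in range(2 * n)) for j in range(2 * n)]
    kept = post[:n]                            # discard the mass with zeta = 1
    total = sum(kept)
    return [x / total for x in kept]


Q_XY = [[-3.0, 2.0, 1.0, 0.0],
        [2.0, -4.0, 0.0, 2.0],
        [1.0, 0.0, -2.0, 1.0],
        [0.0, 1.0, 3.0, -4.0]]
X_OF_STATE = [0, 1, 0, 1]                      # assumed ordering of joint states

q_zeta = augment(Q_XY, violation_matrix(Q_XY, X_OF_STATE))
posterior = interval_update([1.0, 0.0, 0.0, 0.0], Q_XY, X_OF_STATE, 1.0)
```

After conditioning, the states in which $X$ differs from its observed value carry no mass, while the relative mass on the two $Y$ values reflects how likely each trajectory was to avoid an $X$ transition during the interval.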
This maintains the original CTBN semantics, but enforces the fact that $X$ didn’t change during the interval. Note that we need just one $\zeta$ variable to handle interval evidence over an arbitrary number of variables (see footnote 2); in such a case, the semantics of $\zeta = 0$ is that none of the observed variables changed during the interval. Furthermore, note that since our method simply transforms a homogeneous Markov process into one with an additional variable, it may be used with any inference algorithm.

Footnote 2: Note that while different intensity matrices are needed for different sets of evidence, the $Q_\zeta$ can be generated very quickly from the block form $Q_\zeta = \begin{bmatrix} Q_{\mathcal{X}} - V & V \\ 0 & Q_{\mathcal{X}} \end{bmatrix}$.

Experiments

Our evaluation addresses two questions. First, does the generality and expressive power of our extension to EC distributions provide benefits on real problems? Second, does our method for handling interval evidence matter, or are the effects of such observations insignificant? To answer these questions, we chose two domains: email response behavior and the day-to-day activity of a human navigating between home, work and other locations as recorded by GPS data.

Figure 3: EC-CTBN and CTBN approximation overlaid on email response-time data. (Plot: frequency histogram with the learned EC-CTBN and learned CTBN curves; x-axis: Response Time in Seconds; y-axis: Frequency.)

Our email domain models an email-response sequence with three variables. $U$ denotes user state and has two values, online and offline. Variable $S$, conditioned on $U$, denotes message staleness (intuitively, a measure of how long the message has remained unanswered) and has four values $\{0, 1, 2, \geq 3\}$ which denote the number of times the user has gone offline without replying to the message. The third variable, $E$, denotes the email state and has two values, waiting for reply and replied; $E$ is conditioned on both $U$ and $S$.
Given three months’ worth of the author’s actual email, we found 533 email-response pairs. We represented each of these examples as a variable-length sequence of events (e.g., message arrival, user goes offline, and user replies). We estimated values for $U$ at the time of these events based on his proximal email activity. Then using 5-fold cross validation, we split the data into training sets of 433 and test sets of 100 message-response episodes.

In order to evaluate EC-CTBN efficacy, we learned a CTBN and an EC-CTBN for the network mentioned above. (The learned EC-CTBN had 288 states compared to the CTBN’s 16.) We then queried the two models for the probability density over reply events in the hold-out set. Figure 3 shows the EC-CTBN and (traditional) CTBN density function over these reply times mapped on top of the actual frequency distribution for a representative hold-out set. Visually, it is apparent that the EC-CTBN does a better job of matching the actual distribution. We confirmed this quantitatively by computing the ratio of the probability densities predicted by the two models for every reply in the hold-out set. Averaging the mean over 5 runs gave a value of 1.38, i.e. the EC-CTBN was almost forty percent more likely to predict the correct answer than the ordinary CTBN.

In our second test, we evaluated the effect of our technique for handling negative evidence in the form of interval observations. Here we compared the learned CTBN model to an augmented version (described in the previous section) of the same CTBN model. We tested on the same hold-out set as in the last experiment, but incorporated evidence of the form “User was online from $t_i$ to $t_j$.” Figure 4 shows that the likelihood of the data under the CTBN augmented with interval evidence is substantially higher than under a CTBN without interval evidence. The latter model predicts a longer response time, since it erroneously thinks the user may be offline and hence less responsive.
Similar experiments show an augmented EC-CTBN outperforms a standard EC-CTBN on the same data.

Figure 4: Average likelihood of evidence vs. the time to reply to email. (The Y axis measures likelihood times $10^{-5}$.) Since it can exploit interval evidence, the augmented CTBN predictions are much more accurate than those of the initial CTBN. (Plot: curves labeled EC-CTBN Likelihood and CTBN Likelihood; x-axis: E-Mail Response Times for E-Mails Sent When User Was Online; y-axis: Likelihood.)

We next consider a different domain: a month’s worth of GPS trace data, recording an individual’s movement about a city by foot, car, and bus. Specifically, we used data from (Liao, Fox, & Kautz 2004) in which raw, time-stamped GPS data is clustered to find locations, i.e., coordinates with large dwell times. Next, a topological movement model is constructed by defining a graph whose nodes are locations and whose edges are called trip segments. At any point the user’s activity is either her current location or the trip segment on which she is engaged. The raw data is hand-annotated so that every time point is hierarchically labeled with a GPS coordinate, an activity, and a goal. There are three goals (home, work, and a friend’s house), six locations, and fourteen trip segments. We learn an EC-CTBN with two variables (goal and activity) where activity is conditioned on goal. Figure 5 shows the distributions learned for travel times from home to the parking lot and from there to work. Clearly, an exponential distribution would not capture these distributions.

Figure 5: EC-CTBN approximations to travel times for two trip segments and total time to goal. (Three panels, each showing a frequency histogram and the learned distribution against Time in Seconds: Driving to School; Walking to Building from Parking Lot; Entire Trip from Home to School.)

Discussion and Conclusions

Continuous time Bayesian networks are a promising new framework for modeling dynamic processes without committing to a fixed temporal grain size. We have shown how to extend CTBNs to approximately model non-exponential time distributions using Erlang-Coxian approximations. Our method affords tractable learning, but the problem of censored data leads us to use expectation-maximization initialized by the Kaplan-Meier estimate. We further showed how to handle negative evidence in the form of observations that certain variables did not change value over continuous intervals of time. We demonstrated the effectiveness of these extensions by showing that they improved predictive accuracy in two domains: email response behavior and GPS traces of a person traveling about a city.

In our current and future work we are using EC-CTBNs to create much more complex models of a user’s work routines; for example, modeling both low-level activities such as using email or preparing documents, and high-level processes such as preparing for a meeting. These more complex models are hierarchically structured and contain both continuous-time and discrete-time nodes, further extensions to the CTBN model that we will describe in future papers.

Acknowledgements

This work was supported by NSF grant IIS-0307906, ONR grant N00014-02-1-0932, and DARPA via SRI grant 03000225. Thanks to Pedro Domingos and anonymous reviewers for insightful comments.

References

Asmussen, S.; Nerman, O.; and Olsson, M. 1996.
Fitting phase-type distributions via the EM algorithm. Scandinavian Journal of Statistics 23.

Crowder, M.; Kimber, A.; Smith, R.; and Sweeting, T. 1994. Statistical Analysis of Reliability Data. Chapman & Hall/CRC.

Dean, T., and Kanazawa, K. 1989. A model for reasoning about persistence and causation. Computational Intelligence 5:142–150.

Liao, L.; Fox, D.; and Kautz, H. 2004. Learning and inferring transportation routines. In Proceedings of the Nineteenth National Conference on Artificial Intelligence. Best Paper Award.

Neuts, M. F. 1981. Matrix-Geometric Solutions in Stochastic Models: An Algorithmic Approach. Dover.

Nodelman, U.; Shelton, C.; and Koller, D. 2002. Continuous time Bayesian networks. In Proceedings of the Eighteenth International Conference on Uncertainty in Artificial Intelligence, 378–387.

Nodelman, U.; Shelton, C.; and Koller, D. 2003. Learning continuous time Bayesian networks. In Proceedings of the Nineteenth International Conference on Uncertainty in Artificial Intelligence, 451–458.

Osogami, T., and Harchol-Balter, M. 2003a. A closed-form solution for mapping general distributions to minimal PH distributions. In Computer Performance Evaluation / TOOLS, 200–217.

Osogami, T., and Harchol-Balter, M. 2003b. Necessary and sufficient conditions for representing general distributions by Coxians. In Computer Performance Evaluation / TOOLS, 182–199.