>> Asela Gunawardana: So I'm very pleased to welcome Christian Shelton. He's an associate professor at UC Riverside. He got his Ph.D. from MIT and did a post-doc at Stanford before that. He's one of the pioneers in continuous time modeling. Christian Shelton.

>> Christian Shelton: Thank you. So my research is in machine learning, and I'm particularly interested in dynamic systems. I'm interested in all forms of dynamic systems, and for about the past ten years or so, I've been interested in models of continuous time systems, systems that are asynchronous.

So let me give you some examples of asynchronous stochastic systems. Phylogenetic trees: different species' genomes change at different rates over time. Social networks; I'll show some social network examples. What I'm going to be spending the next year on in my sabbatical: ICU patients. This is a large stochastic system that you'd like to reason about and control. Software verification has for a long time dealt with models of stochastic systems, and there are others. What's interesting about all these systems is that they evolve naturally in continuous time. There are discrete events that occur in these systems, and the rate of these events can change drastically from component to component in the system and over time. So there's no single constant rate of change in these systems.

This talk is organized in three components. First, I'm going to try to explain why continuous time is an important modeling tool. Computers themselves are actually discrete time entities, right; there's a clock that runs on your computer. But just like we use real values when we derive our algorithms, despite the fact that they're going to be implemented on a computer essentially in integer arithmetic, treating time as a continuous quantity is important. So that's what I'm going to talk about first. Then I'll talk about some of the work that we've done on models of such continuous time systems, and then I'm going to show some examples.

So here are some theoretical reasons, and then I'll have experimental reasons, why continuous time is advantageous. Consider the typical discrete time system, a Markov chain. You have a system that evolves over time; this is a very simple Markov chain, and here we have a row-stochastic matrix describing that chain. That is, the probability of staying in the first state from one time to the next is 75%; otherwise, you move to the other state. If you're in state two, you switch 50 percent of the time. Okay. So that's fine, depending on how you've described the system. But if your actual system evolved in continuous time and you just happened to be sampling at this particular rate and ended up with a matrix like this, we can ask: what would the stochastic matrix look like at twice the sampling rate, or half the window size? That means we need a stochastic square root of the matrix -- a matrix like that. In this case, that gives us that matrix there. So that's fine; this describes the same system at twice the sampling rate. Good.

So now, what if my system looks like this? The system flips back and forth. We can ask the same question again, and we end up with this matrix. And I guess if you're in quantum mechanics, it doesn't bother you. For the rest of us, we don't like having imaginary components to our probabilities.
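To make the half-rate calculation concrete, here is a small illustrative sketch (Python with NumPy/SciPy; the matrices follow the numbers quoted above):

```python
import numpy as np
from scipy.linalg import sqrtm

# The first chain: eigenvalues 1 and 0.25, both positive, so a real
# stochastic square root (the same system at twice the sampling rate) exists.
P = np.array([[0.75, 0.25],
              [0.50, 0.50]])
print(sqrtm(P))   # real and row-stochastic

# The flipping chain (with the 0.1s): one eigenvalue is -0.8, so every
# square root has imaginary entries -- no Markov chain at half the step.
F = np.array([[0.1, 0.9],
              [0.9, 0.1]])
print(sqrtm(F))   # complex entries: "imaginary probabilities"
```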
So there is no Markov system at half the rate that is equivalent to this system at this rate. And it's not because I happen to have zeroes in there; if I make them point one, the same thing happens, it's just that the numbers are messier. So there are a couple of ways of viewing this. One is that the space of discrete time Markov systems is larger than the space of continuous time Markov systems. There are systems that in discrete time are Markovian but for which no continuous time Markovian system exists. So if you're viewing your Markov assumption as, say, a regularization or a convenience, okay, then maybe this doesn't bother you. But the reason I chose this example is that even if the underlying system is truly Markovian, using discrete time you may go off and estimate a system that doesn't correspond to any Markovian continuous time system underneath.

>>: But what troubles me about this is that in real life, like social networking, I mean, if this occurred too often, you just simply reduce the [indiscernible] time and then you avoid all these problems.

>> Christian Shelton: Yeah, um-hmm, sure. Well, yeah, presuming that -- so you have to know ahead of time how small you need it to be, and then your computational time grows. I'm going to talk about the computational time and other factors, yeah. That's right.

Okay. So the other problems I want to present don't show up in this sort of flat Markovian system but show up in a structured one. So if you have N states, you need an N by N matrix to describe a discrete time Markov system. But we usually don't describe things in terms of states; we describe them in terms of assignments to variables. If you have, say, N binary variables, there are two to the N different assignments to those variables, so I'd need a two to the N by two to the N matrix. The answer is that I need some compact representation for that, because that's not tractable for any reasonable N. Decision diagrams have been used in the computer science literature; dynamic Bayesian networks are more common in machine learning and AI. But there are some problems here. I'm going to focus on DBNs, because I think that's more familiar to this audience.

So here's the simplest DBN I can have. I have two processes. Process A is a Markovian process that doesn't depend on anything else; it goes its merry way. Process B depends on A, because if it didn't depend on A, I'd have a really, truly simple system. So now let's ask the following question: what happens if I unroll that for another time step, and instead want a DBN that describes the system across two time steps instead of one? I don't like my sampling rate; I'd like some other sampling rate. So I take this one here, I marginalize out the two variables in between, and I get this -- at least if I want to describe it as a DBN, I get this structure. And notice the structure has changed. I have an extra edge here that didn't show up before. What does that mean? It means in some sense that this particular structure was not just a function of the underlying process; it's a function of the underlying process and a particular sampling rate. Okay.

Put differently, if I have this underlying structure at half the sampling rate and now ask what structure I could have marginalized to get here, the answer is there are none. Which isn't to say there isn't a DBN. There's a DBN, but its structure doesn't come out like this when you marginalize it. This independence assumption here is hidden inside the probability distributions; it's not representable in the graphical model framework. So the basic point isn't necessarily that something's wrong, but that your structure is sensitive to your time slice width. If your time slice width truly is something inherent in your process, then fine, that's great; you have a process that actually has a rate to it. But if you have a process that does not actually have some global rate, then the structure you've estimated is not some inherent property.

So those are two theoretical reasons maybe not to like a discrete time model, or to be somewhat concerned about it. Empirically, these problems are real too; if you talk to practitioners, they kind of know this. So here's the simplest example. I have a process of four variables. The first variable is a Markov process that proceeds as it wants, at a rate of approximately one. The next process tries to follow the one above it, the third tries to follow the second, and the fourth tries to follow the third. That is, if they disagree with their parent, they switch relatively quickly; otherwise, they tend not to switch. Now, I'm going to sample a bunch of trajectories from that and then try to learn back the network structure, using a DBN. So along this axis I'm increasing the number of samples I have, and along this one I'm increasing the sample width -- decreasing my rate of sampling. You obviously get a slightly different structure back every time, but these are pretty indicative of the structures you get back. If I have a lot of samples and a very long width compared to the natural rate of the process, I basically learn back a stationary distribution for the process. If I have a very fine width and a lot of data, I basically learn back the correct structure. And in between, I learn back all sorts of crazy things.

But more important is this plot. For each of those learned models, let's run this experiment a few hundred times and compare how well the model I get back predicts future data -- in some sense, a proxy for KL divergence to the true distribution. On top is if I use the correct model. This is a very fine time-slice model, this is the one at roughly the natural rate, and this is a very coarse time-slice model. The thing that's interesting here is that the correct time slice to select depends on how much data you have. Yes, that's a problem if you're going to go pick something; that's inconvenient and annoying, right. So in this data regime, you do better here. What I'm going to show you is a method that produces this line. Now, some of this is a little cheating: the model class contains exactly the one from which this data came, so it's not surprising I do well. But it has very tight error bars and beats them all, okay.

All right. So what's the alternative? Just to give some background -- I think people are more familiar with discrete time models than continuous time models. So here's this stochastic matrix again. There are a couple of ways of interpreting a stochastic matrix. Let's take a particular row; the rows all sum to one.
One view is that if I'm in the first state -- I guess I have the states labeled from zero, so state zero -- this row means that after one time step, there's an 80% chance I'll be in the same state, a 10% chance I'll be in state one, and a 10% chance I'll be in state two. Okay. The other way of viewing this is in terms of dwell times: I stay in state zero for a geometrically distributed number of time steps, and then afterwards I switch to one of the other two states, proportional to the elements in that row. It's an equivalent view of the same thing.

The alternative in a continuous time system is to describe an intensity matrix, sometimes called a rate matrix or a Q matrix, depending on what you like. This is a matrix in which all rows sum to zero; the diagonal elements are non-positive and the off-diagonal elements are non-negative. We have a similar interpretation: there's one row per state, and this row here describes what happens if I'm currently in state zero. And the two views are somewhat similar. This row means that after an infinitesimally small period of time -- that is, as epsilon goes to zero, in the limit -- the probability I stay in the same state over that period is one minus this quantity times epsilon, the probability I move here is this times epsilon, and the probability I move here is this times epsilon. That's the infinitesimal generator. Alternatively, I can view it in terms of dwell times. This says that I stay in state zero for an exponentially distributed amount of time -- the continuous version of a geometric -- with rate 0.24. And once I leave, of course, I can't come back to the same state; otherwise it means I didn't leave. When I leave, I go to this state proportional to this amount and to that state proportional to that amount. So again, there's an even chance of my going to the two.

>>: The first view that you had there -- if I take epsilon to be 10, I get a negative --

>> Christian Shelton: Yeah?

>>: This is only valid as epsilon goes to zero. So this should be the limit?

>> Christian Shelton: Yeah, it's the limit. Sorry, I didn't make that more clear, yes. That's right, it's the limit.

Okay. So now, how do you use this sort of thing? A standard question to ask of the discrete matrix is to push a marginal distribution forward in time. I have a marginal distribution represented as a row vector at time zero, and to push it forward I simply do a matrix multiplication; that gives me the marginal distribution at time one. If I want the marginal distribution at time two, I do the same thing again, which amounts to multiplying by the matrix squared, et cetera. Down here, the equivalent question -- and now I'm using this notation to note that the argument is a possibly real-valued number -- is: I have a row vector that represents the distribution at time zero, and to push forward to time T, I use the matrix exponential. The matrix exponential is this Taylor expansion here, which I'll touch on a little bit later, or alternatively it's the solution to this homogeneous linear ordinary differential equation -- sort of the most straightforward differential equation you can ask about.

So the first reaction usually is, well, that seems a lot harder; differential equations compared to matrix multiplication, that doesn't seem to be any better. So say I have a three-state system. Essentially, to solve this differential equation, I'm trying to integrate this.
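As an illustrative sketch (Python; the three-state intensity matrix below is made up, not the one from the slides), the two routes agree: computing the matrix exponential directly, or integrating dp/dt = pQ with an adaptive ODE solver:

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import solve_ivp

# An arbitrary 3-state intensity matrix: rows sum to zero,
# non-positive diagonal, non-negative off-diagonal.
Q = np.array([[-2.0,  1.5,  0.5],
              [ 0.1, -0.2,  0.1],
              [ 1.0,  2.0, -3.0]])
p0 = np.array([1.0, 0.0, 0.0])   # distribution at time zero
t = 8.0

# Route 1: marginal at time t via the matrix exponential.
p_expm = p0 @ expm(Q * t)

# Route 2: integrate dp/dt = p Q with an adaptive method (RK45 here),
# which takes big steps where the distribution changes slowly.
sol = solve_ivp(lambda _, p: p @ Q, (0.0, t), p0, rtol=1e-8, atol=1e-10)
p_ode = sol.y[:, -1]

print(p_expm, p_ode)   # agree to within the solver tolerance
```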
So this is just the derivative of that over time. I have a distribution here at time zero, and I'm trying to get the distribution at, say, time eight. Computationally -- not to write the algorithm down, but to have the computer actually run it -- this can be a lot simpler. Why? Because I'm not going to do this integration by some standard Euler integration. I'm going to do it by an adaptive integration method, in which I take an estimate of what the derivatives and curvatures are here and decide how far I can jump ahead. In time periods when these distributions are changing drastically, I will spend a lot of computational time to estimate very carefully what goes on. But in time periods where things are not changing very much, I will adapt my integration step size and take large jumps. So computationally, I can get by with many fewer steps for the same accuracy than I would if I just treated it as a discrete system and sampled at some particular rate. I'll touch on that later. These are the Runge-Kutta-Fehlberg type methods: adaptive integration methods where you take a bunch of derivatives near your point, see how far and how fast you can go without increasing your error by too much, and then take an adaptive step size.

Okay. So what we want to do is build models like this for systems that are described in terms of variables, not in terms of a flat state space like I was using before. So just to set up a little of what we're talking about: I'm going to talk about a factored model, continuous time Bayesian networks, which is the factored model that we developed. There are some others from the verification literature -- Petri nets and things like that. They tend to be very focused on steady-state distribution properties and not on learnability and estimation from data. So I don't want to give you the impression that this hasn't been worked on before, but that's the work I'm kind of ignoring here.

So basically, I'm trying to describe a distribution over trajectories. A trajectory would look like this. If I have three variables, the variables start at particular values, and then asynchronously, at various real-valued times, they switch. This one switches from light green to dark green here, then shortly after, this one switches from orange to red, and this one switches from dark blue to light blue, et cetera. I'm trying to describe a distribution over this. A particular sample trajectory can be described by a finite but unbounded number of switches: the real-valued times at which the switches happen and the state after each switch.

The evidence I might care about might look something like this. It's the same trajectory; I've just removed the parts I didn't know about. At various instants, I might know the value of certain variables, just for an instant -- we call that point evidence. For periods of time, I might know that it was green solidly from here to here and dark blue solidly from there to there -- we call that interval evidence. Over some periods of interval evidence, I might actually observe transitions, so I know that a transition happened here, and here I know that no transition happened. There are other kinds of evidence you might have; you might know that between here and here it transitioned only once, but you don't know exactly when. Things like that. You can incorporate all of those into this kind of evidence model.

>>: A discrete time model can do the same thing, right?

>> Christian Shelton: It depends what you mean.

>>: Factored models have done that, right? You know, different dimensions. [Indiscernible] otherwise, you --

>> Christian Shelton: Sure, so, I mean, a DBN is an example of a factored discrete time model. Yes.

>>: And then that actually can model the same thing, or maybe this --

>> Christian Shelton: Yes, right, so you certainly are modeling trajectories. Whether or not you view that you've captured everything -- if I know it's light green here and at this point I know it's dark green, do I know it transitioned only once in between, or two or three times? A discrete time model does not tell you what happens between those time points.

>>: It's just a precision issue, how precisely you want to represent the transitions.

>> Christian Shelton: Right. And the more precisely you want to represent it, the more computational time you're going to take to propagate across a particular unit of time. If I want to use a delta T of, you know, 0.001, then to propagate across one unit of time, I have to propagate a thousand times. That's right, yes.

>>: [Indiscernible] these models. So this is a factored model?

>> Christian Shelton: I'm going to build a factored model. A factored model essentially means the state at any time is an assignment to variables, right. And I'm saying, as an example, in continuous time what that looks like is not "at this time I have this, at this time I have this," but continuously over time, a trajectory looks like this. Did that answer it?

>>: Yes.

>> Christian Shelton: Good. So a CTBN is built on a graphical model framework. It is a graphical model. Each node is a process -- not a random variable, but a whole process. A Markov -- well, no, kind of a Markov process. Edges represent instantaneous influence. So the simplest one I can give is this: I have a process A and a process B. Process A proceeds without caring about anybody else, and process B depends on process A. So what do I need to describe? Process A is therefore a Markov process, so in addition to its starting distribution, which I'm ignoring for this talk, I have to describe its rate matrix. There's an example of a rate matrix, chosen arbitrarily. And for process B, I have two rate matrices. At any given instant, its rates of change are governed by the state A is in at that instant. So if A is in state zero, this is the set of transition rates for B, and if A is in state one, these are the transition rates for B.

>>: Just to make sure, you didn't draw a self-edge from A to A, so --

>> Christian Shelton: Right, so a self-edge is always implied. If the state of A did not depend on its state the instant before, then I don't know what that would mean.

>>: You just run it, right?

>> Christian Shelton: Literally, instant by instant? Then I have an uncountable -- so, right. I mean, that's the equivalent of true white noise, right, which has infinite power. And so, yeah, I don't mean that, right. Yeah.

>>: So self-edges are implied.

>> Christian Shelton: Yes, that's right, yes. Any variable that has some meaning has some continuity to it, even over a very small interval. That's right. Good.

>>: Each [indiscernible] represents a process.

>> Christian Shelton: Right, this node is the whole process, that's right.
>>: A continuous time Markov process.

>> Christian Shelton: That's right. And the whole thing together also represents a continuous time Markov process over the joint space of the two.

>>: Switching continuous time.

>> Christian Shelton: In some sense, yes; the rates switch based on this here, that's right.

Okay. So this whole thing describes a joint Markov process over the state space of A and B. And just to give you some idea of what the semantics look like, that means I should be able to build a rate matrix over the joint assignments -- A0-B0, A0-B1, et cetera. And I can do that in a fairly straightforward way. First of all, no two variables are allowed to change at exactly the same instant. This is pretty natural if you think of it as: two events can't happen at exactly the same time. They can happen arbitrarily close together, but not at exactly the same time. In this particular example, that's the anti-diagonal, but in general there are more zeroes than that: in any case where this assignment and that assignment disagree on more than one variable, the rate is zero. If they disagree on A, then I just look it up. So, you know, this is the rate of transitioning from A equals zero to A equals one, so that goes here, because A differs and B stays the same, et cetera. And, for instance, when A is zero and B changes, I can look up those rates from the green matrix, and when A is one and B changes, I can look them up from the blue matrix. In that way, I fill in everything except the diagonal, and the diagonal I fill in just to make the rows sum to zero. So that's, in some sense, the semantic meaning behind this -- or one way of viewing the semantics. Now, I don't want to construct this matrix in general, because it's exponentially large in the number of variables, but you can at least theoretically think about having constructed it.

>>: So [indiscernible] is there an efficient algorithm that can detect the switching between A and B?

>> Christian Shelton: What do you mean, detect the switching?

>>: You're running process A and running process B. So you don't know, you know, the observation. So I'm just --

>> Christian Shelton: I haven't talked yet about what you do with it. I'm just talking about a formal definition of a joint process. Then we can talk about what sort of questions you might want to ask of the process in a bit.

So this is the general equation. It essentially says the same thing. These are joint assignments: if the two joint assignments differ on only one variable, you just read the rate off from the relevant local rate matrix for that variable. The diagonals are these particular sums, and everything else is zero.

I want to point something out here. If you have N binary variables, this joint matrix has two to the N rows and columns, and each row has order N non-zero elements. So my original description is more compact than your standard sparse matrix representation: a sparse matrix representation contains at least one bit of information for every row, whereas this description has a polynomial amount of information per variable. It's exponential in the in-degree of the graph, but polynomial in the number of nodes for bounded in-degree -- just like a standard Bayesian network.
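Here is a sketch of that amalgamation for the two-process A-and-B example (Python; the local rate matrices are invented numbers, not the ones from the slides):

```python
import numpy as np

Q_A = np.array([[-1.0,  1.0],
                [ 2.0, -2.0]])          # rates for A (no parents)
Q_B = {0: np.array([[-0.5,  0.5],
                    [ 1.0, -1.0]]),     # rates for B while A = 0
       1: np.array([[-3.0,  3.0],
                    [ 4.0, -4.0]])}     # rates for B while A = 1

# Joint states ordered (A,B) = (0,0), (0,1), (1,0), (1,1).
Q = np.zeros((4, 4))
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    i = 2 * a + b
    Q[i, 2 * (1 - a) + b] = Q_A[a, 1 - a]        # A flips, B unchanged
    Q[i, 2 * a + (1 - b)] = Q_B[a][b, 1 - b]     # B flips, A unchanged
    # Entries where both variables would change stay zero.

np.fill_diagonal(Q, -Q.sum(axis=1))              # make each row sum to zero
print(Q)
```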
So here's a classic example from our first paper. It's purely synthetically generated, but it shows that cycles are okay: whether or not I'm eating affects whether or not my stomach's full, which affects whether or not I'm hungry, which affects whether or not I'm eating. That's fine. Edges here have a causal interpretation -- we can argue over exactly what form of causality it is. It's certainly Granger causality; whether it's a stronger notion of causality, well, we can argue about that offline.

D-separation still holds, like in Bayesian networks: a variable is independent of its non-descendants given its parents. And the similar notion of a Markov blanket exists: you're independent of everything else given your parents, your children, and your children's parents. The thing you have to remember is that your children and your parents may be the same nodes, because you have cycles in the graph. But if the notation and graph theory worried you, the other parts should be worrying you a lot more, okay? The notion of "given" here means the entire trajectory. So how does this work? Concentration is independent of hungry, given full stomach -- but only if I know the entire trajectory of full stomach from zero to whatever time point I care about. If I only observe part of it, that's not true. And this is a little like a hidden Markov model -- well, it's harder to see in a hidden Markov model. You just have to observe the whole thing, is what I can say; otherwise you don't have a full observation of this variable.

The other important part here is that marginalization does not produce a Markov process. So uptake is a Markov process; it doesn't depend on anything else. If I try to marginalize it out and incorporate it into concentration, the result is not a Markov process. This is like discrete time: if I have a hidden Markov model, with the X states and the Ys that come off of them, and I try to marginalize out the Xs, the distribution over the Ys is not a Markovian process. That's the whole purpose of having a hidden Markov model, okay? The same thing is true here. If I marginalize out this variable, the description I'm left with is not a Markov process anymore. In fact, if I keep marginalizing things out, the size of the description grows exponentially.

Okay. So this is a member of the exponential family, like all good distributions, I suppose. The sufficient statistics are, for each variable, for each value its parents can take on, and for each pair of values it can take on, x_i and x_i': the number of times in the trajectory it transitioned from x_i to x_i' while its parents had value pa_i, and the amount of time the variable spent in state x_i while its parents were in that instantiation. Those two things are sufficient statistics, and you get this form that is linear in the sufficient statistics and the parameters of the distribution. So this is a sum over every variable, every instantiation of its parents, every instantiation of that variable, and every other instantiation of that variable, x_i' not equal to x_i. Oops, wrong button. Okay.

So the other questions you might be interested in as a machine learning person are how you can learn such a process -- how you can estimate the parameters. Let's assume I give you the structure and I just want you to estimate the local rate matrices, the Q matrices. With complete data, that's trivial: basically, you have a bunch of multinomial distributions and a bunch of exponential distributions, and you just read off the parameters from the sufficient statistics. It really is quite trivial.
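Concretely, with complete data the maximum likelihood estimate of each rate is just a ratio of the two sufficient statistics: the transition count divided by the time spent, q(x to x' | u) = M[x, x' | u] / T[x | u]. A minimal sketch (Python; the toy counts at the bottom are invented for illustration):

```python
import numpy as np

def mle_rate_matrices(M, T):
    """MLE of conditional rate matrices from complete data.

    M[u, x, xp] -- number of transitions x -> xp while parents were in u
    T[u, x]     -- total time spent in x while parents were in u
    """
    n_pa, n_states, _ = M.shape
    Q = np.zeros_like(M, dtype=float)
    for u in range(n_pa):
        for x in range(n_states):
            for xp in range(n_states):
                if xp != x and T[u, x] > 0:
                    Q[u, x, xp] = M[u, x, xp] / T[u, x]
            Q[u, x, x] = -Q[u, x].sum()    # rows sum to zero
    return Q

# Toy counts: binary parent instantiation, binary variable.
M = np.array([[[0, 4], [6, 0]],
              [[0, 9], [2, 0]]], dtype=float)
T = np.array([[8.0, 2.0],
              [3.0, 10.0]])
print(mle_rate_matrices(M, T))
```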
>>: You have to know when the switching occurs from one --

>> Christian Shelton: That's right. It depends. If you have a system in which you have complete data -- that is, I observed all variables at all times -- then this is trivial.

>>: But the example you gave --

>> Christian Shelton: Which example? [Inaudible] maybe. I don't know. I might know those, I might not. I'll cover that case in a moment.

So the structure search here is also particularly simple. Unlike a Bayesian network, in which learning structure is a somewhat difficult process, that's not true for a CTBN, because cycles are okay. The whole thing that makes Bayesian network structure learning difficult is that you can't allow cycles; therefore you have to search over the set of acyclic graphs, which is not a nice set to search over. I don't have to search over that set here. So if I bound the in-degree of my graph, there's a polynomial time algorithm for finding the best graph, and I find the global maximum, because I can consider each variable's parent set independently and just optimize it independently. In fact, you could do that for a Bayesian network too if you allowed cycles.

For incomplete data -- that is, there's at least some time point at which I didn't observe some variables; there might be variables I never observed, or variables I didn't observe for specific periods of time, or I might have only sampled at some regular rate and not know what happened in between -- then to learn parameters, you use expectation maximization. EM works: I have to estimate the expected sufficient statistics from up there, and I'll talk about that a little in the next slide. And actually, structure learning is not too bad either. Structural EM works for Bayesian networks too, but it's a little more of an art. It's not so bad here, mainly because the structure search step is exact: I get the global optimum, so I don't have to worry about as many things -- am I running off here having not found the global optimum, how do I trade off iterations of the structure search versus iterations of my E-step, things like that. You don't have to worry about that as much. Not to say that you couldn't, but you don't have to.

Okay. So how about inference? This is the task of: I give you a partial trajectory, and I want to infer in some sense what happened when I wasn't looking -- or where I wasn't looking. So I think I mentioned before that marginalization produces non-Markovian processes. So you can't just do a variable-elimination-style algorithm, because your representation size will grow without bound as you do that. Furthermore, if I'm trying to do filtering, I can't just push the distribution forward over time, because as I push over any interval, suddenly all the variables become tied together -- just like what happens with entanglement in a DBN. So actually, these things happen in discrete time models too; in DBNs, it's just that they aren't as apparent. DBNs look like Bayesian networks, and Bayesian networks are nice this way, so they look a little better. But when you actually start working with DBNs, you find you have all these same problems again. So it's not like I've introduced new problems; I've just made them obvious from the beginning.

Okay. So you probably have a favorite approximate inference algorithm. Hopefully, it's on this list, and somebody has done it for continuous time Bayesian networks -- that's all I'm going to say. Expectation [indiscernible], importance sampling, particle filtering, Gibbs sampling, general Markov chain Monte Carlo, mean field, belief propagation recently, and then this one's a little bit special too, continuous time. I don't have time to cover all of those, and you don't have the patience to listen to all of those, I assure you.

>>: [Indiscernible].

>> Christian Shelton: I don't know if that makes inference any easier. The structure learning is simpler because of that, because I don't have to worry about cycles in the graph. But if I'm given a graph and the parameters, and I'm estimating what happened when I wasn't looking, I don't know if the cycles make things easier or worse. I don't think it changes much.

Okay. So I want to get into a little bit behind this one and this one, because I think they show some interesting things about continuous time processes, and I'm going to show them at a fairly high level. One of them's mine and one of them's not -- that makes me feel sort of more, you know, egalitarian about this.

So the first one is mine, and it's filtering. I've now decided to turn the time axis on its head. Here I have three variables. What does filtering look like? Filtering is: I want to maintain a distribution over the state of the system given everything I've seen thus far. So I start with some distribution over where I think the system started; I'll represent it like this. Then at some real-valued time later, I observe this state's blue and that state's red. So what do I do? I need to propagate this distribution forward, and then I need to condition it on the evidence. This is standard propagating forward. I do have animations -- who put that in there? Then later I'll observe something else, I'll propagate that forward, then condition, and continue on. I might have some other evidence; I'm only going to talk about point evidence here. This works for non-point evidence, but let me just say it works and move on.

There are a couple of things to note here. The conditioning is standard distribution conditioning: you have some distribution, you disallow certain values and renormalize. The propagation is by the matrix exponential: if I represent this as a joint vector, I just multiply it by the matrix exponential and I'm good. So this is the step I want to concentrate on. I'm not going to calculate the matrix exponential directly; I'm going to instead calculate its premultiplication by a vector, because it's more numerically stable -- much like it's better not to take a matrix inverse, but instead to solve a linear system for the particular thing you were going to multiply the inverse by.

So the question is how you compute that. Moler and Van Loan have this great paper called "Nineteen Dubious Ways to Compute the Exponential of a Matrix." In fact, it's such a good paper that 25 years later they wrote "Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later," okay? If you're really interested, you should read it. It basically says there's no good way to calculate the matrix exponential; it's just not one of those computations that's amenable. The Taylor expansion is the most obvious thing, and it's unstable. Why is it unstable? The Q matrix has negative diagonal elements [indiscernible].
So I have a Taylor expansion that alternates signs, and we know you don't want to estimate something with a Taylor expansion that alternates signs. So I'm going to show you how to use uniformization to solve that. There are some other methods -- Krylov subspace approximations and direct integration, which we've played around with -- but I'm going to build this off the Taylor expansion and uniformization. So let me talk about that quickly; I think the idea is interesting.

I'm going to take my continuous time system and convert it into a discrete time system. But there are a couple of different discrete time systems you might be thinking of. I could choose to essentially discretize time at some rate and calculate the equivalent -- okay, I'm not going to do that. The other is the embedded Markov chain: I don't care when things happen, I just care what sequence of events happened. I'm not going to do that either. I'm going to build a different one. I'm going to let my Q matrix be equal to some scalar -- that doesn't look like the -- yes, that's correct -- some scalar alpha times a stochastic matrix M minus the identity matrix. Or, put differently, I'm going to build a stochastic matrix M by taking Q, dividing it by the scalar, and adding the identity matrix. Provided my scalar is greater than or equal to the absolute value of the biggest element in the matrix, the resulting M is a stochastic matrix. It amounts to a system in which I sample times from an exponential distribution with rate parameter alpha, and at each such time I sample the next state of the system from the stochastic matrix. I can have self-transitions in that stochastic matrix, because it might be that that wasn't enough time to actually generate the next event. Okay. I didn't come up with that; that's old.

So now, if I have P times e to the QT, that can be broken up like that. In general, this is a sum of two matrices, and in general e to the A plus B is not e to the A times e to the B, alas. But if they have the same eigenvector structure, then it is, and the identity matrix has any eigenvectors you'd like, so these two split [indiscernible] like that. So this is a scalar, this is a scalar -- just precompute that. And this part here I can do with the Taylor expansion of M. Now M is entrywise nonnegative, so this doesn't have alternating signs, and I'm okay. So far so good. That's great.

So the essential calculation is: the first element is P, the next element is P times M, the next is P times M squared, then P times M cubed, and so on. So I essentially need to compute this. Remember, M is big: M is 2 to the N by 2 to the N, so I don't want to do that. In fact, even if Q has compact structure -- and M will have the compact structure -- multiplying by it will destroy any structure that might have been in V. You might have had some nice structure in V; multiply by M and it will essentially destroy it. There are a number of methods in the Markov chain literature that deal with this using a sparse representation, which is great for tightly coupled systems. I want to use a factored representation, one more similar to, say, the Boyen-Koller (BK) algorithm for Bayesian networks, and that's good for more loosely coupled systems. So let me show you basically what happens. I have P, and I want to compute P times e to the QT.
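Here is a sketch of the plain, unfactored version of that computation (Python; the small Q is invented, and K truncates the Taylor series). The factored algorithm described next additionally projects the vector between the multiplications:

```python
import numpy as np

def expmv_uniformized(p, Q, t, K=200):
    """Approximate p @ expm(Q t) by uniformization.

    Writes Q = alpha (M - I) with M stochastic, so
    p expm(Qt) = exp(-alpha t) * sum_k (alpha t)^k / k! * p M^k.
    All terms are nonnegative -- no alternating signs.
    """
    alpha = np.max(-np.diag(Q))          # must dominate the diagonal
    M = Q / alpha + np.eye(Q.shape[0])   # stochastic matrix
    v = p.astype(float).copy()
    out = np.zeros_like(v)
    log_w = -alpha * t                   # log weight of the k = 0 term
    for k in range(K):
        out += np.exp(log_w) * v
        v = v @ M                        # premultiply: never form expm itself
        log_w += np.log(alpha * t) - np.log(k + 1)
    return out

Q = np.array([[-1.0, 1.0], [3.0, -3.0]])
p = np.array([0.9, 0.1])
print(expmv_uniformized(p, Q, 2.0))      # compare: p @ scipy.linalg.expm(2 * Q)
```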
Here's an exact way of doing that: if I have infinite computation time, I take P and multiply it by M, multiply that by M, multiply that by M, and then I sum all those up with the appropriate weights from my Taylor expansion. There we go. Okay, I'm not going to do that. Since I've been filtering for a while, I don't have an exact answer anyway, so I start with some approximate answer here. I multiply it by M, but the result is going to be too big to represent, so before I even compute the result, I project it back onto the space of distributions that are completely factored. Then I continue: multiply and project, multiply and project, all the way down, and I sum all of those up.

And the question is: this is what I wanted to compute, this is what I started with, and this is what I actually computed. Can I say anything about how these two things relate to each other? Can I bound the error? The error shows up in three places. I'm going to talk about KL divergence error, because the KL divergence error, in expectation, goes down as you condition on things, unlike the L2 or L1 error, and I'm going to be conditioning. So there's some KL divergence I started with here. This step is an approximation of that step, so I introduce some error there -- and I introduce that error multiple times. I sum up a bunch of these things, which also introduces some error. And I didn't run this for the infinite number of terms I was supposed to.

The saving grace is that M is a stochastic matrix, and any stochastic system, over time, tends to mix -- that is, it loses its memory. That means that as I multiply these things together, if I start with something that's approximate, then over time I'll actually converge towards the same thing. If I let a stochastic system run for a long time and it's ergodic, I'll end up in the stationary distribution; I've forgotten where I came from. So that means that if I propagate through M, there will be some -- that's supposed to be a subscript -- there will be some contraction rate by which my KL divergence shrinks. And the projection error here can be bounded by a constant. So essentially, the good news is that if I have a contraction rate and a constant error added at each step, the geometric series converges.

So the complete bound looks like this. Lovely, isn't it? Yes, okay. Let's see -- you don't want to see the proof? Okay. I started off with the KL divergence I began with; that contracts by some global contraction rate -- this is gamma prime, not gamma, and I'll explain the difference in a moment. There's an additive factor here, and then this last term just comes from the fact that I truncated the Taylor expansion. That term, in practice, is very, very small, at least if you're willing to spend a little computation on it.

So there are basically two questions here. The first is: what's gamma prime, and why is it not gamma, the contraction rate for the whole thing? And the second is: why can't I just use the Boyen-Koller analysis for DBNs, which essentially does a similar thing? Let me see if I can quickly say what's going on. You can think of each local variable as having its own contraction rate, and we build off of that. The reason I can't use the DBN analysis directly is that when I do this uniformization, I don't end up with a DBN; I end up with a mixture of DBNs, and Boyen-Koller doesn't exactly apply there. One interesting thing is that the per-step contraction rate scales as one over N in the number of variables, which is actually not good. But if I take the entire process of pushing this forward, the whole-process contraction rate does not scale with the number of variables; it's constant. The details are in the paper.

Okay. So let me talk about somebody else's work -- yes, I'm good on time. So, mean field. This is the other method: I take this distribution -- this is a distribution over all the processes -- and I approximate it in a factored form, as a product over a set of local distributions. In their work, they represent each of these Qs as an inhomogeneous Markov process. There are a number of different ways of parameterizing inhomogeneous Markov processes; this is the one that works for them. Mu at a particular time is a vector: the marginal distribution at that time. The other natural thing would be the local Q matrix at that time, which is a function of time -- that's why it's inhomogeneous. But instead, what they use is something a little different; it's sort of like the density of transitions. I'm not going to get into exactly why, but it's certainly related to Q.

So the algorithm is: you have a bunch of Qs; you hold all but one constant, pick that one, and try to minimize the KL divergence between your approximation and the true distribution. It's a variational approach. Then you work through a lot of math, and what do you get? Your new mu_i -- I'll just do the mu_i, I won't do the gammas -- your new mu_i at a given time is some function of the current mu_i, the gammas you've already computed, and the processes in your Markov blanket. So why is this good? This is a differential equation I have to solve, but that's good, because again I can use an adaptive integration method. I pull out this particular process, and to estimate its distribution, I do adaptive integration: at certain times I take large jumps, at other times I slow down and step very carefully. This also means that each variable has its own adaptive integration associated with it. Some variables I can reason about very quickly; other variables I take time and reason about carefully. And that's good for most systems: I have the weather, which evolves at a much slower rate than the traffic I'm trying to estimate, which is slower than the actual individual vehicles on the road. So this representation ends up being naturally adaptive, by variable and by time, and you save computational effort. Now, I'm not saying you could not do this in a discrete time model. But I think it would be much harder to figure out how to do it; it wouldn't be as natural to reason about -- am I going to jump four time steps ahead, or five? You could do it, but it's not as natural, and certainly you'd have to take integer jumps.

So let me talk about two ways in which we've applied this. The first is network monitoring. So I have a bunch of computers.
They're hooked up to a network, and what I want to do is put something on the NIC here that analyzes the packets coming in and out and tells you whether or not you currently have some malicious software running on your laptop. Okay. So I'm going to build a particular CTBN. I'm not learning this structure; I'm fixing it. Essentially, I'm going to take the traffic and separate it by destination port. We'll assume these are not servers; these are clients. So all my web traffic to 80, my web traffic to some alternate port, my DNS traffic, et cetera, by port. I think we pick out the top ten ports -- or nine ports and one catch-all for everything else. Other than separating by port, I'm not going to care about anything except the exact timings. So I'm not looking at payloads -- nothing other than destination port numbers.

I'm going to build this as a plate model. I assume that the traffic in general from your computer is generated from some hidden node that has four states. I'm not going to give any semantic meaning to those; they're just states that can couple things over time. For each port -- there are N ports -- there's a hidden variable dictating how that port's traffic is being generated. And hanging off each of those hidden variables I have four variables: one to indicate a packet came in, one to indicate a packet went out, one to indicate a connection was started, and one to indicate a connection was stopped.

>>: The first two, the precise time when the packet is observed?

>> Christian Shelton: That's right. So these are timing events here: packets in, packets out.

So we looked around and found two datasets. The MAWI dataset is some Pacific backbone data that comes from Japan. For this study, we took the ten most active IPs, and we assumed that that's all the traffic being generated from each IP -- which is clearly false for this dataset: anything that stayed within Japan we didn't see, and anything that went somewhere else in Asia we probably didn't see. LBNL has some enterprise traffic. I don't know what enterprise network it's from -- it might be LBNL's own network; they might have gotten it from someplace else, I can't remember. That was captured at the routers inside the network, so it's probably a reasonable approximation of everything that happened for those hosts.

We split the data 50/50, I think. We trained on data assumed to be clean, so we built a model of the normal traffic that comes out of each computer. Then we took the test set data, and at certain periods of time we inserted worm traffic -- from actually running a worm, gathering the traffic that comes off, and splicing it in. Now, these worms are pretty easy to find: they tend to just spam a bunch of packets. So we scaled them back down to one percent or 0.1 percent of their natural running rate so they blend into the background, making it a more difficult problem. Then, over a sliding window of 50 seconds, we calculate the probability under our model of that 50-second window of events, conditioned on everything seen thus far (there's a sketch of this scoring loop after the results). If that probability is too low, we say that's abnormal; something strange happened in this window.

Okay, so here are ROC curves. These are the two datasets, and these are three different worms of various forms. Our line is the black line that's on top -- I'm shocked. So that's our model. Notice the false positive rates here go from zero to 0.1, and the true positive rates go from zero to one. We compare against a number of other standard machine learning techniques. This dashed green line, which you may or may not be able to see, is nearest neighbor based on some features proposed in the networking literature -- it was actually a paper in the networking literature. The one that actually beats us at one point is connection counts: just count the number of connections, and if it's too big, definitely something strange is happening. This one is a Parzen window density estimator, built on the same features as the nearest neighbor method, and the purple one is an SVM with a kernel designed for this sort of anomaly detection.

So that's an example of using these models to detect anomalies in network traffic. We've also used them to detect where traffic came from. We took the same ten hosts, took a 50-second window of traffic, and asked under which host this window was most probable. So imagine they all sit behind NAT, right; we can fairly accurately say which host it came from. So we can do host identification.

>>: Can you explain why for this specific worm you --

>> Christian Shelton: Yeah -- why the LBNL data and the MyDoom worm? Yeah, we looked at it. It wasn't entirely clear to us; I agree that's strange. And why did we want to know? Because then we could improve our method, right -- what is it we don't know? It wasn't clear exactly what that combination was doing, because you notice if you change either of the two dimensions alone, you do fine. There were millions of packets; we couldn't go look at them all. So at least initially, we don't know.

>>: If you use the [indiscernible] discrete time version, do you approach something close to this?

>> Christian Shelton: Sure. The CTBN is the limit of the DBN as the time width goes to zero. But it's computationally more efficient: as the time width goes to zero, the DBN's computational time blows up.

>>: [Inaudible].

>> Christian Shelton: Here, you mean like this? The answer is essentially that as you vary the threshold, you suddenly grab a bunch of the traffic, and that's all you can get. Does that make sense? For some of the time windows, the same threshold will instantly push you across them. That wasn't helpful. The threshold is on the probability of that window, right. So if I drop the threshold over here, the probability is really, really -- low, sorry, really, really low. As I increase that probability, basically I move from here to here. There aren't very many operating points in between.

>>: But a simple computation, what [indiscernible] comparable complexity?

>> Christian Shelton: Yeah, that's a good question. We didn't do that. One thing you'll notice is that you have to use an approximate inference method here -- we're using a blockwise particle filter, actually, so that's another one you can use. And if I went off and implemented the same thing in discrete time, you could have qualms about how I chose to implement that particular version; I don't know what the equivalent one is. But one thing is that the rates here really do change drastically. The computer's off: the rate of events is very, very slow. Then the computer comes on: the rate of events is very, very fast. So you'd have to have a pretty small time slice to capture a lot of the stuff that happens here, because there are times when I'm really capturing every packet, whether they're only microseconds apart, or milliseconds at least. And if I wanted to time slice at that width, this would be intractable in a DBN.

>>: The question is [indiscernible] is the performance approaching this performance?

>> Christian Shelton: Yes. As I said, if I took the time slice width to zero -- and I don't have that much computation time this year, but if I did -- I would get these results. The CTBN truly is the limit of the DBN as the time slice width goes to zero.

>>: It's very important to know whether the continuous time really pays off under comparable computational --

>> Christian Shelton: What I'm saying is the only comparable version I have is to time slice at the smallest width between two events. I have events: if I have a packet emission and another packet emission, and I want to capture them both in the DBN model, I have to sample at that narrowest width. And if I sample at that rate, there's no way I can compute this in a year or two.

>>: You increase the sampling rate, but you aggregate between the sample times.

>> Christian Shelton: So I can do that -- I can try to aggregate. But then I have a DBN that's a little different, right. Then I'm saying it's Markovian in the number of events that happened between here and here, and you have a different model than I have. So if you're talking purely about a computational comparison, I can't make it, because you're saying it's Markovian in the number of events in the past time width, and I'm saying it's Markovian in this global state. You really have a different kind of model. This is the hard part about comparing the two: if I time slice finely enough, I can't compute the DBN one, and if I don't -- if I do something like that -- now we have different Markovian assumptions, and we have other problems comparing.

>>: [Inaudible].

>> Christian Shelton: Yes, all of this was taken from one week.

>>: So I don't understand -- the traffic at different times of the day might be very different. How would you detect that?

>> Christian Shelton: So that's why the hidden variable is here. We don't do anything about it explicitly; we're hoping the hidden variable captures that kind of semantic meaning. That is, from the past window, my current state will have some estimate of, let's say, G, that captures the fact that it's currently, you know, Monday afternoon, and things are different on Monday afternoon.

>>: So were four states enough to capture --

>> Christian Shelton: I'll say four states were enough, and it was also few enough that we could do the computations. So it was this balance between expressiveness and computational power, yeah. Now, again, this is traffic across, I think, a week -- it might have been even shorter than that. So I don't want to claim some broad thing that this would work across months or something like that, yeah. This was also done about four years ago, when our ability to do approximate and exact inference wasn't as good, so I think we could crank these numbers up now with better numeric algorithms, yeah.
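To make the detection loop concrete, here is the minimal sketch of the sliding-window scoring promised above (Python). The `window_log_prob` scorer is a hypothetical stand-in for the model's conditional window likelihood -- CTBN filtering in the talk -- and the threshold is arbitrary:

```python
def detect_anomalies(events, window_log_prob, window=50.0, threshold=-100.0):
    """Slide a fixed-width window over timestamped events and flag
    windows whose log probability under the model is too low.

    events          -- time-sorted list of (time, event) pairs
    window_log_prob -- hypothetical scorer: log P(window events | history)
    """
    flagged = []
    if not events:
        return flagged
    t, t_end = events[0][0], events[-1][0]
    while t < t_end:
        history = [e for e in events if e[0] < t]
        in_win = [e for e in events if t <= e[0] < t + window]
        if window_log_prob(in_win, history) < threshold:
            flagged.append((t, t + window))   # something strange happened here
        t += window
    return flagged
```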
So I think we can crank these numbers up now with better numeric algorithms, yeah. So last one is social networks. So a lot of people look at static social networks. In fact, a lot of really smart people looking at static social networks. So I don't do that, because I don't want to compete with really smart people. So actually, a lot of smart people are looking at dynamic social networks too, but there's fewer of them. So here's the idea is that I'm monitoring the communications, let's say, in a social network. Either I see people's emails or I see people's phone calls or I see people's Facebook postings or whatever it is, okay, depends on what institution you live in, which is a reasonable 26 model. And what I want to do is I want to estimate the changing underlying social network. Okay. All right. So what we do is we basically build a generative model of the social network of the actors' internal parameters and of the observed communication patterns. We take that model, we conditioned on the observed communications we actually saw and we try to reason about what the social network might be. So we call this the hidden social network model. It's built on some work in sociology so sociology has been looking at social models for a long time. And they even have continuous time Markov models of how social networks might change. We took the one from Snijder's. It's the network attribute co-evolution model. So it essentially says that the network evolves. So links between two people change based on the attributes of those people. So if I smoke and you smoke, then chances are we'll form a -- there's a higher chance we'll form a friendship than if not. And my internal attributes, like whether or not I like football might change based on whether or not my friend like football. Okay. So the network attribute co-evolution model broadly looks like this, two kinds of variables, YIJ is whether or not there's a directed link from I to J at a particular time. And ZI is whether the attribute of actor I in a given instant. Yep. >>: How do you define the network that there is [indiscernible] so definitely if we talk now, there is ->> Christian Shelton: So I'm going to define the model, but I'm not going to observe that variable. Make sense? >>: So how would you verify it? >> Christian Shelton: So I'll talk about the verification in a moment. It gets a little tricky, yeah. Yes, in fact, I'd love to have a better dataset in which to do it. But I'll show you what we can do. Okay. So the model from Snijder's, best described as sort of a forward sampling model. Every actor has a rate of change. When their rate comes up, 27 you know, the event fires. They look at their current network and their current attribute, their local like who they're friends with and their local attribute and they consider any unilateral change. So I make or destroy one friendship, or I change my attribute by one value. >>: This is the [indiscernible]. >> Christian Shelton: >>: This is continuous time. So one person will be one node? >> Christian Shelton: That's right, yeah. So I compute those utilities. Some are bigger than others. I put them essentially into a Boltzmann distribution, which is basically a soft max, and I pick the one that's essentially, including the one I'm currently in, essentially soft max. So if I'm currently in a local minimum or maximum in this case, I tend not to move away from it, but I might. That's the model. The only question is what does this utility function look like. 
He essentially proposes it should be a linear function of some things, and the ones we use are popularity -- the number of links to you -- similarity of your attribute to your friends', stuff like that. Okay? All right. So essentially, I have one variable for every possible link in this network, so there are N squared variables. Recall I have just ten actors; that's roughly 100 variables. They're all binary. That's a state space of two to the hundredth. So I'm definitely not representing this thing exactly. All right. And then to add communications, okay, there's a CTBN that describes the relationship between these. It's kind of hard to describe -- it essentially involves context-sensitive independence -- so I'm not going to describe it, but the social model I just described essentially amounts to a CTBN. So what I do is add a communication variable here, and it's tied only to these two. So this is the communication pattern between I and J, and it depends only instantaneously on whether or not I considers J to be a friend and whether or not J considers I to be a friend. So this might have, you know, a few states, like they're calling each other, they're not calling each other, one sends a text message, one sends an email -- that sort of thing. So you have a number of states about what the communication is at any given instant. These change fairly rapidly. These change certainly much less rapidly over time.
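My own illustration (not the talk's code) of that dependence: the pair of friendship indicators (Y_IJ, Y_JI) selects the intensity of the communication process between I and J. Every number below is a placeholder, in events per day.

```python
# Sketch: friendship indicators (y_ij, y_ji) pick the communication rates.
# All rates are placeholder values in events per day.
import numpy as np

rng = np.random.default_rng(1)

start_rate = {(0, 0): 0.01,    # neither considers the other a friend
              (0, 1): 0.05, (1, 0): 0.05,
              (1, 1): 0.30}    # mutual friends: contact every few days
p_call = 0.8                   # given a contact, probability it is a call
end_rate = 288.0               # conversations end after ~5 minutes

def next_contact(y_ij, y_ji):
    wait = rng.exponential(1.0 / start_rate[(y_ij, y_ji)])  # days to contact
    kind = "call" if rng.random() < p_call else "text"
    length = rng.exponential(1.0 / end_rate)                # days to hang up
    return wait, kind, length

print(next_contact(1, 1))
```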
All right. So here's the dataset we used. This is the Reality Mining dataset. We actually used the first version of the dataset; there's a second, more complete version out. Essentially, some people at MIT convinced a number of students to put on their mobile phones an application that monitored who they called when and when they sent messages. It actually monitored a bunch of other stuff too, but we're ignoring that part for this one. That was over the course of a year. We kept everybody in there who essentially had a valid phone number. We don't know the phone numbers themselves, but the data was inconsistent in some ways, so we threw out anyone whose data was inconsistent. We ended up with 25 people from the Sloan business school and 54 people from the MIT Media Lab. This is not surprising; those were the two groups involved with setting up the study. And then 13 people whose affiliation we don't know, because they were not enrolled in the study. So these are people who did not choose to be part of the study, but more than one person they knew chose to be part of the study and called them at some point. Yes? This is important to understand. Nothing was running on their cell phones, but some of their friends were blabbing about what they were doing, okay. So we only used these phone calls and text messages.

So we learned a bunch of parameters. We take all that data, we observe only the communication patterns, we do EM to estimate the parameters, and I'm going to give you the parameters and then we'll do something else with them. So first, the network dynamics; this is from Snijders' model. We get the rates of change. Everything here is in units of days -- well, this one is; these are just unitless numbers. So this essentially says that you don't tend to make random friends: all of these things say you tend not to propose a friendship with someone random. This says that you really tend to propose friendships to people who are already friends with you, and that activity and popularity, which are sort of measures of the number of people connected to you who are connected to somebody else, are not as important. An interesting thing is we've tried this same model on other kinds of datasets. There's one with some panel data where they interviewed or surveyed a set of teenaged girls, like early teenaged girls, in some school somewhere in Europe -- I think it was Scotland, can't remember where -- and they asked them a number of attributes, like year one, year two, year three, to see how the friendships changed. We actually get kind of similar numbers out here. That was kind of interesting.

>>: Here the [indiscernible] means the communication?

>> Christian Shelton: This is the rate at which you propose a change to your network. So that means on average once every 40 days is what this is.

>>: So that's part of the Q matrix?

>> Christian Shelton: It's related to it. The relationship is a little complicated, because it involves this. Then when you go to make a change, you score every change you could make: I could drop you as a friend, or I could add him as a friend, or add him, or add him. I consider one unilateral change. I score them all, and then I roughly pick the one that's max, the one that gives me the best score. And the question is how do I score them. And the answer is I score the resulting network according to its density, the local network density, the reciprocity, the activity and the popularity, and I combine them with these linear weights. So this says I tend to move to networks that I have reciprocity in.

Now, the communication pattern -- these are the rates of the communication pattern. So that was the rate for the underlying social network; here are the rates for the communication patterns. So this is communication from K to L: whether neither considers the other a friend, K doesn't consider L to be a friend but L considers K to be a friend, the reverse, and they both consider each other to be a friend. And if we just take this line here, this essentially means that in expectation they tend to contact each other once every three to four days. Eighty percent of those are phone calls, 20 percent are text messages. This was 2004; text messages weren't as popular then, I guess. And this is the end rate for a conversation: the average conversation ends in about five minutes. And you notice the rates here differ by huge numbers of orders of magnitude. I'm not going to be able to capture these things very well or efficiently in a uniformly time-sliced model. Okay.
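Back-of-the-envelope arithmetic for that point, using the rough rates just quoted: a uniform slice width fine enough for the fastest process forces an enormous number of slices per event of the slowest one.

```python
# Why uniform slicing struggles here: the quoted rates span orders of
# magnitude. Rates are rough figures from the talk, in events per day.
rates = {
    "network change": 1.0 / 40.0,   # propose a change ~every 40 days
    "contact":        1.0 / 3.5,    # contact a friend every 3-4 days
    "call ending":    288.0,        # conversations end in ~5 minutes
}
dt = 0.1 / max(rates.values())      # slice ~10x finer than the fastest rate
for name, q in rates.items():
    print(f"{name}: mean dwell {1.0 / q:.3g} days, "
          f"~{1.0 / (q * dt):,.0f} slices per event at width {dt:.2g} days")
```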
So then if we fix these parameters, we can go back and ask, okay, what's our estimate -- essentially a smoothed estimate. I have all these observations; what's my estimate at this time of what the social network looked like? Okay. And so here's the estimate, for instance, at August 19, at November 17 -- actually at midnight, because it's a continuous time quantity -- and at February 15. Now, what I'd love is if they'd gone back and asked people what their friendships were, so I could go validate it. I don't have that information from this dataset, and it's hard to find a dataset that has good information like that. So all I can say is, doesn't this look reasonable? And it's not a very -- yeah, it's not as convincing. I'll perfectly admit that.

So one thing to note is that the algorithm did not know these groups; they were all just given random IDs. And we can see the connections are more dense within the groups. So the algorithm tends to cluster: you know, the Sloan people know each other and the Media Lab people know each other. Business school students are more social than media lab students -- there's a selection bias there, right. And furthermore, these 13 people who we don't know who they were, they seem to be related to the Sloan business school students.

But here's something interesting. I'm estimating social network connections -- this is a heat map, I should make that clear. White means we're pretty sure there's not a connection. Black means we're pretty sure there is a connection. Orange is reasonably sure, and yellow is not so sure, okay?
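A sketch of rendering such a heat map, assuming a matrix `edge_prob` of posterior link probabilities (random placeholder data here); the colormap approximates the scale just described, running white to yellow to orange to black.

```python
# Render posterior link probabilities on the described white->yellow->
# orange->black scale. edge_prob is placeholder data, not inference output.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

cmap = LinearSegmentedColormap.from_list(
    "friendship", ["white", "yellow", "orange", "black"])

n = 92                                   # 25 + 54 + 13 people; illustrative
edge_prob = np.random.default_rng(2).random((n, n))  # stand-in marginals
plt.imshow(edge_prob, cmap=cmap, vmin=0.0, vmax=1.0)
plt.colorbar(label="P(link from i to j)")
plt.show()
```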
I'm estimating here friendships between pairs of people whose communication pattern I've never observed, because their phones were not monitored. I've only observed when they called someone in this network. Furthermore, the people who were in this study came in and out of the study. It's not like I observed them continuously over a block of time: someone joined the study here and dropped out of the study here. I'd love to be able to verify these. So how do I do that at all? It's not that I observed that they didn't communicate with each other; it's that I didn't observe whether or not they communicated with each other. But I have certain notions of what social networks should look like, in terms of reciprocity, in terms of communicativity and stuff like that, and that at least gives me an estimate here. Now, I don't know how accurate that estimate is. These are really hard to verify -- they didn't even agree to participate in the study, so I can't go track them back down. But I'd love to know, because there's this one here that's solid.

>>: You have the phone numbers, right?

>> Christian Shelton: No, I have anonymized versions of their phone numbers. They got, you know, one-way hashed onto some number. Call them up: hi, back in 2004, did you happen to know somebody with this phone number?

Oops. Oh, no, that was terrible. Hang on. I'm essentially done here, so let me just go here. Okay. So I'll give two plugs. One is that the code for almost all the CTBN algorithms is available on a website. We're hoping to release a new version of it soon; the current version that's there is not as numerically fast as we'd like. We completely redid the whole matrix package with Eigen, which is a pretty fast matrix package, and it works much better. Also, I'm giving a UAI tutorial on continuous time processes along with Gianfranco Ciardo, who is a professor in verification, to give sort of the other side of this -- these kinds of models have been used in verification, Petri nets, that sort of stuff. We'll do a tag-team kind of tutorial. You'll see some of the same slides, but not all of them, at UAI.

The last thing is I've tried to at least argue the case to avoid time slicing. There are certainly some cases where your data is naturally time sliced. So if you want to model daily high temperatures, there's a natural time rate for that, right? Day by day. In fact, it isn't a continuous sort of thing, okay. So there are certainly cases where discrete time is the only way to go. But if the underlying process is continuous time, I think you should at least admit it -- just like, you know, I'm going to go implement this algorithm on a computer, but I nevertheless pretend I have infinite precision floating point numbers when I go to analyze the algorithm and develop it.

>>: [inaudible].

>> Christian Shelton: That's harder to say. You notice I haven't covered continuous state; this is all discrete state. We've done some work on continuous state. The Kalman filter has a continuous time version, and these sorts of things become stochastic differential equations -- you know, classic option pricing is built on stochastic differential equations. They do exactly this: they treat it as a continuous time process, yeah.

>>: [inaudible].

>> Christian Shelton: I've only done discrete state here. We can talk later about the continuous state one, but there's a finite amount of time in the talk.

>>: You probably answered this throughout the talk, but I'm not very fluent in this. So you've made a good case against sampling time, because things can happen rapidly [indiscernible] for a while. But what's the objection to thinking of a model where, just every time an event occurs, that's my time?

>> Christian Shelton: So you're saying what I could do is build what's called the skeleton -- the underlying Markov process on the sequence of events. Right. You can do that. The question is, you might care about the timings. So you say I go from state one to state two to state one to state two to state one to state two. But I might care that when I'm in state one, I stay there three times as long as when I'm in state two.

>>: Okay. [indiscernible] of the variable, feature of your --

>> Christian Shelton: You can do that. You don't end up with a Markov process; you end up with maybe a semi-Markov process or something else like this. One of the large drawbacks of Markov processes, discrete time or continuous time, is that the dwell time in a state has to be either geometrically or exponentially distributed. The geometric or exponential distribution is the only memoryless distribution: if you condition on having been in the state for some amount of time, the amount of time remaining still has the same distribution. That's what it means to be Markovian. So if you want something that is not that way, you have to move to at least a semi-Markov process. We've done that a little bit -- I haven't shown it here. We've done things where the rates vary cyclically, say, based on the time of day. You can incorporate that in here to make sort of these semi-Markov processes. But yeah, if all you care about is the sequence of events and not their timings, then that's right, you definitely have a discrete time problem and you should treat it as that. I'm not going to argue against that. I don't know, did that answer your question, or did I successfully skirt around it? I'm not trying to skirt around it. The other thing is -- here's the other way to look at it. I have observations that are tied to times, usually. I ask a sensor what its value is at a particular time; I don't ask how many events have happened since the beginning of time in order to know how to place you on a timeline. So you're talking about a model in which, if I made observations, I'd need to know the number of events that have happened, whereas it's more naturally the amount of time that's elapsed. Maybe that's a different answer also.
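A numeric check of the memoryless property just mentioned, for an exponential dwell time T with rate q: P(T > s + t | T > s) = P(T > t), so the remaining dwell time "forgets" how long you have already been in the state. A non-exponential dwell time breaks the identity, which is why such models become semi-Markov.

```python
# Memorylessness of the exponential: P(T > s+t | T > s) = P(T > t).
import numpy as np

q, s, t = 2.0, 0.7, 1.3
surv = lambda x: np.exp(-q * x)          # P(T > x) for Exponential(q)
assert np.isclose(surv(s + t) / surv(s), surv(t))

# Contrast with a non-exponential dwell time: the identity fails.
weib = lambda x: np.exp(-(x ** 2.0))     # Weibull(shape=2) survival function
print(weib(s + t) / weib(s), weib(t))    # not equal
```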
>>: How does all this relate to point processes?

>> Christian Shelton: You can build these things off of Poisson point processes.

>>: So the [indiscernible] you talk about --

>> Christian Shelton: Is directly related, yeah, to Poisson point processes, um-hmm. Yep. Any other questions?

>>: You mentioned your [inaudible].

>> Christian Shelton: How big can I build this up to? Well, it depends. The social network, right, has a very large state space. If you don't include the communication variables -- you just include the other ones, because the communication variables I essentially always observe -- then the state space is two to the 97 squared minus 97: 97 people, okay, so 97 squared minus 97 possible arcs, and two to that is the state space there. That's big. I'd argue that's decent sized. We're doing sampling in that case, and there's a lot of internal structure: essentially, a few rates are governing a lot of what happens, and I assume people are essentially homogeneous, okay. For things in which that's not the case, you can do exact inference -- exact inference for, oh, at least somewhere between 10 and 15 variables, okay. And then how well you can do approximate inference after that sort of depends on how much time you're willing to throw at it and what fidelity of answer you need, so you can go up to somewhere between 10 and 100 variables easily, sort of depending on that. And beyond that, you probably have to rely on something else currently. I'll say also that each year, we get a little bit better about figuring out how to make our approximate methods a little better. So I wouldn't be surprised if in a couple years I came back and said we can do a thousand variables without a problem. But currently, that's probably not feasible for our software.

>>: Can you parallelize it?

>> Christian Shelton: The sampling ones are ridiculously easy to parallelize, yeah. Some of the learning can be easily parallelized as well. For the first one I showed, where you're pushing forward and getting the approximation, you can parallelize the matrix multiplications, but that's at a much finer grain and harder to do, yeah. It would certainly be feasible; we haven't yet looked at that, but yeah. Great. Thank you very much.
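A minimal sketch of the embarrassingly parallel sampling mentioned in that last answer: independent trajectory samples farmed out across processes. The toy two-state sampler below is a hypothetical stand-in for a real CTBN forward sampler, not the talk's software.

```python
# Independent trajectory samples in parallel; sample_trajectory is a toy
# two-state continuous-time Markov chain, a stand-in for a CTBN sampler.
import numpy as np
from multiprocessing import Pool

def sample_trajectory(seed, horizon=10.0):
    rng = np.random.default_rng(seed)
    rates = (1.0, 3.0)                   # leaving rates of states 0 and 1
    state, t, events = 0, 0.0, 0
    while True:
        t += rng.exponential(1.0 / rates[state])  # exponential dwell time
        if t >= horizon:
            return events
        state, events = 1 - state, events + 1

if __name__ == "__main__":
    with Pool() as pool:                 # one independent sample per task
        counts = pool.map(sample_trajectory, range(1000))
    print(np.mean(counts))               # Monte Carlo estimate of event count
```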