>> Asela Gunawardana: So I'm very pleased to welcome Christian Shelton. He's
an associate professor at UC Riverside. He got his Ph.D. from MIT and did a
post-doc at Stanford before that. He's one of the pioneers in continuous time
modeling. Christian Shelton.
>> Christian Shelton: Thank you. So my research is in machine learning and
I'm particularly interested in dynamic systems. So I'm interested in all forms of dynamic
systems. For about the past ten years or so, I've been interested in models of
continuous time systems, systems that are asynchronous.
So let me give you an example of some asynchronous stochastic systems. So
phylogenetic trees, so you have genetics and they, you know, different species
change at different rates over time. Social networks, I'll give you an
example of some social network examples.
What I'm going to be spending the next year on in my sabbatical, ICU patients.
This is a large stochastic system that you'd like to reason about and control.
Software verification for a long time has dealt with models of stochastic
systems and some others.
What's interesting about all these systems is that they evolve naturally in
continuous time. They're discrete events that occur in these systems, and that
the rate of these events can change drastically from component to component in
the system and over time. So there's not maybe a constant rate of change in
these systems.
So this talk is organized in sort of three components. The first component is
I'm going to try to explain why continuous time is an important modeling tool.
So computers themselves are actually discrete time entities, right, there's
a clock that runs on your computer. But just like we use real values when we
derive our algorithms, despite the fact they're going to be implemented in the
computer and essentially on integer arithmetic, treating time as a continuous
quantity is important. So that's what I'm going to talk about first.
Then I'll talk about some of the work that we've done in models of such
continuous time systems, and then I'm going to show some examples.
So here are some theoretical and then I'll have experimental reasons why
continuous time is advantageous. So consider the typical discrete time system
as a Markov chain. So you have a system that evolves over time. This is sort
of a very simple Markov chain. Here we have a row-stochastic matrix
describing that chain. That is the probability of staying in the first state
from one time to the next is 75%. Otherwise, you move to the other state. If
you're in state two, you switch 50 percent of the time. Okay.
So that's fine -- it depends on how you've described the system -- but if your actual system was in continuous time and it evolved, and you just happened to be sampling at this particular rate and you end up with a matrix like this, we can ask the question of, okay, what would the stochastic matrix look like if we were interested in twice the sampling rate, or half the window size.
So that means that we need basically a stochastic square root matrix. So we
need a matrix like that. And in this case, that gives us that matrix there.
So that's fine.
This describes the same system at twice the sampling rate.
So good.
So now, what if my system looks like this? So the system flips back and forth.
We can ask the same question again, and we end up with this matrix. And I guess if
you're in quantum mechanics, it doesn't bother you. For the rest of us, we
don't like having imaginary components to our probabilities.
So there is no Markov system at half the rate that is equivalent to this system
at this rate. And it's not an artifact of my having zeroes in there. If I make them 0.1, the same thing happens. It's just that the numbers are messier.
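A minimal numerical sketch of these two examples, assuming NumPy and SciPy (the entries of P are the ones just described; F is the back-and-forth system):

import numpy as np
from scipy.linalg import sqrtm

# The first chain: stay in state one with probability 0.75; switch
# out of state two half the time.
P = np.array([[0.75, 0.25],
              [0.50, 0.50]])
print(sqrtm(P))   # real, non-negative, rows sum to one: a valid chain

# The flipper: it deterministically switches every step.
F = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(sqrtm(F))   # complex entries: no stochastic square root exists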
So there are a couple ways of viewing this. One is that the space of discrete
time Markov systems is larger than the space of continuous time Markov systems.
There are systems that at a discrete time are Markovian, but there's no
continuous time that is Markovian.
So if you're viewing your Markov assumption there as a, say, regularization or
a convenience, okay, then maybe this doesn't bother you. But the reason that I chose this case is that if the underlying system is truly Markovian, you may, using discrete time, go off and estimate a system that actually doesn't correspond to any Markovian continuous time system underneath.
>>: But what troubles me about this is that in real life, like social
networking, I mean, if this occurred too often, you just simply reduce the
[indiscernible] time and then you avoid all these problems.
>> Christian Shelton: Yeah, um-hmm, sure. Well, yeah, presuming that you've
-- so you have to know ahead of time how small you need it to be and then your
computational time grows. I'm going to talk about the computational time and other factors, yeah. That's right.
Okay. So the other problems I want to represent don't show up in this sort of
flat Markovian system but show up in a structured one. So, you know, if you have N states, you need an N by N matrix to describe a discrete time Markov system. But we usually don't describe things in terms of states; we describe them in terms of assignments to variables.
So if you have, say, N binary variables, there are two to the N different assignments to those binary variables, so I need a two to the N by two to the N matrix, okay. So the answer is that I need some compact representation for that, because that's not tractable for any reasonable N. And decision diagrams have been used in the computer science literature; dynamic Bayesian networks are more common in machine learning and AI. But there's some
problems here. I'm going to focus on DBNs, because I think that's more
familiar to this audience.
So here's the simplest DBN I can have. So I have two processes. Process A,
you know, is a Markovian process that doesn't depend on anything else. It goes
its merry way. Process B depends on A, because if it doesn't depend on A, then
I have a really truly simple system.
So now let's ask the following question. What happens if I unroll that for another time step, and again I ask the similar question, which is what
if I instead want a DBN that described the system across two time steps instead
of across one. I don't like my sampling rate. I'd like some other sampling
rate.
So I have this one here and I marginalize out the two variables in between, and
I get this. At least if I want to describe it in terms of the DBN, I get this
structure. And notice the structure has changed. I have an extra edge here
that didn't show up before. What does that mean? It means in some sense that
this particular structure was not just a function of the underlying process.
It's a function of the underlying process and a particular sampling rate.
Okay.
So put differently, if I have this underlying structure at half the sampling
rate, and now I ask the question what structure could I have marginalized to
get here, the answer is there are none. Which isn't to say there isn't a DBN.
There's a DBN, but its structure doesn't come out like this when you marginalize it. This independence assumption here is hidden inside of the
probability distributions here. It's not representable as in the graphical
model framework.
So the basic thing here isn't to say necessarily something's wrong but is to
say that your structure is therefore sensitive to your time slice width. So if
your time slice width truly is something that's, you know, inherent in your
process, then fine, that's great. You have a process that actually has a rate
to it. But if you have a process that does not actually have some global rate
to it, then your structure you've estimated is not some inherent property.
So those are two theoretical reasons maybe not to like a discrete time model or
to be somewhat concerned about it. Empirically, these are also true. If you
talk to practitioners, they kind of know this. So here's the simplest example.
I have a process of four variables, okay. The first variable is a Markov
process that proceeds as it wants at a rate of approximately one. The next
process tries to follow the one above it. The third process tries to follow
the second one and the fourth one tries to follow the third one. That is, if
they disagree with their parent, they switch relatively quickly. Otherwise,
they tend not to switch.
Now, I'm going to sample a bunch of trajectories from those and then I'm going
to try to learn back the network structure. And I'll use a DBN. So here I'm
increasing the number of samples I have, and here I'm increasing the sample
width. I'm decreasing my rate of sampling.
And you get, obviously, a slightly different structure back every time, but
these are pretty indicative structures of the ones you get back. So if I have
a lot of samples and a very long width compared to the actual sort of natural
rate of the process, I basically learn back a stationary distribution for this
process.
This one's very good here. If I have a very fine width and I have a lot of data, then I basically learn back the correct structure. And in between, I learn back all sorts of crazy things. But more importantly is this plot. So let's take each of those ones learned, let's run this experiment a few hundred times, and compare the model I get back, how well it predicts future data. In some sense, a proxy for KL divergence to the true distribution.
So on top is if I use the correct model, okay, and this is if I use a very fine time model, and this is, let's see, sort of the one that's at the natural rate, and this is a very coarse-grained time model. And the
thing that's interesting here is the correct time slice to select depends on
how much data you have. Yes, that's a problem if you're going to go about, if
you're going to pick something. Yes, that's inconvenient and annoying, right.
So in this data regime, you do better here.
So what I'm going to show you is a method that produces this. Now, some of
this is a little cheating. The model class contains exactly the one from which this data came, so it's not surprising I do well. But with very tight error bars, it beats them all, okay. All right. So what's the alternative? Just to give some
background. I think people are more familiar with discrete time models than
continuous time models.
What's the background? So here's this stochastic matrix here. There are a
couple of ways of interpreting the stochastic matrix. Let's take a particular
row here. They all sum to one. One view is that if I'm in state one, what
this row means is that after one time step, there's an 80% chance I'll be in
the same state and a 10% chance I'll be in state -- I guess I have it labeled
from zero. 70% chance I'll be at one, 10% chance I'll be in state two. Okay.
The other way of viewing this is in terms of dwell times: I stay in state zero for a geometrically distributed number of time steps, okay, and then afterwards I switch to one of these two states, proportional to the elements in that matrix.
It's the equivalent view of the same thing. So the alternative in a continuous time system is to describe a rate matrix, also called an intensity matrix or sometimes a Q matrix, depending on what you like.
This is a matrix in which all rows sum to zero. The diagonal elements are
non-positive and the off-diagonal elements are non-negative. We have a similar
interpretation, there's one row per state. This row here describes what
happens if I'm currently in state zero. And the two views are somewhat
similar. So this row here means that after an infinitesimally small period of time -- that is, as epsilon goes to zero, okay, in the limit -- the probability I stay in the same state over that period of time is one minus this quantity here times epsilon.
And the probability I move here is this times epsilon, probability I move here
is this times epsilon. That's the infinitesimal generator.
Alternatively, I can view it in terms of dwell times. So this states that I stay in state zero for an exponentially distributed time, the continuous version of a geometric, with rate 0.24. And once I leave, of course, I can't come back to the same state -- otherwise, it means I didn't leave. When I leave, I go to this state proportional to this amount and this state proportional to this amount.
So again, there's an even chance of my going to the two.
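As a sketch of that dwell-time view, here is how one could forward sample a trajectory from an intensity matrix (row 0 uses the 0.24 rate and the even split just described; the other rows are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Row 0: leave at rate 0.24, then go to either other state with even
# chance. Rows 1 and 2 are invented for the example.
Q = np.array([[-0.24,  0.12,  0.12],
              [ 0.50, -1.00,  0.50],
              [ 0.10,  0.30, -0.40]])

def sample_trajectory(Q, s0, T):
    t, s, traj = 0.0, s0, [(0.0, s0)]
    while True:
        rate = -Q[s, s]
        t += rng.exponential(1.0 / rate)   # exponential dwell time
        if t >= T:
            return traj
        p = Q[s].clip(min=0.0) / rate      # next state proportional to rates
        s = int(rng.choice(len(Q), p=p))
        traj.append((t, s))

print(sample_trajectory(Q, 0, 25.0))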
>>: The first view that you had there, epsilon to be 10, I get a negative --
>> Christian Shelton: Yeah? This is only valid as epsilon goes to zero.
>>: So this should be the limit?
>> Christian Shelton: Yeah, it's the limit. Sorry, I didn't make that more
clear, yes. Yeah. Yes, that's right. It's the limit. Okay. So now how do
you use this sort of thing? A standard question to ask of this matrix is to push the marginal distribution forward in time. I have a marginal distribution represented as a row vector at time zero, and to push it forward, I simply do a matrix multiplication. That gives me the marginal distribution at time one. If I want the marginal distribution at time two, I essentially do the same thing again, which amounts to multiplying by the matrix squared, et cetera. Okay.
So down here, the equivalent question is -- now I'm using this notation to denote that the argument is a possibly real-valued number -- I have a row vector here that represents the distribution at time zero. To push forward to time T, I use the matrix exponential; that's sort of the equivalent of that. The matrix exponential, of course, is this Taylor expansion there, which I'll touch on a little bit later, or alternatively, it's the solution to this ordinary homogeneous differential equation. Sort of the most straightforward differential equation you can ask.
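A sketch of the two push-forward operations (NumPy/SciPy; the Q here is made up, since the slide's matrix isn't in the transcript):

import numpy as np
from scipy.linalg import expm

p0 = np.array([1.0, 0.0])

# Discrete time: the marginal after k steps is p0 times the k-th power.
P = np.array([[0.75, 0.25], [0.5, 0.5]])
print(p0 @ np.linalg.matrix_power(P, 2))

# Continuous time: the marginal at any real t solves dp/dt = p Q,
# i.e. p(t) = p0 exp(Q t).
Q = np.array([[-0.5, 0.5], [1.0, -1.0]])
print(p0 @ expm(Q * 2.7))   # t need not be an integer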
Okay. So the first question usually is, well, that seems a lot harder.
Differential equations compared to matrix multiplication, that doesn't seem to
be any better. So I have a three-state system. So essentially, to solve this
differential equation, I'm trying to integrate this. So this is just the
derivative of that over time. So I have a distribution here at time zero.
I'm trying to get a distribution, say, at time eight.
But actually, computationally, not to write the algorithm, but to have the
computer actually run the algorithm, this can be a lot simpler. Why? Because
I'm not going to do this integration by just sort of some standard Euler
integration. I'm going to do it by some adaptive integration method, in which
I take an estimate of what the derivatives are here and what the curvatures are
and decide how far I can jump out. So in time periods when these distributions
are changing drastically, I will spend a lot of computational time to estimate
very carefully what goes on here.
But in time periods where things are not changing very much, I will adapt my
integration step size and take large jumps. And so computationally, I can get
by with taking probably many fewer jumps to get the same accuracy over here
than I would if I just treated it as a discrete system and sampled at some
particular rate.
I'll touch on that later. So these are like Runge-Kutta-Fehlberg methods, these sorts of adaptive integration methods where you take a bunch of derivatives near your point, you see how far and how fast you can go without increasing your error by too much, and then you take an adaptive step size.
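A small sketch of that idea, using SciPy's adaptive Runge-Kutta solver (RK45, a relative of Runge-Kutta-Fehlberg; the rates are made up):

import numpy as np
from scipy.integrate import solve_ivp

Q = np.array([[-5.0, 5.0],
              [ 0.1, -0.1]])       # one fast rate, one slow rate
p0 = np.array([1.0, 0.0])

# Integrate dp/dt = p Q from t = 0 to t = 8 with adaptive step sizes:
# small steps while p changes quickly, long jumps once it settles.
sol = solve_ivp(lambda t, p: p @ Q, (0.0, 8.0), p0, method='RK45')
print(sol.t)          # the steps the solver actually chose
print(sol.y[:, -1])   # the marginal distribution at time 8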
Okay. So what we want to do is build models like this that are for systems
that are described in variables, not in terms of flat state space like I was
doing before. So just to sort of set a little bit of what we're talking about,
so I'm going to talk about a factored model. I'm going to talk about
continuous time Bayesian networks, which is the factored model that we
developed. There are some others from the verification literature, Petri nets
and things like this. They tend to be very focused on steady state
distribution properties and not on learnability and estimation from data. So I
don't want to give you the impression that this hasn't been worked on before,
but that's sort of the work I'm kind of ignoring here.
So basically, what I'm trying to do is describe a distribution over trajectories. A trajectory would look like this. If I have three
variables, it would say the variables start at this particular moment and then
asynchronously at various real-valued times, they switch. This switches from
light green to dark green here and then shortly after, this switches from
orange to red and this one switches from dark blue to light blue, et cetera.
I'm trying to describe a distribution over this. A particular sample
trajectory can be described by a finite but unbounded number of switches and
the times that those switches happen. Those are real-valued times and the state
after the switch.
The evidence I might care about might look something like this. It's the same
trajectory. I've just removed parts I didn't know about. So at various
instants, I might know the value of certain variables, just for an instant. We
call that point evidence. For periods of time, I might know that it was green
solidly from here to here and dark blue solidly from there to there. We call
that interval evidence. Over some periods of interval evidence, I might
actually observe transitions so I know that a transition happened here. And
here I know that no transition happened.
There are other kinds of evidence you might have. You might know that between
here and here, it only transitioned once, but you don't know exactly when.
Things like that. You can incorporate all those into this kind of evidence
model.
>>: This discrete time model can do the same thing, right?
>> Christian Shelton: It depends what you mean.
>>: A factored model could have done that, right. [Indiscernible] otherwise, you know, different dimensions.
>> Christian Shelton: Sure, so, I mean, a DBN is an example of a factored discrete time model. Yes.
>>:
And then actually can model the single, or maybe this --
>> Christian Shelton: Yes, so right so you certainly are modeling
trajectories. Whether or not you view that you've captured everything, so if I
know it's light green here and at this point I know it's dark green, do I know
it only transitioned once in between or two or three times in between? So a
discrete time model does not tell you what happens between those time points.
>>: It's just a precision issue, how precisely you want to represent the transition.
>> Christian Shelton: Right. So the more precise you want to represent it,
the more computational time you're going to take to propagate across a
particular unit of time. If I want to use a delta T of, you know, 0.001, then to propagate across one unit of time, I have to propagate 1,000 times. That's
right, yes.
>>: [Indiscernible]. So this is a factored model.
>> Christian Shelton: I'm going to build a factored model. A factored model
essentially means the state at any time is an assignment to variables, right,
and so I'm saying as an example, in continuous time, what that would look like
is not at this time I have this or at this time I have this and at this time I
have this, but continuously over time, a trajectory will look like this. Did
that answer it?
>>:
Yes.
>> Christian Shelton: Good. So a CTBN is essentially built on a graphical
model framework. It's a graphical model. Each node is a process. Not a
random variable, but now a whole process. A Markov -- well, no, kind of a
Markov process. Edges here represent instantaneous influence. So the simplest
one I can give is this. So I have a process A and a process B. Process A
proceeds without caring about anybody else, and a process B depends on process
A.
So what do I need? Process A is therefore a Markov process. I have to describe, in addition to its starting distribution, which I'm ignoring for this talk, its rate matrix. So there's an example of a rate matrix. Not chosen arbitrarily.
And for process B, I have two rate matrices. So at any given instant, its rates of changing are governed by the state that A is in at that instant. So if A is in state zero, this is the transition rates for B, and if A is in state one, these are the transition rates for B.
>>:
Just to make sure, you didn't draw a self-edge from A to A, so --
>> Christian Shelton: Right, so a self-edge is always implied. If you -- if the state of A does not depend on its state in the instant before, then I don't know what that means.
>>:
You just run it, right?
>> Christian Shelton: Literally, instant by instant. Then I have an
uncountable -- so right. So I mean, the discrete time equivalent of true white
noise, right, which has infinite power. And so yeah, so I don't mean that,
right. Yeah.
>>:
So self-edges are implied.
>> Christian Shelton: Yes, that's right, yes. If you can think of any
variable that has sort of some meaning as having some continuity to it, even at
a very small interval, that's right. Good.
>>: Each [indiscernible] represents a process.
>> Christian Shelton: Right, this node is the whole process, that's right.
>>: A continuous time Markov process.
>> Christian Shelton: That's right. And the whole thing together also represents a continuous time Markov process over the joint space of those two.
>>: Switching continuous time.
>> Christian Shelton: In some sense yes, the rates of the switching are based on this here, that's right.
Okay. So this whole thing here describes a joint Markov process over the state
space of A and B. And just to give you some idea what the semantics look like,
that means I should be able to build a rate matrix over the joint assignments.
So A0 B0, A0 B1, et cetera. And I can do that in a fairly straightforward way.
First of all no two variables are allowed to change at exactly the same
instant. This is pretty common if you think of the two events can't happen at
exactly the same time. They can happen arbitrarily close together. Not
exactly the same time. So in this particular example, that's the
anti-diagonal. But in general, there are more zeroes than that. So any case
where this assignment and that assignment disagree by more than one variable,
the rate is zero.
If they disagree on A, then I just look it up. So, you know, this is the rate of transitioning from A equals zero to A equals one, so that goes here, because A differs and B stays the same, et cetera. And then, for instance, when A is zero and B
changes, I can look up those from the green matrix, and when A is one and B
changes, I can look up those from the blue matrix. In that sense, I fill in
everything except for the diagonal. And the diagonal I fill in just to make
sure the rows sum to zero.
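Here's a sketch of that amalgamation for the two-node example; the local rate matrices are made up, since the slide's numbers aren't in the transcript:

import numpy as np

QA  = np.array([[-1.0, 1.0], [2.0, -2.0]])    # A's rates (hypothetical)
QB0 = np.array([[-0.5, 0.5], [0.5, -0.5]])    # B's rates when A = 0
QB1 = np.array([[-3.0, 3.0], [1.0, -1.0]])    # B's rates when A = 1
QB  = [QB0, QB1]

# Joint states ordered (A, B): 00, 01, 10, 11.
states = [(a, b) for a in range(2) for b in range(2)]
Qjoint = np.zeros((4, 4))
for i, (a, b) in enumerate(states):
    for j, (a2, b2) in enumerate(states):
        if a != a2 and b != b2:
            continue                     # two variables never change at once
        if a != a2:
            Qjoint[i, j] = QA[a, a2]     # only A changed
        elif b != b2:
            Qjoint[i, j] = QB[a][b, b2]  # only B changed; rates depend on A
    Qjoint[i, i] = -Qjoint[i].sum()      # diagonal makes the row sum to zero
print(Qjoint)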
So that's, in some sense, the semantic meaning behind this -- or one way of viewing the semantic meaning. Now, I don't want to construct this matrix in general, because it's exponentially large in terms of the number of variables, but, you know, you can at least theoretically think about having constructed it.
>>: So [indiscernible] is there an efficient algorithm that can detect the switching between A and B?
>> Christian Shelton: What do you mean, detect the switching?
>>: You're running a process in A and running a process in B. So you don't know, you know, the observation. So I'm just --
>> Christian Shelton: I haven't talked yet about what you do with it. I'm
just talking about a formal definition of a joint process. Then we can talk
about what sort of questions you might want to ask of the process in a bit. So
this is the general equation. It essentially says the same thing. These are joint assignments: if the two joint assignments differ by only one variable, then you just read it off from the relevant local rate matrix for that variable. The diagonals happen to be these particular sums, and everything else is zero.
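In symbols -- my reconstruction from the verbal description, writing q^i for variable i's conditional rate matrix given its parents' current values -- the joint intensity is:

\[
Q_{x,x'} =
\begin{cases}
q^{\,i}_{x_i,\,x'_i \mid \mathrm{pa}_i(x)} & \text{if } x \text{ and } x' \text{ differ only in variable } i,\\
-\sum_{i}\sum_{x'_i \neq x_i} q^{\,i}_{x_i,\,x'_i \mid \mathrm{pa}_i(x)} & \text{if } x = x',\\
0 & \text{if they differ in more than one variable.}
\end{cases}
\]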
So I want to point something out here. If you have N binary variables, this joint matrix has two to the N rows and columns. Each row has order N non-zero elements. So my original description is more compact than your standard sparse matrix representation. A sparse matrix representation contains at least one bit of information for every row.
And this description here has sort of a polynomial amount of information per variable. It's exponential in the in-degree of the graph, but it's polynomial in the number of nodes if you bound the in-degree. Just like a standard Bayesian network.
So here's a classic example from our first paper. It's purely synthetically generated, but it shows cycles are okay. So whether or not I'm eating affects whether or not my stomach's full, which affects whether or not I'm hungry, which affects whether or not I'm eating. That's okay.
So edges here have a causal interpretation. We can argue over exactly what
form of causality it is. It's certainly Granger causality. Whether or not
it's a stronger notion of causality, well, we can argue about that offline.
So d-separation still holds like in Bayesian networks. So a variable is independent of its non-descendants given its parents. And a similar notion of Markov blanket exists. So you're independent of everything else given your
parents, your children, your children's parents. The thing you have to
remember is your children and your parents may be the same people, because you
have cycles in the graph. Okay. But if that worried you about notation and
graph theory, the other sorts of things should be worrying you a lot more,
okay?
But the notion of given means the entire trajectory. So how does this work?
So concentration is independent of hungry, given full stomach. Okay. But that means I know the entire trajectory of full stomach from zero to whatever time point I care about. If I only observe part of it, that's not true. And this is a little like in a hidden Markov model -- well, it's harder to say in a hidden Markov model, okay. So it's harder to say. You just have to observe the whole thing is what I can say. Otherwise, you don't have a full observation of this variable.
Okay. The other important part here is that marginalization does not produce a
Markov process. So uptake is a Markov process. It doesn't depend on anything
else. If I try to marginalize it out and try to incorporate it into
concentration, the result is not a Markov process. This is like in discrete
time, I have a hidden Markov model. I have the X states and the Ys that come
off of it. If I try to marginalize out the Xs, the distribution of the Ys is
not a Markovian process. That's the purpose of having a hidden Markov model,
okay? So the same thing is true here. If I marginalize out this variable, the
description I'm left with here is not a Markov process anymore. In fact, if I
marginalize everything out, the size of the description grows exponentially.
Okay. So this is a member of the exponential family, like all good distributions, I suppose. The sufficient statistics are, for each variable, for each value its parents can take on, and for each pair of values it can take on, x_i and x_i': the number of times in the trajectory it transitioned from x_i to x_i' while its parents had value pa_i, and the amount of time that variable spent in this particular state while its parents were in that particular state. Those two things are sufficient statistics, and then you get this linear form in terms of the sufficient statistics and the parameters of the distribution.
So this is a sum over every variable, every instantiation of its parents, every instantiation of that variable, and every other instantiation of that variable, x_i' not equal to x_i.
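As a reconstruction of that linear form (the standard CTBN log-likelihood, writing N[...] for the transition counts and T[...] for the dwell times just described):

\[
\ln p(\sigma) = \sum_i \sum_{\mathrm{pa}_i} \sum_{x_i}
\Bigl( \sum_{x'_i \neq x_i} N[x_i \to x'_i \mid \mathrm{pa}_i]\,
\ln q_{x_i,x'_i \mid \mathrm{pa}_i}
\;-\; q_{x_i \mid \mathrm{pa}_i}\, T[x_i \mid \mathrm{pa}_i] \Bigr),
\qquad
q_{x_i \mid \mathrm{pa}_i} = \sum_{x'_i \neq x_i} q_{x_i,x'_i \mid \mathrm{pa}_i}.
\]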
Oops, wrong button. Okay. So other questions as a machine learning person you
might be interested in are how can you learn such a process. How can you
estimate such parameters. So let's assume I give you the structure and I just
want you to estimate the local rate matrices, the Q matrices. That's trivial. Basically, you have a bunch of multinomial distributions and a bunch of exponential distributions, and you just go read off the parameters from the sufficient statistics here. It really is quite trivial.
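A sketch of that read-off for a single variable under one fixed parent instantiation, from hypothetical counts:

import numpy as np

def mle_rate_matrix(N, T):
    """N[x, x2]: transitions from x to x2; T[x]: total dwell time in x,
    both tallied while the parents sat in one fixed state."""
    k = len(T)
    Q = np.zeros((k, k))
    for x in range(k):
        for x2 in range(k):
            if x != x2 and T[x] > 0:
                Q[x, x2] = N[x, x2] / T[x]   # transitions per unit of dwell
        Q[x, x] = -Q[x].sum()
    return Q

N = np.array([[0, 7, 2], [5, 0, 5], [1, 3, 0]])   # made-up counts
T = np.array([30.0, 12.0, 8.0])                   # made-up dwell times
print(mle_rate_matrix(N, T))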
>>:
You have to know when the switching occurs from one --
>> Christian Shelton: That's right. I'm saying it depends. I'm saying if you have a system in which you have complete data -- that is, I observed all variables at all times -- then this is trivial.
>>: But the example you gave --
>> Christian Shelton: What example?
>>: [Inaudible].
>> Christian Shelton: Maybe. I don't know. I might know those, I might not. I'll cover that in a moment. So the structure here is also particularly
simple. So unlike a Bayesian network, in which learning structure is a
somewhat difficult process, okay, it's not true for a CTBN, because cycles are
okay. The whole thing that makes Bayesian network difficult is that you can't
allow cycles. Therefore, you have to search among the set of acyclic graphs,
which is not a nice set to search under.
I don't have to search under that set here. So if I bound the in-degree of my
graph, there's a polynomial time algorithm for searching for the best graph,
and I find the global maximum, because I can consider each variable's parent
set independently and just optimize it independently. In fact, you can do that
for Bayesian network too if you allow cycles.
So for incomplete data -- that is, there is at least some time point at which I didn't observe some variables, right; there might be variables I never observed, there might be variables I didn't observe for specific periods of time, or I might have only sampled at some regular rate and didn't know what happened in between -- then to learn parameters, you need to use expectation maximization. I have to estimate the expected sufficient statistics, and I'm back up there. And I'll talk about that a little bit in the next slide. And actually, the structure is not too bad. So structural EM works for Bayesian networks too. It's a little more of an art. It's not so bad here, mainly because the structure search step is exact. I get to the global optimum, so I don't have to worry about as many things -- am I running off here and then maybe I didn't find the global optimum, and how do I trade off iterations of the structure search versus iterations of my E-step, and things like that. You don't have to worry about that as much. Not to say that you couldn't worry about it, but you don't have to.
Okay. So how about inference. So this is the task where I'm given a partial trajectory and I want to infer in some sense what happened when I wasn't looking, or where I wasn't looking. Okay. So I think I mentioned before, marginalization produces non-Markovian processes. So you can't just do a sort of variable elimination style algorithm, because your representation size will grow without bound as you do that.
So furthermore, if I'm trying to do filtering, I can't just push the
distribution forward over time, because as I push over any interval, suddenly
all the variables become tied together. Just like what happens with
entanglement in a DBN. So actually, these things happen in discrete time
models too, like in DBNs, it's just that they aren't as apparent. It looks,
DBNs look like Bayesian networks, and Bayesian networks are nice this way. So
they look a little bit better. But when you actually start like working on
DBNs, you find you have all these same problems again. So it's not like I've
introduced new problems. I've just made them sort of obvious from the
beginning.
Okay. So you probably have a favorite approximate inference algorithm.
Hopefully, it's on this list. And somebody has done it for continuous time
Bayesian networks, and that's all I'm going to say. So expectation [indiscernible], importance sampling, particle filtering, Gibbs sampling, general Markov chain Monte Carlo, mean field, belief propagation recently, and then this one's a little bit special too, continuous time.
So I don't have time to cover all of those, and you don't have patience to
listen to all of those, I assure you.
>>:
[Indiscernible].
>> Christian Shelton: I don't know if that makes inference any easier. The
structure learning is simpler because of that, because I don't have to run the
graph, but if I'm given a graph and the parameters, estimating what happened
when I wasn't looking, I don't know if the cycles make things easier or worse.
I don't think it changes it much. Okay. So I want to get into a little bit
behind this one and this one, because I think they show some interesting things
about continuous time processes, and I'm going to show them on a bit of a high
level.
So one of them's mine and one of them's not. That makes me feel sort of more, you know, egalitarian about this.
So the first one is mine, and it's filtering. So I've now decided to turn the time axis on its head. So here I have three variables. What does
filtering look like? Filtering is I want to maintain a distribution over the
state of the system given everything I've seen thus far. So I start with some
distribution of where I think the system started. I'll represent it like this.
And then at some real-valued time later, I observe this state's blue and that
state's red. So what do I do? I need to propagate this distribution forward,
and then I need to condition it on the evidence. This is a standard
propagating forward.
I do have animations. Who put that in there? So then later I'll observe
something else, I'll propagate that forward, and then I'll condition that and
I'll continue on. So I might have some other evidence. I'm only going to talk
about point evidence. This works for non-point evidence, but let me just say
it works and we'll move on.
So there are a couple things to note here. The conditioning is standard distribution conditioning. You have some distribution, you just don't allow it
to be certain values and you renormalize. This propagation is by the matrix
exponential. If I represent this as a joint vector, I just multiply it by the
matrix exponential and I'm good.
So this is essentially the step I want to concentrate on. So I'm not going to
calculate the matrix exponential directly. I'm going to instead calculate its
premultiplication by vector, because it's more numerically stable. Much like
it's better not to take a matrix inverse, but instead solve a linear system for
the particular thing you're going to multiply your matrix inverse by.
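Off the shelf, SciPy exposes exactly this action-of-the-exponential computation; a sketch of one propagate-and-condition step with made-up numbers (note the transpose, since p is a row vector; the talk develops the uniformization route below because the factored case can't afford the dense version):

import numpy as np
from scipy.sparse.linalg import expm_multiply

Q = np.array([[-1.0, 1.0], [3.0, -3.0]])
p = np.array([0.9, 0.1])          # current filtered distribution

# Propagate: p(t) = p exp(Q t), computed without forming exp(Q t) itself.
p_t = expm_multiply(Q.T * 0.8, p)

# Condition on point evidence, say "the state is not state two":
p_post = p_t * np.array([1.0, 0.0])
p_post /= p_post.sum()            # renormalize
print(p_t, p_post)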
So the question is how do you compute that? And Moler and Van Loan have this great paper called "Nineteen Dubious Ways to Compute the Exponential of a Matrix." In fact, it's such a good paper that 25 years later, they wrote "Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later," okay? If you're really interested, you should read it.
It basically says there's no good way to calculate the matrix exponential.
It's just not one of those computations that's amenable. So the Taylor expansion is the most obvious thing, and it's unstable. Why is it unstable? The Q matrix is negative definite -- negative diagonal elements [indiscernible] negative definite. So I have a Taylor expansion that alternates signs, and we know you don't want to estimate something with a Taylor expansion that alternates signs.
So I'm going to show you essentially how to use uniformization to solve that.
There are some other methods, Krylov subspace approximation and this
integration, which we've played around with. But I'm going to build this off
the Taylor expansion and uniformization. So let me talk about that just quickly. I do think this is interesting.
So I'm going to take my continuous time system, and I'm going to convert it
into a discrete time system. But there are a couple different discrete time
systems you might be thinking of. I could imagine I could choose to
essentially discretize time at some rate and calculate the equivalent -- okay.
I'm not going to do that. The other is I could talk about the embedded Markov
chain. That is, I don't care when things happen. I just care what sequence of
events happened. I'm not going to do that either. I'm going to build a
different one.
So I'm going to let my Q matrix be equal to some scalar -- that doesn't look like the -- yes, that's correct -- some scalar times a stochastic matrix M minus the identity matrix. Or, put differently, I'm going to build a stochastic matrix M by taking Q, dividing it by some scalar, and adding the identity matrix.
And provided my scalar is greater than or equal to the absolute value of the biggest element in the matrix, the resulting M is a stochastic matrix. It amounts to the system in which I sample times from an exponential distribution with rate parameter alpha, and then at each time, I sample the next state of the system from the stochastic matrix.
So I can have self-transitions on that stochastic matrix, because it might be
that that wasn't enough time to actually get a generation of a next event.
Okay. I didn't come up with that. That's old.
So now, if I have P times e to the QT, that can be broken up like that. So in general, this is a sum of two matrices, and in general, e to the A plus B is not e to the A times e to the B. Alas, [indiscernible]. But if they have the same eigenvector structure, then it is, and the identity matrix has any eigenvectors you'd want, so these two [indiscernible] like that. So this is a scalar, this is a scalar. Just precompute that, and then this here I can do with the Taylor expansion on M. And now M is positive, so this doesn't have alternating signs, and I'm okay.
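A sketch of uniformization on a made-up two-state Q, checked against the direct matrix exponential:

import numpy as np
from scipy.linalg import expm

Q = np.array([[-2.0, 2.0], [1.0, -1.0]])
p = np.array([1.0, 0.0])
t = 1.5

# Pick alpha at least as big as the largest |diagonal|, so that
# M = Q/alpha + I is stochastic (non-negative, rows summing to one).
alpha = np.abs(np.diag(Q)).max()
M = Q / alpha + np.eye(2)

# p exp(Qt) = exp(-alpha t) * sum_k ((alpha t)^k / k!) p M^k:
# every term is non-negative, so nothing alternates in sign.
v, total, coeff = p.copy(), np.zeros_like(p), 1.0
for k in range(1, 60):
    total += coeff * v
    v = v @ M                 # the repeated vector-matrix products
    coeff *= alpha * t / k
print(np.exp(-alpha * t) * total)
print(p @ expm(Q * t))        # agrees with the direct computation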
So far so good. That's great. So the essential calculation then is: the first element is P, the next element is P times M, the next is P times M times M, and then P times M times M times M. So I essentially need to compute this. Remember, M is big. M is 2 to the N by 2 to the N, so I don't want to do that. In fact, even if Q has compact structure, and M will have the compact structure, multiplying by it will destroy any structure that might have been in V. So you might have had some nice structure in V; multiply by M and it will essentially destroy that. Okay.
So there are a number of things in the Markov chain literature that deal with
using some sparse representation, which is great for tightly coupled systems.
I want to use a factored representation, one more similar to, like, say the BK
algorithm for Bayesian networks. And that's good for sort of more loosely
coupled systems. So let me show you basically what happens.
I have P. I want to compute P times E to the QT. Here's an exact way of doing
that. Well, if I have an infinite computation time, I take P and multiply it
by M and multiply that by M and multiply that by M and then I sum all those
guys up with the appropriate weights from my Taylor expansion.
There we go. Okay. So I'm not going to do that. Since I've been filtering for a while, I don't have an exact answer, so I'm going to start with some approximate answer here. I'm going to multiply by M, but then the result is going to be too big to represent. So before I even compute the result, I'm going to project it back onto the space of distributions that are completely factored.
And then I'm going to continue with that: multiply and project, multiply and project, all the way down. And I'm going to sum all those up. And the question is -- this is what I wanted, this is what I started with, this is what I actually computed -- can I say anything about how these two things relate to each other? Can I bound some sort of error here? A sketch of the multiply-and-project loop is below.
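A toy version for two binary variables (the joint M is made up; the projection keeps only the two single-variable marginals, as in Boyen-Koller):

import numpy as np

def project_factored(p):
    """Project a joint over (A, B) -- ordered 00, 01, 10, 11 -- onto the
    product of its two marginals."""
    pj = p.reshape(2, 2)
    return np.outer(pj.sum(axis=1), pj.sum(axis=0)).reshape(4)

M = np.array([[0.7, 0.1, 0.1, 0.1],   # a made-up uniformized joint matrix
              [0.1, 0.7, 0.1, 0.1],
              [0.1, 0.1, 0.7, 0.1],
              [0.1, 0.1, 0.1, 0.7]])
alpha_t = 2.0
p = np.outer([0.6, 0.4], [0.7, 0.3]).reshape(4)   # factored starting belief

total, coeff = np.zeros(4), 1.0
for k in range(1, 40):
    total += coeff * p
    p = project_factored(p @ M)       # multiply, then project back
    coeff *= alpha_t / k
print(np.exp(-alpha_t) * total)       # the approximate pushed-forward belief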
The error shows up in three cases. So there's some error I started off with -- I'm going to talk about the KL divergence error, okay. The KL divergence error in expectation goes down as you condition on things, unlike the L2 or L1 error, okay, and I'm going to be conditioning. So there's some KL divergence I started with here. This step is an approximation of that step, so I introduce some error there.
I introduce that error multiple times. I sum up a bunch of these things, which
also introduces some error and I didn't do this for an infinite amount of time,
which I was supposed to do, okay. The saving grace here is that M is a
stochastic matrix. So with any stochastic system, over time, it tends to
couple. That is, it loses its memory.
So that means that as I multiply these things together, if I start with something that's approximate, then over time I'll actually end up towards the same thing. If I let a stochastic system run for a long time and it's an ergodic system, I'll end up in the stationary distribution. I've completely forgotten where I came from.
Okay. So that means that if I propagate through M, there will be some -- that's supposed to be a subscript -- there will be some contraction rate by which my KL divergence shrinks. And the projection here can be bounded by a constant, and so essentially, the good news is that if I have a multiplication, a contraction rate, and a constant error, a geometric series converges.
So the complete bound looks like this. That's lovely, isn't it? Yes, okay.
So let's see. I started off with -- you don't want to see the proof? Okay.
So I started off with KL divergence I began with. That contracts by some
global contraction rate. This is gamma prime, not gamma. I'll explain the
difference in a moment. There's an additive factor here, and then this just
comes out from the fact I truncated the Taylor expansion. This term, in
practice, is very, very small. At least if you're willing to spend a little bit of time at it. So there are basically two questions here. The first is what's
gamma prime, and why is it not gamma the contraction rate for the whole thing.
And the second is, why can't I just use the Boyen-Koller analysis for DBNs,
which essentially does a similar thing. And let me see if I can just quickly
say what it is.
So the contraction rates. You can think of each local variable having its own
contraction rate. So we're going to build off of that. And the reason I can't
use the DBN thing directly is that when I do this uniformization, I don't end up with a DBN. I end up with a mixture of DBNs. And so Boyen-Koller doesn't exactly apply there.
So one interesting thing is that the per-step contraction rate scales as one
over N with the number of variables. It's actually not good. But if I take
the entire process of pushing this forward, the whole process contraction rate
does not scale with the number of variables. It's constant.
So the details are in the paper. Okay. So let me talk about somebody else's
work. Yes, I'm good. So we'll talk about somebody else's work. So mean field is this other method, right, for [indiscernible] approximate distribution. I take this distribution -- this is a distribution over all the processes -- and I approximate it in a factored form. So it's the product over a set of local distributions.
So in their work, they represent each of these Qs as an inhomogeneous Markov process. So there are a number of different ways of parameterizing inhomogeneous Markov processes. This is the one that works for them. Mu at a particular time is a vector. It's the marginal distribution at that time.
And the other natural thing would be to have the local Q matrix at that time, which is a function of time. That's why it's inhomogeneous. But instead, what they do is something a little different. It's sort of like the density of transitions. And I'm not going to get into the details exactly why, but it's certainly related to Q. Okay. So the algorithm is: you pick a bunch of Qs, you hold all the others constant, you pick one of them, and you try to minimize the KL divergence between your approximation and the true distribution. It's a variational approach.
Okay. So then you work through a lot of math, and what do you get? You get that your new mu_i -- I'll just do the mu_i; I won't do the gammas -- your new mu_i at a time is some function of the current mu_i, your gammas that you've already computed, and then sort of the processes in your Markov blanket. Okay. So why is this good?
This is a differential equation I have to solve, but that's good, because again, I can use some sort of adaptive integration method here. So I pull out, you know, this particular process, and to go estimate its distribution, I do some adaptive integration. That means at certain times, I take large jumps. At other times, I work at a small scale and integrate carefully. This also means that each variable has a different adaptive integration associated with it. So some variables I can reason about very quickly. Other variables, I take time and carefully reason about them, which is good for most systems. I have the weather, which evolves at a much slower rate than the traffic that I'm trying to estimate, than the actual individual vehicles on the road.
Okay. So this representation here ends up being naturally adaptive by variable and by time, and so you're going to save computational effort. Now, I'm
not saying you could not do this in a discrete time model. But I think it
would be much harder to try to figure out how to do it. It wouldn't be as
natural to try to reason about it. I'm going to jump four time steps ahead or
five time steps ahead. You could do it, but it's not as natural.
Certainly, you'd have to take integer jumps. So let me talk about two ways in which we've applied this. The first is to network monitoring. So I have a bunch of computers. They're hooked up to a network, and what I want to do is put something on the NIC here so I analyze the packets that come in and out and tell you whether or not you currently have some malicious thing -- some malicious software -- running on your laptop.
Okay. So I'm going to build a particular CTBN. I'm not learning this
structure, I'm fixing it. Essentially, I'm going to take the traffic and I am
going to separate it by destination ports. We'll assume these are not servers.
These are clients. So, you know, all my web traffic to port 80, all my web traffic to
some other alternate port, all my DNS traffic, et cetera, by ports. I think
I'm going to pick out the top ten ports, or nine ports and one catch-all for
everything else.
Other than separating it by port, I am not going to care about anything except
the exact timings. So I'm not looking at payloads, other than the port numbers,
destination port numbers. I'm going to build this as a plate model. So I
assume that the traffic in general from your computer is generated from some
hidden node that has four states. I'm not going to give any semantic meaning
to those. Those are just states that can couple things over time.
For each port -- there are N ports -- there's a hidden variable
dictating how that traffic's being generated. And hanging off of that hidden
variable, I have four variables, one to indicate packet came in, one to
indicate a packet came out, one to indicate a connection was started, one to
indicate a connection was stopped.
>>: The first two, the precise time when the packet is observed?
>> Christian Shelton: That's right. So these are timing events here. Packets in, packets out.
So we looked around, we found two datasets. The MAWI data set is some Pacific backbone data that comes from Japan. In this study, we took the ten most active IPs. We assumed that that's all
the traffic that's being generated from that IP, which is clearly false from
this data set. Anything that came from Japan, we didn't see. Anything that
went to somewhere else in Asia, we probably didn't see. LBNL has some
enterprise traffic. I don't know what enterprise network it's from. It might
be LBNL's network. They might have gotten it from someplace else. I can't
remember.
So we took that. This is at the routers inside the networks. So this is
probably a reasonable approximation of everything that happened for those
hosts. We, I think, split the data 50/50. We trained on the data assuming it
was clean. So we built a model of what the normal traffic is that comes out of
this computer. We took the test set data, and at certain periods of time, we inserted worm traffic -- from running a worm, gathering the traffic that comes off, and inserting it in there. Now, these worms are pretty easy to find. They tend to just go [indiscernible] and spam a bunch of packets, so we scaled them back down to one percent or 0.1 percent of their natural running rate so they blend into the background, to make a more difficult problem.
And then, over a sliding window of 50 seconds, we calculate the probability under our model of that 50-second window of events, conditioned on everything seen thus far, okay. And if that probability is too low, we say that's abnormal, that's strange. Something strange happened in this window.
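The detection rule itself is simple once the filter supplies the window scores; a sketch with made-up numbers:

import numpy as np

def flag_anomalies(window_logprobs, threshold):
    """window_logprobs[i]: log P(events in 50-second window i | all prior
    evidence), as computed by the CTBN filter; low scores get flagged."""
    return [i for i, lp in enumerate(window_logprobs) if lp < threshold]

scores = np.array([-40.2, -38.9, -41.0, -95.6, -39.3])   # made-up scores
print(flag_anomalies(scores, threshold=-60.0))           # -> [3]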
Okay. So here are our -- let's see. They're ROC curves. So here are ROC
curves. These are the two datasets. These are three different worms of
various forms. Our line is the black line that's on top. I'm shocked. Okay.
So that's our model. Notice the false positive rates here go from zero to
0.1, and the true positives go from zero to one. We compare it against a
number of other standard machine learning techniques. This dashed green line,
which you may or may not be able to see, is nearest neighbor based on some
features proposed in the network literature. Actually, it was a paper in the
network literature.
The one that actually beats us at one point is connection counts: just count
the number of connections and if it's too big, definitely something strange is
happening.
Let's see. This is a Parzen window density estimator, sort of built on the same thing as nearest neighbors, and the purple one is an SVM with a kernel designed for this sort of anomaly detection.
Okay. So there's an example of using those to detect network traffic. We've
also used it to detect where it came from. So we took the same ten hosts and
then we took a 50-second window of traffic and asked it to say which, under
which host was this most probable. So imagine they all sit behind NAT, right,
and we can fairly accurately describe which host it came from. So we can do
host identification.
>>:
Can you explain why for this specific worm you --
>> Christian Shelton: Yeah, so why the LBNL data and the MyDoom worm? Yeah, we looked at it. It wasn't entirely clear to us. I agree that's strange. And we couldn't figure out -- why did we want to know? We want to know because then we can improve our method, right. What is it we don't know? It wasn't clear just exactly what that combination was doing there, because you notice if you unilaterally change either of the two dimensions, you do fine. It wasn't clear to us. There were millions of packets, so we couldn't go look at them all, but at least initially, we don't know.
>>: If you use the [indiscernible] discrete time version, do you produce something close to this?
>> Christian Shelton: Sure. So the CTBN is the limit of the DBN as the time width goes to zero. But it's computationally more efficient; as the time width goes to zero, the DBN's computational time blows up.
>>:
[inaudible].
>> Christian Shelton: Here, you mean like this? So the answer is essentially
that as you vary the threshold, you suddenly grab a bunch of -- you suddenly
grab a bunch of the traffic and that's all you can get. That make sense? So
some of the time windows, the same threshold will instantly push you across
them. That wasn't helpful. So the threshold is on the probability of that
window, right. So certainly, I drop the threshold over here, the probability
is really, really high. Really, really low, sorry. The probability is really,
really low. As I increase that probability -- it's one of the two. As I
increase that probability, basically I move from here to here. There aren't
sort of -- there aren't very many operating points in between.
>>:
But a simple computation, what [indiscernible] comparable complexity?
>> Christian Shelton: Yeah, that's a good question. We didn't do that. So one thing you'll notice is that you have to use an approximate inference method here. We're using a [indiscernible] blockwise particle filter, actually. So I didn't tell you -- that's another one you can use. And if I went off and implemented the same thing in discrete time, you could have distinct qualms about how I chose to implement that particular one. So I don't know what the equivalent one would be.
But one thing is the rates here do change drastically. The computer's off, the
rate of events happening is very, very slow. Then the computer comes on, the
rate of event happening is very, very fast so you'd have to have a pretty small
time window to capture a lot of the stuff that happens here, because there are times when, if I'm really capturing every packet, they're only microseconds or milliseconds apart. And if I wanted to
time slice at that width, this would be intractable in a DBN.
>>: The question is [indiscernible] is the performance approaching this
performance?
>> Christian Shelton: Yes. As I said, certainly, if I took the time slice width to go to zero -- and I had, well, I don't have that much computation time this year, but if I did -- I would get these results. The CTBN is truly the limit of the DBN as the time slice width goes to zero.
>>: It's very important to know whether the continuous time really pays off under comparable computational --
>> Christian Shelton: What I'm saying is the only comparable one I have here is where I time sliced at the smallest width between two events. So I have events.
If I have a packet emission and another packet emission, and I want to capture them both in the DBN model, I have to sample at that narrowest width. If I sample at that rate, there's no way I can compute this in a year or two.
>>:
You increase the sampling rate but you aggregate between the sample times.
>> Christian Shelton: So I can do that. So I can try to aggregate. Then I
have a DBN that's a little different, right. Then I'm saying it's Markovian in
the number of samples that have happened between here and here, and then you
have a different model than I have. So then I can't -- if you're just talking
about a comparison on a computational point of view, I can't make a comparison
there, because you're saying it's Markovian in the number of samples that have
happened in the past time width, and I'm saying it's Markovian, right, in this
global state that went on. So you really have a different kind of model.
This is the hard part about, like, comparing the two is if I time slice it
finely enough, I can't compute the DBN one. And if I don't, if I do something
like that, now we have sort of Markovian assumptions and, you know, yeah, we
have other problems comparing.
>>:
[inaudible].
>> Christian Shelton:
Yes, so all of this was taken from one week.
>>: So I don't understand -- the traffic at different times of the day might be very different. How would you detect that?
>> Christian Shelton: So that's why the hidden variable is here. So we don't
automatically do anything about it. We're hoping the hidden variable captures
that kind of semantic meaning. That is, from the past window, I'm going to have my current state, have some estimate of, let's say, G that captures the fact that it's currently, you know, Monday afternoon and things are different Monday
afternoon.
>>:
So were four states enough to capture --
>> Christian Shelton: I'll say, yeah, four states were enough and it was also
enough that we could do the computations. So it was this balance between
expressiveness and computational power, yeah.
Now, again, this is traffic across, I think it was a week. It might have even
been shorter than that. So I don't want to claim some broad thing about, yeah,
this would work across months or something like that, yeah. This is also done
about four years ago, when our ability to do approximate and exact inference wasn't as good. So I think we can crank these numbers up now
with better numeric algorithms, yeah.
So last one is social networks. So a lot of people look at static social
networks. In fact, a lot of really smart people are looking at static social
networks. So I don't do that, because I don't want to compete with really
smart people.
So actually, a lot of smart people are looking at dynamic social networks too,
but there's fewer of them. So here's the idea is that I'm monitoring the
communications, let's say, in a social network. Either I see people's emails
or I see people's phone calls or I see people's Facebook postings or whatever
it is, okay, depends on what institution you live in, which is a reasonable
model.
And what I want to do is estimate the changing underlying social network.
Okay. All right. So what we do is we basically build a generative model of the
social network, of the actors' internal parameters, and of the observed
communication patterns. We take that model, we condition on the observed
communications we actually saw, and we try to reason about what the social
network might be.
So we call this the hidden social network model. It's built on some work in
sociology; sociology has been looking at social network models for a long
time, and they even have continuous time Markov models of how social networks
might change. We took the one from Snijders: the network attribute
co-evolution model. So it essentially says that the network evolves.
So links between two people change based on the attributes of those people.
So if I smoke and you smoke, then there's a higher chance we'll form a
friendship than if not. And my internal attributes, like whether or not I like
football, might change based on whether or not my friends like football.
Okay. So the network attribute co-evolution model broadly looks like this: two
kinds of variables. Y_IJ is whether or not there's a directed link from I to J
at a particular time, and Z_I is the attribute of actor I at a given instant.
Yep.
>>: How do you define the network that there is [indiscernible] so definitely
if we talk now, there is --
>> Christian Shelton: So I'm going to define the model, but I'm not going to
observe that variable. Make sense?
>>:
So how would you verify it?
>> Christian Shelton: So I'll talk about the verification in a moment. It
gets a little tricky, yeah. Yes, in fact, I'd love to have a better dataset in
which to do it. But I'll show you what we can do.
Okay. So the model from Snijders is best described as sort of a forward
sampling model. Every actor has a rate of change. When their rate comes up,
you know, the event fires. They look at their current network and their
current attribute -- their local view, who they're friends with, and their
local attribute -- and they consider every unilateral change: I make or destroy
one friendship, or I change my attribute by one value.
>>: This is the [indiscernible].
>> Christian Shelton: This is continuous time.
>>: So one person will be one node?
>> Christian Shelton: That's right, yeah. So I compute those utilities. Some
are bigger than others. I put them essentially into a Boltzmann distribution,
which is basically a softmax -- including the state I'm currently in -- and I
pick one from that softmax. So if I'm currently at a local optimum, a maximum
in this case, I tend not to move away from it, but I might. That's the model.
The only question is what does this utility function look like. He essentially
proposes it should be a linear function of some things, and the ones we use are
popularity, number of links, similarity of your attribute to your friends',
stuff like that. Okay?
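To make the forward-sampling choice step concrete, here is a minimal sketch of one actor's Boltzmann/softmax move. The utility below uses only two of the terms mentioned (reciprocity and attribute similarity), and all names and weights are illustrative assumptions, not the actual model code.

```python
import numpy as np

def utility(network, attrs, actor, weights):
    """Toy linear utility over a directed 0/1 adjacency matrix:
    reciprocity (mutual links) plus attribute similarity to friends."""
    n = len(attrs)
    recip = sum(network[actor, j] * network[j, actor]
                for j in range(n) if j != actor)
    sim = sum(network[actor, j] * (attrs[actor] == attrs[j])
              for j in range(n) if j != actor)
    return weights["reciprocity"] * recip + weights["similarity"] * sim

def choose_change(network, attrs, actor, weights, beta=1.0, rng=None):
    """Score the status quo plus every unilateral link toggle for `actor`,
    then sample one option from the softmax over those scores."""
    rng = rng or np.random.default_rng()
    n = len(attrs)
    options = [None] + [j for j in range(n) if j != actor]  # None = no change
    scores = []
    for j in options:
        net = network.copy()
        if j is not None:
            net[actor, j] = 1 - net[actor, j]  # make or destroy one friendship
        scores.append(utility(net, attrs, actor, weights))
    scores = np.array(scores, dtype=float)
    probs = np.exp(beta * (scores - scores.max()))  # numerically stable softmax
    probs /= probs.sum()
    return options[rng.choice(len(options), p=probs)]
```

Because the current state is among the options, an actor at a local utility maximum tends to stay put but can still move, matching the stickiness described above.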
All right. So essentially, I have one variable for every possible link in
this network. So there are N squared variables. So if I have just ten actors,
that's roughly 100 variables. They're all binary. That's a state space of two
to the hundredth. So I'm definitely not representing this thing exactly. All
right. And then to add communications, okay, so there's a CTBN that describes
the relationship between these. It's kind of hard to describe; it essentially
involves context-sensitive independence. So I'm not going to describe it, but
the social model I just described essentially amounts to a CTBN.
So what I do is add a communication variable here. And it's tied only to these
two. So this is the communication pattern between I and J, and it depends only
instantaneously on whether or not I considers J to be a friend and whether or
not J considers I to be a friend.
So this might be, you know, they might have a few states like they're calling
each other, they're not calling each other, sending a text message, sending an
email. You know, that sort of thing, so you have a number of states about what
the communication is at any given instant.
So these change fairly rapidly. These change certainly much less rapidly over
time.
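As a toy illustration of the structure just described, the communication variable gets one intensity matrix per joint setting of the two friendship indicators. The rates below are invented for illustration, not the learned parameters.

```python
import numpy as np

# Conditional intensity matrices for C_ij (0 = silent, 1 = in a call),
# one per configuration of (Y_ij, Y_ji). Rates are per day and made up.
comm_rates = {
    (0, 0): np.array([[-0.01,   0.01],     # strangers: rarely start a call
                      [288.0, -288.0]]),   # calls end in ~5 minutes either way
    (0, 1): np.array([[-0.05,   0.05],
                      [288.0, -288.0]]),
    (1, 0): np.array([[-0.05,   0.05],
                      [288.0, -288.0]]),
    (1, 1): np.array([[-0.30,   0.30],     # mutual friends: call every ~3 days
                      [288.0, -288.0]]),
}
# Each row holds the rates of leaving that state; the diagonal makes rows
# sum to zero. The scale gap is already visible here: call endings
# (~minutes) versus call initiations (~days) versus friendship changes
# (~weeks).
```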
All right. So here's the dataset we used. This is the reality mining dataset.
We actually used the first version of the dataset. There's a second, more
complete version out.
Essentially, some people at MIT convinced a number of students to put on their
mobile phones an application that monitored who they called when and when they
sent messages. Actually, it monitored a bunch of other stuff too, but we're
ignoring that part for this one. That was over the course of a year. We kept
everybody in there who essentially had a valid phone number. We don't know the
phone numbers themselves, but if the data was inconsistent in some way, we
threw out anyone who was inconsistent.
We ended up with 25 people from the Sloan business school and 54 people from
the MIT media lab. This is not surprising; those were the two groups involved
with setting up the study. And then 13 people whose affiliation we don't know,
because they were not enrolled in the study. So these are people who did not
choose to be part of the study, but more than one person they knew chose to be
part of the study and called them at some point. Yes? This is important to
understand. So nothing was running on their cell phones, but some of their
friends were blabbing about what they were doing, okay.
So we only used these phone calls and text messages. So we learned a bunch of
parameters. We take all that data, we only observe the communication patterns,
we do EM to estimate the parameters, and I'm going to give you the parameters
and then we'll do something else with them. So first, the network dynamics;
this is from Snijders's model. We get the rates of change; everything here is
in units of days. Well, this one is; these are just unitless numbers. So this
essentially states that you don't tend to make random friends; all else being
equal, you tend not to propose a friendship with someone random. This says
that you really tend to propose friendships to people who are already friends
with you, and that activity and popularity, which are sort of measures of the
number of people connected to you who are connected to somebody else, are not
as important.
An interesting thing is we've tried this same model on other kinds of
datasets. So there's one that has some panel data where they interviewed or
surveyed a set of teenaged girls, like early teenaged girls, in some school
somewhere in Europe. I think it was Scotland. Can't remember where. And they
asked them a number of attribute questions, like year one, year two, year
three, to see how the friendships change.
We actually get kind of similar numbers out here. That was kind of
interesting.
>>: Here the [indiscernible] means the communication.
>> Christian Shelton: This is the rate at which you propose a change to your
network. So that means on average, once every 40 days is what this is.
>>:
So that's part of the Q matrix?
>> Christian Shelton: It's related to it. The relationship is a little
complicated because it involves this. When you go to make a change, you then
score every change you can make: I could drop you as a friend, I could add him
as a friend, or add him, or add him. Each is one unilateral change. I score
them all, and then I roughly pick the max -- the one that gives me the best
score.
And the question is how do I score them. And the answer is that I score the
resulting network according to its density, the local network density, the
reciprocity, the activity and the popularity, and I combine them with these
linear weights.
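One standard way to write that relationship down -- a sketch consistent with the description above, not a formula from the slides: if actor $i$ proposes changes at rate $\lambda_i$ (here roughly $1/40$ per day) and then softmaxes over the candidate networks, the effective entry of the $Q$ matrix for moving from network state $x$ to a neighboring state $x'$ is

$$ q_{x \to x'} \;=\; \lambda_i \cdot \frac{\exp\big(u_i(x')\big)}{\sum_{x'' \in \mathcal{N}_i(x) \cup \{x\}} \exp\big(u_i(x'')\big)}, $$

where $u_i$ is the linear utility (density, reciprocity, activity, popularity, with the learned weights) and $\mathcal{N}_i(x)$ is the set of networks reachable by one unilateral change by actor $i$.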
So this says I tend to prefer -- I tend to move to networks that have
reciprocity in them.
Now, the communication pattern: these are the rates of the communication
patterns. So that was the rate for the underlying social network; here are the
rates for the communication patterns. So this is communication from K to L.
This row is when neither considers the other a friend; this is when K doesn't
consider L to be a friend but L considers K to be a friend; the reverse; and
when they both consider each other to be a friend.
And so if we just take this line here, this essentially means that, in
expectation, they tend to contact each other once every three to four days.
80 percent of those are phone calls, 20 percent are text messages. This was
2004; text messages weren't as popular then, I guess. And this is the end rate
for a conversation: the average conversation ends in about five minutes,
right.
And you notice the rates here differ by huge numbers of orders of magnitude.
I'm not going to be able to capture these things efficiently in a uniformly
time-sliced model.
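As a rough back-of-the-envelope from the rates just quoted (my arithmetic, not a slide): conversations ending in about five minutes give a rate of $1440/5 = 288$ per day; contacts every three to four days give about $0.29$ per day; and network changes every 40 days give $0.025$ per day. The fastest and slowest rates thus differ by a factor of

$$ \frac{288\ \text{day}^{-1}}{0.025\ \text{day}^{-1}} \;\approx\; 10^4, $$

so a uniform slice fine enough for call durations burns on the order of ten thousand slices per friendship change.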
Okay. So then if we fix these parameters, we can go back and ask, okay, what's
our estimate -- essentially a smoothed estimate. I have all these
observations; what's my estimate at this point in time of what the social
network looked like? Okay. And so here's the estimate, for instance, at
August 19, at November 17 -- actually at midnight, because it's a continuous
time quantity -- and at February 15th.
And now, what I'd love is if they'd gone back and asked people what their
friendships were, so I could go validate it. But I don't have that information
from this dataset, and it's hard to find a dataset that has good information
like that. So all I can say is, doesn't this look reasonable? And it's not
very -- yeah, it's not as convincing. I'll perfectly admit that.
So one thing to note is that the algorithm did not know these groups; the
actors were just all given random IDs. And we can see the within-group
connections are denser: the Sloan people know each other and the media lab
people know each other. Business school students are more social than media
lab students.
There's a selection bias there, right. And furthermore, these 13 people whom
we don't know, they seem to be related to the Sloan business school students.
But here's something interesting. I'm estimating social network connections --
this is a heat map, I should make that clear. White means we're pretty sure
there's not a connection. Black means we're pretty sure there is a connection.
Orange is reasonably sure and yellow is not so sure, okay? I'm estimating here
friendships between two people where I've never observed the communication
pattern between them, because their phones were not monitored. I've only
observed when they called someone in this network.
Furthermore, the people who were in this study came in and out of the study.
It's not like I observed them continuously over a block of time. So someone
entered the study here and dropped out of the study here. I'd love to be able
to verify these. So how do I do that at all?
It's not that I observed that they didn't communicate with each other; it's
that I didn't observe whether or not they communicated with each other. But I
have certain notions of what social networks should look like -- in terms of
reciprocity, in terms of communicativity, and stuff like that -- and that at
least gives me an estimate here. Now, I don't know how accurate that estimate
is. These are really hard: they didn't even agree to participate in the study,
so I can't go track them down. But I'd love to know, because there's this one
here that's solid.
>>:
You have the phone numbers, right?
>> Christian Shelton: No, I have anonymized versions of their phone numbers.
They got, you know, one-way hashed onto some number. Call them up: hi, back in
2004, did you happen to know somebody with this phone number? Oops. Oh, no,
that was terrible. Hang on. I'm essentially done here, so let me just go here.
Okay.
So I'll give two plugs. One is that the code for almost all the CTBN
algorithms is available on a website. We're hoping to release a new version of
it soon; the current version that's there is not as numerically fast as we'd
like. We completely redid the whole matrix layer with Eigen, which is a pretty
fast matrix package, and it works much better. I'm also giving a UAI tutorial
on continuous time processes along with Gianfranco Ciardo, who is a professor
in verification, to give sort of the other side of this. These kinds of models
have been used in verification, Petri nets, that sort of stuff. We'll do a
tag-team kind of tutorial. You'll see some of the same slides, but not all of
them, at UAI.
The last thing is, I've tried to at least argue the case for avoiding time
slicing. There are certainly some cases where your data is naturally time
sliced. So if you want to model daily high temperatures, there's a natural
time rate for that, right? Day by day. In fact, it isn't a continuous sort of
thing, okay. So there are certainly cases where discrete time is the only way
to go. But if the underlying process is continuous time, I think you should at
least admit it -- just like, you know, I'm going to go implement this algorithm
on a computer, but I nevertheless pretend I have infinite precision floating
point numbers when I go to analyze the algorithm and develop it.
>>:
[inaudible].
>> Christian Shelton: That's harder to say. You notice I haven't covered
continuous state; this is all discrete state. We've done some work on
continuous state. The Kalman filter has a continuous-time version, and these
sorts of things are stochastic differential equations. And, you know, classic
option pricing is built on stochastic differential equations; they do exactly
this. They treat it as a continuous time process, yeah.
>>:
[inaudible].
>> Christian Shelton: I've only done discrete state. We can talk later about
the continuous state one, but there's a finite amount of time in the talk.
>>: You probably answered this throughout the talk, but I'm not very fluent in
this. So you've made a good case against sampling time because things have
been rapidly [indiscernible] for a while. But what's the objection to a model
where, every time an event occurs, that's my time step?
>> Christian Shelton: So you're saying what I could do is build the underlying
Markov process on the -- it's called the skeleton, the underlying skeleton,
right. You can do that. The question is you might care about the timings. So
you say I go from state one to state two to state one to state two to state
one to state two. But I might care that when I'm in state one, I stay there
three times as long as when I'm in state two.
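A minimal sketch of that distinction (toy Q matrix, not from the talk): a Gillespie-style forward sampler for a continuous-time Markov chain separates exactly these two pieces, the embedded skeleton of states visited and the exponential dwell times.

```python
import numpy as np

# Toy 2-state chain: state 0 is left at rate 1 (mean dwell 1.0),
# state 1 at rate 3 (mean dwell 1/3), so dwells in state 0 are three
# times longer on average -- information the skeleton alone loses.
Q = np.array([[-1.0,  1.0],
              [ 3.0, -3.0]])

def sample_ctmc(Q, x0, t_end, rng=None):
    """Forward-sample a CTMC; return (skeleton states, dwell times)."""
    rng = rng or np.random.default_rng()
    t, x = 0.0, x0
    skeleton, dwells = [x0], []
    while True:
        rate = -Q[x, x]                    # total rate of leaving x
        dt = rng.exponential(1.0 / rate)   # memoryless dwell time
        if t + dt > t_end:
            dwells.append(t_end - t)       # censored final dwell
            return skeleton, dwells
        t += dt
        dwells.append(dt)
        p = Q[x].copy()                    # embedded (skeleton) chain:
        p[x] = 0.0                         # jump probs are the off-diagonal
        p /= rate                          # rates normalized by the exit rate
        x = int(rng.choice(len(p), p=p))
        skeleton.append(x)
```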
>>: Okay. [indiscernible] of the variable, feature of your --
>> Christian Shelton: You can do that. You don't end up with a Markov process;
you end up with maybe a -- yeah, a semi-Markov process or something else like
that. One of the large drawbacks of Markov processes, discrete time or
continuous time, is that the dwell time in a state has to be either
geometrically or exponentially distributed.
The geometric and exponential distributions are the only memoryless
distributions: if you condition on having been in a state for some amount of
time, the amount of time remaining still has the same distribution. That's
what it means to be Markovian. So if you want something that is not that way,
you have to move to at least a semi-Markov process. We've done that a little
bit. I haven't shown it here. We've done things where the rates vary
cyclically, say, based on the time of day.
You can incorporate that in here to make sort of these semi-Markov processes.
But yeah, if all you care about is the sequence of events and not their
timings, then that's right: you have a discrete time problem and you should
treat it as that. I'm not going to argue against that. I don't know if that
answered your question, or did I successfully skirt around your question? I'm
not trying to skirt around it.
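To be concrete about the memorylessness property mentioned a moment ago (a standard fact, not from the slides): for an exponential dwell time $T$ with rate $\lambda$,

$$ P(T > s + t \mid T > s) \;=\; \frac{e^{-\lambda (s+t)}}{e^{-\lambda s}} \;=\; e^{-\lambda t} \;=\; P(T > t), $$

so no matter how long you have already been in a state, the remaining dwell time has the same $\mathrm{Exponential}(\lambda)$ distribution; any other dwell distribution forces at least a semi-Markov model.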
The other thing is about observations -- here's the other way to think about
it. I have observations that are tied to times, usually. I ask a sensor what
its value is at a particular time; I don't ask how many events have happened
since the beginning of time, so that I know how to place it on a timeline. So
you're talking about a model in which, if I made observations, I'd need to know
the number of events that have happened, whereas it's more naturally the amount
of time that's elapsed. Maybe that's a different answer also.
>>: How does all this relate to this kind of point process?
>> Christian Shelton: You can build these things off of Poisson point
processes.
>>: So the [indiscernible] you talk about --
>> Christian Shelton: Is directly related, yeah, to Poisson point rates, yeah,
um-hmm. Yep. Any other questions?
>>:
You mentioned your [inaudible].
>> Christian Shelton: How big can I build this up to? Well, it depends. The
social network, right, has a very large state space. If you don't include the
communication variables -- you just include the other ones, because the
communication variables I essentially always observe -- then the state space is
two to the 97 squared minus 97. There are 97 people, okay, so 97 squared minus
97 possible arcs, and two to that is the state space there.
That's big. I'd argue that's decent sized. We're doing sampling in this case,
and there's a lot of internal structure. Essentially, a few rates are
governing a lot of what happens. I assume people are essentially homogeneous,
okay.
For things in which that's not the case, you can do exact inference. You can
do exact inference for, oh, at least somewhere between 10 and 15 variables,
okay. And then how well you can do approximate inference after that sort of
depends on how much time you're willing to throw at it, and what fidelity of
answer you need. So you can go up to somewhere between 10 and 100 variables
easily, sort of depending on that. And beyond that, you probably have to rely
on something else currently.
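A rough sketch of where that 10-to-15-variable ceiling comes from (my arithmetic, ignoring the sparsity and structure that exact CTBN methods do exploit): the flattened joint process over $n$ binary variables has a $2^n \times 2^n$ intensity matrix, and exact inference operates on it.

```python
# Memory for a dense 2^n x 2^n matrix of 8-byte doubles, in GiB.
def joint_matrix_gib(n_binary_vars: int, bytes_per_entry: int = 8) -> float:
    dim = 2 ** n_binary_vars
    return dim * dim * bytes_per_entry / 2**30

for n in (10, 15, 20):
    print(f"n={n}: {joint_matrix_gib(n):,.1f} GiB")
# n=10: ~0.0 GiB (8 MiB)  -- easy
# n=15: 8.0 GiB           -- borderline
# n=20: 8,192.0 GiB       -- hopeless; hence sampling and approximations
```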
I'll say also that each year, we get a little bit better about figuring out how
to make our approximate methods a little better. So I wouldn't be surprised if
in a couple years I can come back and say we can do a thousand variables
without a problem. But currently, that's probably not feasible for our
software.
>>: Can you parallelize it?
>> Christian Shelton: The sampling ones are ridiculously easy to parallelize,
yeah. Some of it can be parallelized, yes. So some of the learning can be
easily parallelized. The first one I showed, where you're pushing forward and
getting the approximation, there you can parallelize the matrix
multiplications, but it's at a much finer grain, harder to do, yeah. It would
certainly be feasible. We haven't yet looked at that, but yeah.
Great.
Thank you very much.