>> Asela Gunawardana: So hi, I'm Asela and I'm going to talk to you about models for temporal
event streams. This is of course work with a bunch of great collaborators. To start off what do I
mean by temporal event streams? I simply mean streams of events that have both a
timestamp and a label, so in this little cartoon I have kind of a simple one. There are two kinds
of labels, blue and red and over time the events occur and some of them are blue and some of
them are red. This is a cartoon view of a bunch of kinds of data that we actually care about like
search query logs, webpage visits, Facebook check ins and data center event logs. For example,
with search queries instead of having just blue and red we would have a whole bunch of
different possible labels corresponding to different queries, but they all have timestamps.
What we want to model is what events are going to happen and when they are going to happen
and this would depend on what has happened in the past and when it happened in the past. So
it's not just what, but when and I'll keep repeating this and hopefully you will eventually see
how this is different from what's been done before. Why do I want to do this? Well, one
reason to do it is to simply understand what's going on in the systems. For example, imagine
that we have a data center and all of the machines in the data center log all of the errors that
they spew out and that they have some kind of automatic management system that also logs
everything that it does, and the data center is humming along and there are some errors. Then
the management system decides to maybe install patches on some machines and reboot some
other machines and re-initialize some other machines and eventually every so often it gives up
and somebody's pager goes off. And they come in, get in a little golf cart, go to the
machine and actually putter around with it, and that process is actually really expensive. I
might want to know why I was sending a technician to the machine. If I look at the error logs
on that machine what I see is a complete mess, a bunch of different events happening over
time, and at about three o'clock in this case we sent a technician to the machine. Well, why?
Well, you could painstakingly go event by event and try to figure out what's going on or if you
had a model that understood the dependencies between these events, it could show you that
these are the two key events in the past that caused the technician to go there. The C drive
failed; the automatic system tried to reimage the machine and that didn't help, so the pager
went off. Another thing you might want to do is try to predict the future, so say that you are
running a search engine and you have users putting in queries so you kind of have an idea what
they are interested in and an advertiser comes to you and says I'm an advertiser in the health
and wellness industry, can you please show this ad? Plaster it on the top of Hotmail, but just
to people that are actually going to care about health and wellness. I don't want to wait until
they issue the query. I want to show it to them the whole week before or after. You want to
figure out which of the users actually is going to care. So you've been watching a couple of
users. One of them searches for Elizabeth Edward’s will and another one searches for pellet
guns and then they search for camo, and the next day they search for online education, and
then they search for Valentines and they misspell it and then they immediately correct
themselves. The other one, two days later, searches for snow pictures and then Versailles
Palace, and this is when the advertiser comes to you and says show your users this health and
wellness ad. You're sitting there scratching your head saying who should I show it to? So how
many people here think that we should show the ad to user A? Only one person has an
opinion, okay two, user B? You guys may be right. You may be wrong. But let's see what
happens. The first user searches for Amazon and then for chairs on sale, then adoption and the
Dave Thomas Foundation which I had to look up. It actually works in adoption. And then those
of you that said user B, you were right, exercise equipment and high blood pressure. What we
would want is models that can actually make these predictions. Maybe we could crowd source
it, right? [laughter] But then show the ads to the right people. One of the challenges when we
are trying to build these models is that data is really bursty, right, so when the person corrected
their spelling on Valentines and when they search for pellet guns and camo, those queries came
within tens of seconds of each other and then sometimes we didn't see him (saying him,
maybe I'm wrong because of the pellet guns and the camo) for five days, four days, so the
time spans can vary from seconds to days. This is a challenge if you are using discrete time
methods like DBNs, HMMs or Markov chains because well, you could ignore the times, but I
think it's pretty easy to argue here that the times are kind of informative. Or you can decide to
sample every 10 seconds or so, so as not to lose any information, but now you've got these long
stretches of tens of thousands of steps where nothing happens. This makes it really hard to
learn the dependencies involved, and it makes it hard to do inference or forecasting forwards
over tens of thousands of steps. So a different approach is to do the modeling directly in
continuous time and there has been a bunch of work that has tried to do that. We actually
tried a few of these. It turns out that a bunch of these methods restrict the kinds of
dependencies you can have on the past. It wasn't clear whether the assumptions that were
chosen for those particular models actually matched the data that we care about. Another
problem is that we want to operate on Bing query logs which are a little big. A lot of these
methods don't scale very well to that kind of data, and so what we want is first to make less
restrictive assumptions about the data, and second still be able to scale to these huge data sets.
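To make the step-count concern with discrete-time methods concrete, here is a small sketch. It assumes a hypothetical stream of three queries (two within 20 seconds of each other, the third five days later) binned at 10 seconds. The function name `discretize` and the numbers are illustrative and not taken from the talk.

```python
def discretize(event_times, bin_seconds):
    """Turn continuous event times (in seconds) into per-bin event counts."""
    n_bins = int(max(event_times) // bin_seconds) + 1
    counts = [0] * n_bins
    for t in event_times:
        counts[int(t // bin_seconds)] += 1
    return counts


# Hypothetical user: two queries 20 seconds apart, then silence for five days.
FIVE_DAYS = 5 * 24 * 3600.0
counts = discretize([0.0, 20.0, FIVE_DAYS], bin_seconds=10)
empty = sum(1 for c in counts if c == 0)
print(len(counts), "steps, of which", empty, "are empty")
```

Almost every step in such a discretized sequence carries no event, which is the long stretch where nothing happens that a discrete-time model would have to step through.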
So what I'm going to tell you about is a technique we came up with to do that using a modeling
technique involving conditional intensities and I'm going to take a little detour to explain what
those are. To be a little more formal about the data, mathematically the process
we are looking at is a marked point process. It's a point process because it's generating
times and it's marked because each time is marked with a label telling you what happened at
that time. You can write a likelihood for a marked point process forward in time predicting
the time and the label of each event given everything in the past. What's really cool is marked
point process theory tells us that we have this conditional intensity representation for any
marked point process. Those of you who have looked at Poisson processes might notice that, if you squint
a little bit and hide the conditioning on history, it looks a lot like a Poisson process. But let me
quickly give you a sketch to convince you that that representation is possible. Say you just had
a density over times. You could then define the intensity as the density divided by one minus
the cumulative distribution function, the CDF. Now the derivative of the CDF is the density, so
multiplying both sides by the denominator you get this differential equation and
now you apply your calculus and you find that the density is just an exponential with a
time-varying rate. Now if you have a time and a label what you do is you look at the density of
the time and the distribution of the label given the time, and you define the per-label intensity
as the overall intensity times the conditional of the label, and now you plug in and this is
exactly the same formula we had before except you need to condition on history and pull the
sum out of the exponential and make it a product and now we have this form. Okay,
mathematically you might say okay, I can go check the math and it's probably right, but what
does this mean, so I'm going to give you some intuition about what this representation means.
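The slides are not part of this transcript, so the derivation sketched verbally above can be written out as follows. The notation is an assumption chosen for illustration: p and F are the density and CDF of the next event time with F(0) = 0, l ranges over labels, (t_i, l_i) is the i-th event with t_0 = 0, and h_i is the history before t_i. The last line is the density of the first n events.

```latex
% One density over time: intensity = density / (1 - CDF), and F' = p
\lambda(t) = \frac{p(t)}{1 - F(t)} = -\frac{d}{dt}\log\bigl(1 - F(t)\bigr)
\quad\Longrightarrow\quad
p(t) = \lambda(t)\,\exp\!\Bigl(-\int_{0}^{t}\lambda(s)\,ds\Bigr)

% Time and label: per-label intensity = overall intensity times label conditional
\lambda_l(t) = \lambda(t)\,p(l \mid t), \qquad \sum_{l}\lambda_l(t) = \lambda(t)

p(t, l) = \lambda_l(t)\,\exp\!\Bigl(-\sum_{l'}\int_{0}^{t}\lambda_{l'}(s)\,ds\Bigr)

% Condition on history; the sum inside the exponential becomes a product over labels
p(x) = \prod_{l}\prod_{i=1}^{n}
  \lambda_l(t_i \mid h_i)^{\mathbb{1}[l = l_i]}\,
  \exp\!\Bigl(-\int_{t_{i-1}}^{t_i}\lambda_l(s \mid h_i)\,ds\Bigr)
```

With the conditioning on h_i hidden and the intensities held constant, each factor is the familiar Poisson process likelihood, which is the resemblance mentioned above.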
Let's go back to a cartoon world where we only have two kinds of events, blue and red and we
don't have any events yet, so the history is completely empty and so what we have are two
conditional intensity functions that I give you, one for red and one for blue: how fast do you expect red events
to happen at every point in time, and how fast do you expect blue events to happen at every point
in time? Here is an example that I just pulled out of the air and drew a line on the graph. Say a
red event happened. Now this is where this differs from a Poisson process, because Poisson
processes have independent increments. These are conditioned on history, so now we have a
red event in history, we condition on it and our expectation about the future changes. You can
see that given that a red event happened what I expect red things to do in the future just
changed but blue stayed the same. Now the red event happens, we update what we expect to
happen and so on. Blue event happens, we update and you could generate all the data and
then you could draw the conditional intensity every time given everything that happened
before. Now there are two things going on with this representation. First, the behavior of the
reds is characterized just by the red curve and the behavior of the blue is characterized by the
blue curve so I can specify them independently. The second is that the blue
curve didn't change when red events happened, so the occurrence of blue events is
independent of red events in the past, so we have a way of characterizing temporal
conditional independence. We have this representation and what we would like to do is learn
these representations from data. In order to do learning we need to make a modeling
assumption, and the assumption we make is that the conditional intensities are piecewise
constant. For a large class of smooth functions you can approximate them arbitrarily closely
with piecewise constant functions so this is kind of a nice modeling assumption to make. How
can we learn a piecewise constant conditional intensity function from data? Well, we are going
to do it in two steps. First imagine a genie told me what regions of time the intensities were
constant in. In other words when was it switching? So they colored in the timeline for me.
Then all I have to do is figure out what should the rate for red be in the yellow region, and what
should the rate for red be in the green region? Now that is a somewhat easier problem, right?
Just count up the number of reds in yellow, divide by the duration of yellow and that should
give you a pretty good estimate. In fact, you can choose a conjugate prior for the likelihood of
this process that gives you exactly that, right, with smoothing of course. Now how do I come up
with the coloring? Well, we can push this a little further. I'm going to use the symbol s to
indicate a function that colors the timeline because that defines the structure of our model.
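The counting estimate described above (events of one label in a color, divided by the time spent in that color) can be sketched as follows. This is a minimal illustration, not the speaker's implementation. The coloring is assumed to be given as a list of `(start, end, color)` segments. The talk does not name the conjugate prior, so the Gamma prior on each rate (shape `alpha`, rate `beta`) is an assumption; it gives a smoothed estimate equal to the posterior mean. All names are hypothetical.

```python
from collections import defaultdict


def estimate_rates(event_times, segments, alpha=0.0, beta=0.0):
    """Estimate one rate per color for a single label.

    event_times: times of the events of the label being modeled (e.g. red).
    segments:    the coloring, as (start, end, color) half-open intervals.
    alpha, beta: assumed Gamma(shape, rate) prior; 0, 0 gives the plain
                 count / duration estimate.
    """
    counts = defaultdict(int)
    durations = defaultdict(float)
    for start, end, color in segments:
        durations[color] += end - start
        counts[color] += sum(start <= t < end for t in event_times)
    # Posterior mean of the rate under the assumed Gamma prior.
    return {c: (alpha + counts[c]) / (beta + durations[c]) for c in durations}


# Hypothetical example: four red events, timeline colored yellow/green/yellow.
red_times = [0.2, 0.5, 0.7, 3.1]
coloring = [(0.0, 1.0, "yellow"), (1.0, 3.0, "green"), (3.0, 4.0, "yellow")]
print(estimate_rates(red_times, coloring))            # plain counting
print(estimate_rates(red_times, coloring, 1.0, 1.0))  # with smoothing
```

The smoothing matters for colors that contain no events: the plain estimate gives a rate of exactly zero there, while the smoothed one stays positive.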
Once you give me the structure I just showed you how to estimate the parameters. We can put a
prior on structures, i.e., ways of coloring the line, and if we choose the prior carefully we can
actually get a closed form for the probability of the coloring given the data. Well, not quite, up
to a constant factor, but this is good enough because if you give me two different colorings now
I can just compare the probabilities of both of them and pick the more likely one. This is exactly
what we're going to do. We're going to start off with a basis set of colorings. And what I mean
by a basis is that I'm going to start combining these but I want to show you what they do. So
the first guy says I'm going to color the timeline dark if there was a red event up to a second ago.
So the first red event happens, then we start coloring. A second after the burst of red, we stop
coloring and we start again when another red one happens and so on. The second guy
is the same, just shifted over by a second. The third guy looks at blue and so does the fourth. So
these are four different colorings of the timeline and now what I can do is say if I want to
predict reds, which coloring is most useful based on the Bayesian score that I told you about a couple
of slides ago. And I pick the best one. Now I can go one step further. This is a function with
two values, right. I can think of it as a binary decision stump. If this function equals one, take
the colored leaf. Otherwise take the uncolored leaf and I can use the decision stump to color the
timeline. If it is a decision stump I can start splitting the leaves and building a tree. I can take
another function, say the blue one, and I can use this to split that leaf. Well, what does this
mean? Wait. Hold up. The blue one, the blue function is a function of blue events in the
history and what I'm going to do is color the timeline using blue events in the history and then
see does that coloring help me predict red ones? So it's saying the blue events in the history
affect red events in the future. Anyways, how do I split this state? I look at the overlap
between the red cross hatches and the blue regions. That splits that state and now once again I
can look at the score of this coloring. I can choose a different split and look at the score of that
coloring and choose the better one. So I can greedily build up a tree this way and that
completely specifies my model. So here is a completely specified simple toy model. The idea is
we have three kinds of events, A, B and C and they are all bursty. When one happens it will
keep going for a while. The As trigger the Bs and the Bs trigger the Cs, and you can specify
these with decision trees. I'm not going to walk you through the trees. So the first thing we did
was we generated data from this model and then saw whether we could learn it back and of
course we could, but what's more interesting is remember the data center example I started
with. Say we had a bunch of the event logs which we did, we can train one of these trees for
each kind of message that we log and now we care about the send-tech event, so we look at
the tree for sending a technician. Now all you need to remember about this tree is it has four
leaves and what we are going to do is we are going to go to 3 PM, apply the tree and see which leaf
we end up with at 3 PM. It turns out to be the red one on the far right. Now we walk up the
tree and say what events in the past made me color the timeline red there? Those two guys,
and those are the events that now influence my decision to send a technician. Great, so we can
build a model from data and we can use it to understand what's going on in our data. The next thing
we want to do is forecasting. Now the challenge is that this is a continuous time model so there
doesn't happen to be a closed form way to do estimates of what's going to happen in the
future. What we want to do is say I care about some future outcomes, say the user searches
for healthcare, what's the probability of that happening? The nice thing about these models is
that they are really efficient to sample from, so what I do is I sample possible futures, sample a
bunch of them and just count in what proportion of those possible futures the outcome I
care about happens. If we go back to that toy example, remember As trigger Bs and Bs trigger
Cs and what's the probability of seeing a C in the first second? I can sample the hell out of it
and find out what the right estimate is and it happens to be 3 times 10 to the minus 3.
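The sample-and-count forecast described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the speaker's code. Each label's intensity is assumed to depend only on which labels fired within the last `WINDOW` seconds, so intensities are piecewise constant. The rates in `toy_rates` are made up and will not reproduce the number quoted in the talk. All function names are hypothetical.

```python
import random

WINDOW = 1.0  # assumed: rates depend on which labels fired in the last second


def toy_rates(active):
    """Hypothetical piecewise-constant intensities for an A -> B -> C chain."""
    return {
        "A": 1.0 if "A" in active else 0.1,    # A is bursty
        "B": 1.0 if "A" in active else 0.01,   # A triggers B
        "C": 1.0 if "B" in active else 0.001,  # B triggers C
    }


def sample_future(rates_fn, horizon, rng):
    """Draw one event stream on [0, horizon) by forward sampling."""
    t, events = 0.0, []
    while t < horizon:
        recent = [(s, l) for s, l in events if s + WINDOW > t]
        rates = rates_fn({l for _, l in recent})
        # Rates stay constant until the oldest recent event leaves the window.
        next_change = min([s + WINDOW for s, _ in recent] + [horizon])
        wait = rng.expovariate(sum(rates.values()))
        if t + wait >= next_change:
            t = next_change  # exponential waits are memoryless: restart here
            continue
        t += wait
        label = rng.choices(list(rates), weights=list(rates.values()))[0]
        events.append((t, label))
    return events


def estimate(rates_fn, label, horizon, n, seed=0):
    """Fraction of n sampled futures in which `label` occurs before horizon."""
    rng = random.Random(seed)
    hits = sum(
        any(l == label for _, l in sample_future(rates_fn, horizon, rng))
        for _ in range(n)
    )
    return hits / n


if __name__ == "__main__":
    print(estimate(toy_rates, "C", horizon=1.0, n=2000))
```

Because the target outcome is rare under rates like these, many runs of a few hundred samples will contain no hit at all and the estimate will be zero.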
Well, the problem is that it takes a really long time to converge. In fact, for the first hundred
samples we never see that event happening. So I estimate the probability at zero and even
after it starts happening the variances are huge, and this is a standard problem. If the desired
outcome is rare, then sampling doesn't do so well and one of
the standard solutions is to do importance sampling. The idea is that you sample from a
different distribution where the outcome you care about is more likely and then you see what
proportion of those outcomes, those samples have the outcome you care about, and you just
reweight the samples to correct for the fact that you sampled from a different proposal
distribution. So the challenge is to define a proposal distribution P-tilde such that the outcome you
care about becomes more likely where you can still compute the correction factor that you
need. It turns out with our models it is really easy. Say that you cared about some label L in the
first second. Just take your model and boost the intensity of the label L during the first second
as long as the outcome L hasn't happened yet. So what I did was I took a piecewise constant
intensity, added a piecewise constant function to it so I'm still piecewise constant. My sampling
algorithm still works and it turns out that I can compute the correction exactly. Here was the
result from forward sampling. Use importance sampling and you converge much faster and
the variances are lower. So back to the second problem that I was telling you about; which of
these users is going to search for health and wellness related terms on the internet so that we
can advertise to them? So I actually pulled these examples out of the logs. These are actual
test cases that I ran against and this is a precision-recall curve over I think tens of thousands of users
over about a month. You can see that taking advantage of these temporal dynamics actually
gives us much better classification. So just some words about what we are doing now. What
we've done is been able to build models of these worlds where events
happen over time and there are some problems where you want to plan to act in this world.
For example, one of the applications that we are looking at is app usage on cell phones. When
people use apps on the cell phone it takes a while to load up an app so you would want to precache them and so there is a planning problem about when do you want to load up or preload
apps for a user? It's important to not only know what apps to load up but when to load them
up, and so we are trying to develop planning algorithms for this setting. The other is I showed
you the basis functions, like whether a blue event happened in the last second or between 1 and 2
seconds ago. Our models are completely dependent on the right choice of these basis
functions, because once you have the basis functions we build trees based on them and we get
estimators. What is the right set? Can we come up with a set such that we can approximate
any conditional intensity? And so that is another direction that we're looking at. That's it,
thank you.
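The importance sampling idea from the talk (boost the intensity of the target label until it first fires, then reweight) can be sketched as follows. This is an illustration under stated assumptions, not the speaker's implementation. Rates are assumed to depend only on which labels fired in the last `WINDOW` seconds. The correction used is the standard point-process likelihood ratio: for an additive boost b and a first target event at time tau, it is exp(b * tau) times the ratio of the true to the boosted target intensity. All names and rates are hypothetical.

```python
import math
import random

WINDOW = 1.0  # assumed: rates depend on which labels fired in the last second


def toy_rates(active):
    """Hypothetical piecewise-constant intensities for an A -> B -> C chain."""
    return {
        "A": 1.0 if "A" in active else 0.1,
        "B": 1.0 if "A" in active else 0.01,
        "C": 1.0 if "B" in active else 0.001,
    }


def weighted_hit(rates_fn, target, horizon, boost, rng):
    """Sample from the boosted proposal until `target` fires or time runs out.

    Returns the importance weight if `target` fired before horizon, else 0.0.
    """
    t, events = 0.0, []
    while t < horizon:
        recent = [(s, l) for s, l in events if s + WINDOW > t]
        rates = rates_fn({l for _, l in recent})
        true_rate = rates.get(target, 0.0)
        proposal = dict(rates)
        proposal[target] = true_rate + boost  # still piecewise constant
        next_change = min([s + WINDOW for s, _ in recent] + [horizon])
        wait = rng.expovariate(sum(proposal.values()))
        if t + wait >= next_change:
            t = next_change
            continue
        t += wait
        label = rng.choices(list(proposal), weights=list(proposal.values()))[0]
        if label == target:
            # Likelihood ratio of the true model to the proposal up to time t.
            return math.exp(boost * t) * true_rate / proposal[target]
        events.append((t, label))
    return 0.0


def importance_estimate(rates_fn, target, horizon, boost, n, seed=0):
    """Average weight over n proposal samples: estimates P(target < horizon)."""
    rng = random.Random(seed)
    total = sum(weighted_hit(rates_fn, target, horizon, boost, rng)
                for _ in range(n))
    return total / n


if __name__ == "__main__":
    print(importance_estimate(toy_rates, "C", 1.0, boost=1.0, n=2000))
```

Sampling can stop at the first target event because the outcome is decided at that point and the proposal equals the model from then on, so no further correction accumulates.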
>> Ofer Dekel: Okay, any questions?
>>: So you will assume that the labels are independent, but in most real-world scenarios they
are extremely dependent.
>> Asela Gunawardana: No, no, I don't assume that the labels are independent; I'm assuming
that given the history of all of the labels I want to know the intensity of particular targeted
>>: But I thought you said, because we have the red, right? And the red and the blue, right,
and the blue didn't predict the red.
>> Asela Gunawardana: So in that example I wanted to get across that I could represent
independence, but in general what happens is I might ask a question about red when I'm trying
to predict blue
>>: That's why you have the complete?
>> Asela Gunawardana: I have the complete, and that's why we can do the health and wellness
queries. In fact, one of the slides that I actually dropped was we can predict whether a user
who has never been interested in health and wellness before will be in the next week, which is
actually arguably…
>>: But how did you model that, because this space is very large, right? It's only…
>> Asela Gunawardana: Right, so we had to actually do the exhaustive search over all of the
labels, and for queries what we do is we actually use features of the queries and not the queries
>> Ofer Dekel: Any more questions?
>>: I had a question about the very beginning when you talked about the inappropriateness of
standards for the fixed class models, like the Markov models for the [inaudible]?
>> Asela Gunawardana: I wouldn't worry about that. I think that's a little bit too strong, but…
>>: But I guess, but the question is from just a practical perspective one could have doubts
about the times between the events and that's could be observations could be [inaudible].
>> Asela Gunawardana: Sure.
>>: And so it's no longer the case and it's sort of a temporal Markov assumption it would be an
event Markov function if any of those types of Markov [inaudible] so why would that not be the
>> Asela Gunawardana: Event Markov would be maybe a little strong, right? You could.
>>: You could have whatever order you want.
>> Asela Gunawardana: Right. It could be whatever order you want, and in fact, you could do
>>: In fact in that case one could use, for example…
>> Asela Gunawardana: And I think that it would, I think one way of doing that it would reduce
to the same thing. And the only, and once you do that, it turns out that you don't need the
[inaudible] time anyways, but I think that in practice, that distinction may not be as important.
It's a way of being able to separate out the timing of how you step forward in a DBN if you will
with the resolution of the questions that you ask about the past.
>> Ofer Dekel: Any other questions?
>>: Is there any role for classifiers here? Because I was looking at your two examples and I
immediately classified the top guy as a gun nut.
>> Asela Gunawardana: So if you know ahead of time what classification task that you have,
that particular classification task that you care about, that is definitely the way to go. The
setting I'm looking at is where we want to query for arbitrary future outcomes, so we have a
model. So think of graphical models as opposed to classifiers in the non-temporal world and so
I'm looking at a generalization of say graphical models to temporal event streams, but of course
you can go with classifiers. In fact, doing the forecasting you could think of as a particular
>> Ofer Dekel: Time for one more question.
>>: Since you are doing it unsupervised, could you at least optimize for some kind of query
resolution, so would you make the model more complex or [inaudible] some kind of resolution
[inaudible] queries that you could optimize for?
>> Asela Gunawardana: So you could wrap this whole thing, so what we did was training based
on likelihood, right, and you could wrap whatever loss function that you cared about into it.
>>: What about penalty?
>> Asela Gunawardana: Or penalty function, yeah.
>> Ofer Dekel: Let's thank the speaker.