>> Asela Gunawardana: So hi, I'm Asela and I'm going to talk about models for temporal event streams. This is of course work with a bunch of great collaborators. To start off, what do I mean by temporal event streams? I simply mean streams of events that have both a timestamp and a label, so in this little cartoon I have kind of a simple one. There are two kinds of labels, blue and red, and over time the events occur and some of them are blue and some of them are red. This is a cartoon view of a bunch of kinds of data that we actually care about, like search query logs, webpage visits, Facebook check-ins and data center event logs. For example, with search queries, instead of having just blue and red we would have a whole bunch of different possible labels corresponding to different queries, but they all have timestamps. What we want to model is what events are going to happen and when they are going to happen, and this would depend on what has happened in the past and when it happened in the past. So it's not just what, but when, and I'll keep repeating this and hopefully you will eventually see how this is different from what's been done before. Why do I want to do this? Well, one reason to do it is to simply understand what's going on in these systems. For example, imagine that we have a data center and all of the machines in the data center log all of the errors that they spew out, and they have some kind of automatic management system that also logs everything that it does, and the data center is humming along and there are some errors. Then the management system decides to maybe install patches on some machines and reboot some other machines and re-initialize some other machines, and eventually, every so often, it gives up and somebody's pager goes off. And they come in and they get in a little golf cart and they go to the machine and they actually putz around with it, and that process is actually really expensive. I might want to know why I was sending a technician to the machine. If I look at the error logs on that machine, what I see is a complete mess, a bunch of different events happening over time, and at about three o'clock in this case we sent a technician to the machine. Well, why? Well, you could painstakingly go event by event and try to figure out what's going on, or if you had a model that understood the dependencies between these events, it could show you that these are the two key events in the past that caused the technician to go there. The C drive failed; the automatic system tried to reimage the machine and that didn't help, so the pager went off. Another thing you might want to do is try to predict the future. So say that you are running a search engine and you have users putting in queries, so you kind of have an idea what they are interested in, and an advertiser comes to you and says, I'm an advertiser in the health and wellness industry, can you please show this ad? Plaster it on the top of Hotmail, but just to people that are actually going to care about health and wellness. I don't want to wait until they issue the query. I want to show it to them the whole week before or after. You want to figure out which of the users is actually going to care. So you've been watching a couple of users. One of them searches for Elizabeth Edwards's will, and another one searches for pellet guns, and then they search for camo, and the next day they search for online education, and then they search for Valentines and they misspell it and then they immediately correct themselves.
The other one, two days later, searches for snow pictures and then Versailles Palace, and this is when the advertiser comes to you and says show your users this health and wellness ad. You're sitting there scratching your head saying, who should I show it to? So how many people here think that we should show the ad to user A? Only one person has an opinion, okay, two. User B? You guys may be right. You may be wrong. But let's see what happens. The first user searches for Amazon and then for chairs on sale, then adoption and the Dave Thomas Foundation, which I had to look up; it actually works in adoption. And then, those of you that said user B, you were right: exercise equipment and high blood pressure. What we would want is models that can actually make these predictions. Maybe we could crowdsource it, right? [laughter] But then show the ads to the right people. One of the challenges when we are trying to build these models is that the data is really bursty, right? So when the person corrected their spelling on Valentines, and when they searched for pellet guns and camo, those queries came within tens of seconds of each other, and then sometimes we didn't see him -- saying "him", maybe I'm wrong, but because of the pellet guns and the camo -- for four or five days, so the time spans can vary from seconds to days. This is a challenge if you are using discrete time methods like DBNs, HMMs or Markov chains, because, well, you could ignore the times, but I think it's pretty easy to argue here that the times are kind of informative. Or you can decide to sample every 10 seconds or so, so as not to lose any information, but now you've got these long stretches of tens of thousands of steps where nothing happens. This makes it really hard to learn the dependencies involved, and it makes it hard to do inference or forecasting forward over tens of thousands of steps. So a different approach is to do the modeling directly in continuous time, and there has been a bunch of work that has tried to do that. We actually tried a few of these. It turns out that a bunch of these methods restrict the kinds of dependencies you can have on the past. It wasn't clear whether the assumptions that were chosen for those particular models actually matched the data that we care about. Another problem is that we want to operate on Bing query logs, which are a little big. A lot of these methods don't scale very well to that kind of data, and so what we want is first to make less restrictive assumptions about the data, and second, still be able to scale to these huge data sets. So what I'm going to tell you about is a technique we came up with to do that, using a modeling approach involving conditional intensities, and I'm going to take a little detour to explain what those are. To be a little more formal about the data: mathematically, the process we are looking at is a marked point process. It's a point process because it's generating times, and it's marked because each time is marked with a label telling you what happened at that time. You can write the likelihood of a marked point process forward in time, predicting the time and the label of each event given everything in the past. What's really cool is that marked point process theory tells us that we have this conditional intensity representation for any marked point process. Those of you who have looked at Poisson processes might notice that if you squint a little bit and hide the conditioning on history, it looks a lot like a Poisson process.
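In symbols (the notation here is not from the talk itself, though this is the standard point-process form), the conditional intensity representation of the likelihood over an observation window $[0, T]$ looks like this:

\[
p\big(\{(t_i,\ell_i)\}_{i=1}^{n}\big) \;=\; \prod_{i=1}^{n} \lambda_{\ell_i}(t_i \mid h_{t_i}) \,\cdot\, \exp\!\Big(-\sum_{\ell} \int_{0}^{T} \lambda_{\ell}(\tau \mid h_{\tau})\, d\tau\Big),
\]

where $h_t$ is the history of all events before time $t$ and $\lambda_\ell(t \mid h_t)$ is the conditional intensity of label $\ell$: how fast you expect events with that label to happen at time $t$ given that history. Hide the conditioning on $h$ and this is exactly the likelihood of a set of independent Poisson processes, one per label.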
But let me quickly give you a sketch to convince you that that representation is possible. Say you just had a density over times. You could then define the intensity as the density divided by one minus the cumulative distribution function, the CDF. Now, the derivative of the CDF is the density, so multiplying through by the denominator you get a differential equation, and now you apply your calculus and you find that the density is just an exponential, just with a time-varying rate. Now, if you have a time and a label, what you do is you look at the density of the time and the distribution of the label given the time, and you define the per-label intensity as the overall intensity times the conditional of the label, and now you plug in, and this is exactly the same formula we had before, except you need to condition on history and pull the sum out of the exponential and turn it into a product, and now we have this form. Okay, mathematically you might say, okay, I can go check the math and it's probably right, but what does this mean? So I'm going to give you some intuition about what this representation means. Let's go back to a cartoon world where we only have two kinds of events, blue and red, and we don't have any events yet, so the history is completely empty, and so what we have are two conditional intensity functions, one for red and one for blue: how fast do you expect red events to happen at every point in time, and how fast do you expect blue events to happen at every point in time? Here is an example that I just pulled out of the air and drew a line on the graph. Say a red event happened. Now, this is where this differs from a Poisson process, because Poisson processes have independent increments. These are conditioned on history, so now we have a red event in history, we condition on it, and our expectation about the future changes. You can see that given that a red event happened, what I expect red things to do in the future just changed, but blue stayed the same. Now another red event happens, we update what we expect to happen, and so on. A blue event happens, we update, and you could generate all the data and then you could draw the conditional intensity at every time given everything that happened before. Now, there are two things going on with this representation. First, the behavior of the reds is characterized just by the red curve and the behavior of the blues is characterized just by the blue curve, so I can specify them independently. The second is that the blue curve didn't change when red events happened, so the occurrence of blue events is independent of red events in the past, so we have a way of characterizing temporal conditional independence. We have this representation, and what we would like to do is learn these representations from data. In order to do learning we need to make a modeling assumption, and the assumption we make is that the conditional intensities are piecewise constant. For a large class of smooth functions, you can approximate them arbitrarily closely with piecewise constant functions, so this is kind of a nice modeling assumption to make. How can we learn a piecewise constant conditional intensity function from data? Well, we are going to do it in two steps. First, imagine a genie told me what regions of time the intensities were constant in. In other words, when was it switching? So they colored in the timeline for me.
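To make the genie step concrete (again, notation not taken from the talk): if a label's intensity is a constant $\lambda_c$ whenever the timeline is colored $c$, and color $c$ covers a total duration $d_c$ containing $n_c$ events of that label, then the likelihood above reduces to

\[
\prod_{c} \lambda_c^{\,n_c}\, e^{-\lambda_c d_c},
\]

which is maximized at $\hat{\lambda}_c = n_c / d_c$, the count-over-duration estimate described next.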
Given such a coloring, all I have to do is figure out: what should the rate for red be in the yellow region, and what should the rate for red be in the green region? Now that is a somewhat easier problem, right? Just count up the number of reds in yellow, divide by the duration of yellow, and that should give you a pretty good estimate. In fact, you can choose a conjugate prior for the likelihood of this process that gives you exactly that, right, with smoothing of course. Now, how do I come up with the coloring? Well, we can push this a little further. I'm going to use the symbol s to indicate a function that colors the timeline, because that defines the structure of our model. Once you give me the structure, I just showed you how to estimate the parameters. We can put a prior on structures, i.e., ways of coloring the timeline, and if we choose the prior carefully we can actually get a closed form for the probability of the coloring given the data. Well, not quite, only up to a constant factor, but this is good enough, because if you give me two different colorings, now I can just compare the probabilities of both of them and pick the more likely one. This is exactly what we're going to do. We're going to start off with a basis set of colorings. And what I mean by a basis is that I'm going to start combining these, but I want to show you what they do. So the first one says: I'm going to color the timeline dark if there was a red event up to a second ago. So the first red event happens, then we start coloring. A second after the burst of red, we stop coloring, and we start again when another red one happens, and so on. The second one is the same, just shifted over by a second. The third one looks at blue, and so does the fourth. So these are four different colorings of the timeline, and now what I can do is say, if I want to predict reds, which coloring is most useful, based on the Bayesian score that I told you about a couple of slides ago? And I pick the best one. Now I can go one step further. This is a function with two values, right? I can think of it as a binary decision stump: if this function equals one, take the colored leaf, otherwise take the uncolored leaf, and I can use the decision stump to color the timeline. If it is a decision stump, I can start splitting the leaves and building a tree. I can take another function, say the blue one, and I can use it to split that leaf. Well, what does this mean? Wait. Hold up. The blue function is a function of blue events in the history, and what I'm going to do is color the timeline using blue events in the history and then see, does that coloring help me predict red ones? So it's saying the blue events in the history affect red events in the future. Anyway, how do I split this leaf? I look at the overlap between the red cross-hatching and the blue regions. That splits that leaf, and now once again I can look at the score of this coloring, I can choose a different split and look at the score of that coloring, and choose the better one. So I can greedily build up a tree this way, and that completely specifies my model.
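Here is a rough sketch, in Python, of the kind of closed-form score such a greedy search can use, under the conjugate-prior assumption mentioned above: within each colored region the target label gets a constant rate with a Gamma prior, the rate integrates out in closed form, and the posterior-mean rate is the smoothed count-over-duration estimate. The function names and the exact prior parameterization here are illustrative, not taken from the talk or the underlying paper.

```python
import math

# Sketch of a Bayesian score for a coloring of the timeline, assuming the
# target label has a constant rate within each colored region and a conjugate
# Gamma(alpha, beta) prior on that rate. The rate integrates out in closed
# form, so two colorings can be compared directly by their scores.

def region_log_marginal(n_events, duration, alpha=1.0, beta=1.0):
    # log of  integral over lam of  lam^n * exp(-lam * d) * Gamma(lam; alpha, beta) dlam
    return (alpha * math.log(beta)
            - math.lgamma(alpha)
            + math.lgamma(alpha + n_events)
            - (alpha + n_events) * math.log(beta + duration))

def coloring_log_score(regions, alpha=1.0, beta=1.0):
    # `regions` maps each color to (event count, total duration) for the target label.
    return sum(region_log_marginal(n, d, alpha, beta) for n, d in regions.values())

def smoothed_rate(n_events, duration, alpha=1.0, beta=1.0):
    # Posterior-mean rate: the "count over duration, with smoothing" estimate.
    return (alpha + n_events) / (beta + duration)

# Toy comparison: 8 red events packed into 2 time units of "yellow" versus
# 1 red event spread over 10 units of "green" is better explained by the
# two-color structure than by a single constant rate over all 12 units.
two_colors = {"yellow": (8, 2.0), "green": (1, 10.0)}
one_color = {"all": (9, 12.0)}
print(coloring_log_score(two_colors) > coloring_log_score(one_color))  # True
print(smoothed_rate(8, 2.0))  # 3.0 events per time unit in yellow
```

With alpha = beta = 1, the smoothed rate for the yellow region is (1 + 8) / (1 + 2) = 3 events per unit time, which is just count over duration with a little smoothing, and the two-color structure scores higher because it explains the burst of reds.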
So here is a completely specified simple toy model. The idea is we have three kinds of events, A, B and C, and they are all bursty: when one happens, it will keep going for a while. The As trigger the Bs and the Bs trigger the Cs, and you can specify these with decision trees. I'm not going to walk you through the trees. So the first thing we did was generate data from this model and then see whether we could learn it back, and of course we could, but what's more interesting is, remember the data center example I started with? Say we had a bunch of the event logs, which we did. We can train one of these trees for each kind of message that we log, and now we care about the send-a-technician event, so we look at the tree for sending a technician. Now, all you need to remember about this tree is that it has four leaves, and what we are going to do is go to 3 PM, apply the tree, and see which leaf we end up in at 3 PM. It turns out to be the red one on the far right. Now we walk up the tree and say, what events in the past made me color the timeline red there? Those two guys, and those are the events that now influence my decision to send a technician. Great, so we can build a model from data and we can use it to understand what's going on in our data. The next step we want to do is forecasting. Now, the challenge is that this is a continuous time model, so there doesn't happen to be a closed-form way to estimate what's going to happen in the future. What we want to do is say, I care about some future outcome, say the user searches for healthcare, and ask, what's the probability of that happening? The nice thing about these models is that they are really efficient to sample from, so what I do is I sample possible futures, sample a bunch of them, and just count in what proportion of those possible futures the outcome I care about happens. If we go back to that toy example, remember, As trigger Bs and Bs trigger Cs, and what's the probability of seeing a C in the first second? I can sample the hell out of it and find out what the right estimate is, and it happens to be 3x10^-3. Well, the problem is that it takes a really long time to converge. In fact, for the first hundred samples we never see that event happening, so I estimate the probability at zero, and even after it starts happening the variance is huge, and this is a standard problem: if the desired outcome is rare, then sampling doesn't do so well, and the standard solution, or one of the standard solutions, is to do importance sampling. The idea is that you sample from a different distribution where the outcome you care about is more likely, and then you see what proportion of those samples have the outcome you care about, and you just reweight the samples to correct for the fact that you sampled from a different proposal distribution. So the challenge is to define a proposal distribution P-tilde such that the outcome you care about becomes more likely while you can still compute the correction factor that you need. It turns out with our models it is really easy. Say that you cared about some label L in the first second. Just take your model and boost the intensity of the label L during the first second, as long as the outcome L hasn't happened yet. So what I did was take a piecewise constant intensity and add a piecewise constant function to it, so I'm still piecewise constant, my sampling algorithm still works, and it turns out that I can compute the correction exactly. Here was the result from forward sampling; use importance sampling and you converge much faster and the variance is lower.
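Here is a stripped-down sketch of that boost-and-reweight idea, assuming a single label with a constant intensity so the answer can be checked in closed form; in the actual models the intensities are piecewise constant and history dependent, but the correction factor is the same kind of density ratio. All names and numbers are illustrative.

```python
import math
import random

# Estimate the probability that a rare label fires at least once in the first
# second, for a label with constant intensity lam. The exact answer,
# 1 - exp(-lam), lets us check both estimators.

def naive_estimate(lam, n_samples, rng):
    """Forward-sample the first event time and count how often it lands in [0, 1]."""
    hits = 0
    for _ in range(n_samples):
        t = rng.expovariate(lam)            # first event of a rate-lam process
        hits += (t <= 1.0)
    return hits / n_samples

def importance_estimate(lam, boost, n_samples, rng):
    """Sample from a proposal whose intensity is lam + boost on [0, 1] until the
    label fires, then reweight each sample by the density ratio p/q."""
    total = 0.0
    for _ in range(n_samples):
        t = rng.expovariate(lam + boost)    # first event under the boosted intensity
        if t <= 1.0:
            # p/q for this path, over the window where the intensities differ:
            #   survival term  exp(-lam*t) / exp(-(lam+boost)*t) = exp(boost*t)
            #   event term     lam / (lam + boost)
            total += (lam / (lam + boost)) * math.exp(boost * t)
        # If no event lands in [0, 1], the indicator is 0; beyond t = 1 the
        # proposal reverts to the original intensity, so nothing more is needed.
    return total / n_samples

if __name__ == "__main__":
    rng = random.Random(0)
    lam = 3e-3                              # rare outcome, roughly 3x10^-3
    print("truth     :", 1.0 - math.exp(-lam))
    print("naive     :", naive_estimate(lam, 10_000, rng))
    print("importance:", importance_estimate(lam, 1.0, 10_000, rng))
```

Boosting makes the rare window far more likely under the proposal, and the lam / (lam + boost) and exp(boost * t) factors undo the change exactly, so the estimator stays unbiased while its variance drops.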
So, back to the second problem that I was telling you about: which of these users is going to search for health and wellness related things on the internet, so that we can advertise to them? I actually pulled these examples out of the logs. These are actual test cases that I ran against, and this was precision-recall over, I think, tens of thousands of users over about a month. You can see that taking advantage of these temporal dynamics actually gives us much better classification. So, just some words about what we are doing now. What we've done is been able to build models of these worlds where events happen over time, and there are some problems where you want to plan how to act in this world. For example, one of the applications that we are looking at is app usage on cell phones. When people use apps on a cell phone it takes a while to load up an app, so you would want to precache them, and so there is a planning problem about when you want to load up or preload apps for a user. It's important to know not only what apps to load up but when to load them up, and so we are trying to develop planning algorithms for this setting. The other is, I showed you the basis functions, like whether a blue event happened in the last second, or between one and two seconds ago. Our models are completely dependent on the right choice of these basis functions, because once you have the basis functions we build trees based on them and we get estimators. What is the right set? Can we come up with a set such that we can approximate any conditional intensity? And so that is another direction that we're looking at. That's it, thank you. [applause] >> Ofer Dekel: Okay, any questions? >>: So you assume that the labels are independent, but in most real-world scenarios they are extremely dependent. >> Asela Gunawardana: No, no, I don't assume that the labels are independent; I'm assuming that given the history of all of the labels, I want to know the intensity of particular targeted labels. >>: But I thought you said, because we have the red, right? And the red and the blue, right, and the blue didn't predict the red. >> Asela Gunawardana: So in that example I wanted to get across that I could represent independence, but in general what happens is I might ask a question about red when I'm trying to predict blue. >>: That's why you have the complete history? >> Asela Gunawardana: I have the complete history, and that's why we can do the health and wellness queries. In fact, one of the slides that I actually dropped was we can predict whether a user who has never been interested in health and wellness before will be in the next week, which is actually arguably… >>: But how did you model that, because this space is very large, right? It's only… >> Asela Gunawardana: Right, so we had to actually do the exhaustive search over all of the labels, and for queries what we do is we actually use features of the queries and not the queries themselves. >> Ofer Dekel: Any more questions? >>: I had a question about the very beginning, when you talked about the inappropriateness of the standard fixed time-step models, like the Markov models, for the [inaudible]? >> Asela Gunawardana: I wouldn't worry about that. I think that's a little bit too strong, but… >>: But I guess the question is, from just a practical perspective, one could have doubts about the times between the events, and those could be observations, could be [inaudible]. >> Asela Gunawardana: Sure. >>: And so it's no longer the case that it's sort of a temporal Markov assumption; it would be an event Markov assumption, if any of those types of Markov [inaudible], so why would that not be the approach?
>> Asela Gunawardana: Event Markov would be maybe a little strong, right? You could. >>: You could have whatever order you want. >> Asela Gunawardana: Right. It could be whatever order you want, and in fact, you could do that. >>: In fact, in that case one could use, for example… >> Asela Gunawardana: And I think one way of doing that would reduce to the same thing. And once you do that, it turns out that you don't need the [inaudible] time anyway, but I think that in practice that distinction may not be as important. It's a way of being able to separate out the timing of how you step forward, in a DBN if you will, from the resolution of the questions that you ask about the past. >> Ofer Dekel: Any other questions? >>: Is there any role for classifiers here? Because I was looking at your two examples and I immediately classified the top guy as a gun nut. [laughter] >> Asela Gunawardana: So if you know ahead of time what classification task you have, that particular classification task that you care about, that is definitely the way to go. The setting I'm looking at is where we want to query for arbitrary future outcomes, so we have a model. So think of graphical models as opposed to classifiers in the non-temporal world; I'm looking at a generalization of, say, graphical models to temporal event streams, but of course you can go with classifiers. In fact, doing the forecasting you could think of as a particular classifier. >> Ofer Dekel: Time for one more question. >>: Since you are doing it unsupervised, could you at least optimize for some kind of query resolution? So would you make the model more complex or [inaudible] some kind of resolution [inaudible] queries that you could optimize for? >> Asela Gunawardana: So you could wrap this whole thing, so what we did was training based on likelihood, right, and you could wrap whatever loss function you cared about into it. >>: What about penalty? >> Asela Gunawardana: Or penalty function, yeah. >> Ofer Dekel: Let's thank the speaker. [applause]