>> Arvind Arasu: ...he's a Ph.D. student at University of Washington and has done some interesting work on problems to [inaudible] and we'll be talking about the same.

>> Christopher Re: So I'm going to talk about my dissertation work, or an aspect of my dissertation work in managing large-scale probabilistic databases. And before I get started, just feel free to ask questions during the talk at any time. I don't mind; it makes it more entertaining for me. So if anything is unclear or you just want to poke fun at me, just raise your hand or something. Thanks.

Okay. So let me give you the one-slide overview of what I'm up to. So today applications are driven by data. And increasingly that data is large and imprecise. So let me give you three examples of this. The first example is sensor data. This data is imprecise because we're trying to measure the physical world. And whenever we try to measure the physical world, we have things like measurement error, which is necessarily imprecise.

A second example is business intelligence applications. Here, imagine a competitor of something like the Kindle 2, so some competitor of Amazon, that wants to know how well the Kindle 2 is doing in the marketplace. So one way to do this is to go and extract posts from news and blogs and things like that, extract some facts, and then do some complex analytic querying on top of them. Now, the problem is that that step of extracting facts is necessarily imprecise.

A third and more classical example is data integration. Here we have two databases that talk about the same real-world entities. But the way in which they refer to those entities is different. So we have two options. We could make sure that all the references are consistent, which is a very expensive task of data cleaning, or, as I and some others have advocated, we can tolerate some imprecision during the merge and thereby lower the cost of data cleaning, or data integration.

Now, in each of these applications, this is the kind of data we'd like to manage in a database, maybe a streaming database, maybe a relational database. We want things that a database would provide, like rich structured querying. But the problem is that current databases are precise. They can't understand the imprecise semantics of the data that we're talking about.

So in this talk I'm going to tell you about my work, which is about databases that can manage large imprecise collections of data. And the way that we're going to model that imprecision in my work specifically is using probabilities.

Now, probabilities are great. They give us a uniform way to talk about all those different kinds of imprecision, so we can get to the fundamental data management issues. But there's a cost to using them as well. When we go to do things like answer a query, we have to combine all these different alternatives. And this is much more difficult than standard relational processing.

So the challenge I'm going to talk to you about here is about managing the performance of rich structured queries on probabilistic data. And that's going to be the focus of our talk.

Okay. So first what I'm going to talk about are three imprecise database applications in a little bit of detail so you can see what the data look like, what the queries look like, and what the actual imprecision we're modeling is. Then I'll tell you a bit about managing relational probabilistic databases, which really forms the core of my dissertation and resulted in a system, MystiQ, which is the screen shot on the side.

Then if we have time, I'll also tell you about some other systems I've built, including one for streaming probabilistic data and some things -- some other side projects. Then I'll talk about future work and also my vision for the future of data management and finally conclude.

So let's talk about those imprecise database applications. Well, the first one I want to tell you about is RFID. So RFID is the technology that lets you track objects as they move through space. Originally RFID was used to track things in supply chains, like you have a pallet of 30 Mach3s sitting in your warehouse.

But recently people have realized that RFID enables much richer applications; for example, equipment tracking in hospitals. So we want to build applications on this kind of RFID data, and the kind of things we want to do are answer structured queries, like alert me when Joe enters 422. Now, what you see on your screen here is a deployment that we'd like to answer this query over, an RFID deployment. A through E signify the readers, right, and Joe, our good little husky wearing purple at the bottom, has a tag on him, an RFID tag. And the idea is that when he walks by one of those readers, he'll be read, the system will know where he is and hopefully will be able to answer this query.

So an ideal system, it would look something like this. Joe walks out of the room and he walks down the hallway, and as he walks down the hallway, you can see that his readings are recorded in this table, and when he enters 422, we hope the system says, ah, Joe entered Room 422 at time T equals 7. And you can imagine that the way this system would do this is that, well, it didn't see him at antenna E and it reasons that perhaps he went into 422. This is the ideal picture. This is where we'd like to get.

Now, unfortunately, the challenge we're going to face is that this data is actually imprecise, and it's imprecise for three reasons. The first reason is that readers are not in all areas of interest. So in our little cartoon example you see that there are readers in A through E, but they're not in 422. Right? And you can imagine in real deployments, like, for example, in a hospital, you can't put a reader inside the intensive care unit. The equipment in there is very sensitive and not shielded. So you'll always have some areas that are off limits to these readers.

A second challenge, and a little bit more subtle, is that there's an abstraction mismatch. So you see the kind of data we get from the deployment. Joe's tag is read at time 3 by antenna A. It's at this very low level of granularity. But the queries we want to ask are at a much higher level. They're things about rooms and offices or maybe ICUs. And it's pretty clear that there's no 1:1 mapping between those antenna readings and actually which room Joe went into. In this example, given that data, he could have just as easily gone into the room with D in it. So there's some ambiguity or imprecision in the data.

A third problem is that in real-life deployments like this, readings are often missed. So even though we were able to reason, for example, that he went into one of those two rooms, because we didn't see an antenna reading at E, the reading at E could have just been simply missed. And this is actually a real problem in deployments: if the RFID tag touches your skin or the reader touches your skin, the read rates drop below, for example, 30 percent. So this is a real problem as well.

So what we're going to do to handle all these things so we can answer our rich structured queries is we're going to do some preprocessing, we're going to borrow from the large AI literature, and we're going to build a probabilistic model that's going to take in as input these antenna readings and give us back something in a higher logical level in terms of offices and hallways.

The cost of this transformation is that now the data are going to be imprecise and we're going to have to deal with this imprecision when we do query processing. So let's look in a little bit more detail what that preprocessing looks like.

So what you see on your screen here is a picture of the sixth floor at the University of Washington in our computer science building. It's an architectural diagram. There are now some offices and hallways, and the blue lines are a connectivity diagram.

Now, these green circles are actually those readers or antennas that are lining the hallways, the RFID readers. And as you see they're lining only some of the hallways.

Now, in the movie that I'm about to show you, this little blue ring here, who represents Joe, is going to walk down the hallway, walk into a lab, and walk out the other side.

The thing to pay attention to is that as he walks by an antenna, the antenna will change from green to yellow, and you'll see that there'll be some periods of time when he's actually in the lab that he's not read by any of these antennas. The reason he's not read by the antennas is, well, there are no antennas in the lab, so pay special attention to when he makes that left turn and goes to the lab.

So here Joe's walking down the hallway, he's read by one or more readers at each time. He goes into the lab and after a little while he's not read. Then he walks down the other side. Now, you can imagine that those yellow flashes are really the kind of data we're getting back from the deployment. You can imagine easily how those yellow flashes translate into the kind of table that I showed you a slide earlier.

So now what we're going to do is we're going to do some inference process. We're going to try and go from this low-level data, those flashing lights, to some higher level data.

And the inference mechanism that I'm going to show you is a very simple and intuitive one. There are alternative ones, and this is prior art. It's called particle filtering. So the goal of particle filtering is to infer where Joe actually is given some of those antenna readings. So these little orange circles here, orange particles, I want you to think about them as a guess of where Joe actually is, given the antenna readings that you've seen so far.

So now we'll watch the movie and see how the particle filters actually track him. And pay special attention to when they get next to that lab. So you see the particles do a pretty good job of tracking when there are sensor readings, and there's a little visualizer lag coming up. But there you see that the particles, when he goes through that lab when there are no readings, are able to interpolate the data. So some particles went in one lab and some other set of particles went in a different lab. So this is what the data actually looks like. This is what the probabilistic data actually looks like.

So what have we really done here? Well, we've gone from the low-level data, which were in terms of these antenna readings, and having all the sorts of imprecisions, and we've abstracted it up to get some data that looks like the data on the right-hand side. For example, at each time period we just count the number of particles that are in an office or hallway or so on, and now we have a distribution over logical locations.
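
To make that preprocessing step a bit more concrete, here is a minimal sketch of the kind of particle-filter step and aggregation being described: particles guess where Joe is, they are moved and reweighted by the readings, and then counted per logical location to produce rows like the table on the right. The room names, the motion model, and the antenna coverage are all hypothetical, not taken from the actual deployment.

```python
import random
from collections import Counter

ROOMS = ["Hallway-A", "Hallway-B", "Lab-420", "Office-422"]   # hypothetical floor plan

NEIGHBORS = {
    "Hallway-A":  ["Hallway-B"],
    "Hallway-B":  ["Hallway-A", "Lab-420", "Office-422"],
    "Lab-420":    ["Hallway-B"],
    "Office-422": ["Hallway-B"],
}

def propagate(particle):
    """Toy motion model: a particle stays put or steps to an adjacent location."""
    return particle if random.random() < 0.6 else random.choice(NEIGHBORS[particle])

def reweight(particles, antenna_reading):
    """Resample particles so they agree with the antenna reading, if there is one.
    With no reading (Joe is in an uncovered room), all particles survive unchanged."""
    if antenna_reading is None:
        return particles
    covered = {"A": "Hallway-A", "E": "Hallway-B"}             # which hallway each antenna lights
    weights = [0.9 if p == covered.get(antenna_reading) else 0.1 for p in particles]
    return random.choices(particles, weights=weights, k=len(particles))

def location_distribution(particles):
    """The aggregation from the talk: count particles per logical location and
    emit a probability per room."""
    counts = Counter(particles)
    return {room: counts[room] / len(particles) for room in ROOMS}

# One simulated time step with no reading at all (Joe is inside the lab).
particles = ["Hallway-B"] * 100
particles = reweight([propagate(p) for p in particles], antenna_reading=None)
print(location_distribution(particles))    # e.g. {'Hallway-B': 0.6, 'Lab-420': 0.2, ...}
```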

Now, looking at that table, we've solved already one of our problems. You can imagine how you would write such a simple structured query, how you would build a system to process on top of data that looks like that. If it didn't have that probability attribute, that special semantic attribute.

Now, the challenge is going to be that we're going to take this as input to our system, data that looks like this on the right-hand side, and we're going to process these structured queries. And the challenge is going to be that those structured queries are going to be difficult to process because we have to combine all possible ways of where Joe was at some time. Maybe he's in the office, maybe he's in the hallway and so on. And we're going to focus on really efficient processing for that.

Let me give you a second application of imprecise data that has nothing to do with physical measurement but where the data still result in some amount of imprecision. So what you see here is a picture of my iTunes while I'm listening to a jazz song.

Now, the interesting part for us is this right-hand side bit over here. This is called the iLike sidebar. iLike.com is a Seattle-area startup that some of you may or may not know; basically their business model is that they want to embed a social network into iTunes.

Now, we're going to focus on this sidebar, so I'm going to zoom in on it right now. The idea of this sidebar, how they're going to make money, is that they're going to recommend things to you like which shows to see or which songs you want to buy. And to do that, they're going to leverage the fact that you're embedded in some social network.

So here in the social network I can see, for example, that some of my friends are listening to songs and so on. And if I click on this related tab, what I can do is get some songs that are related to the song I happen to be listening to in my iTunes, songs that I might then buy, for example.

So when I click on related, what I get back are a bunch of jazz songs. Now, the bunch of jazz songs that it recommends to me as good matches to the song I'm listening to reflects a very rich similarity. It's combined a bunch of information. Not only has it combined the fact that I was listening to a jazz song and those are jazz songs that are returned, but it also didn't give me, for example, songs that were already in my library.

More interestingly, it also does things like look at my friend network and see what are songs that my friends have that I don't have. Now, the point of all this is that it combines a relatively large amount of data to compute these rankings, and it does it very, very efficiently. And just to give a size of the data, there are around 30-plus-million users of iLike right now. It's one of the bigger applications on Facebook. So it's a nontrivial amount of data that they have to crunch to compute this recommendation.

Now, you could certainly imagine building a one-off matching engine that would do exactly this similarity computation. But this very same data that they need to access, that they generated for the similarity computation, their business needs to access in different ways. So they want to recommend, for example, which shows should you see. This is one of their business style queries. And to do that, they need to ask complex queries that look a lot like the one on the screen. Who should I give tickets to, Lupe Fiasco tickets to.

So to answer this query, what they actually want to do is they want to do some analytic-style queries, almost OLAP-style queries on top of this data, some ad hoc queries. And the point of this is that there's some value now in DB integration, because their alternative approach would be to build sort of one-off engines to do each of these tasks that touch the same data.

Now, the point is that we would want to support exactly this kind of integrated querying with the matching and the similarity and also these complicated structured queries on top.

The challenge for us is how are we going to process these efficient -- these queries efficiently at really large scales. Because you see that the data itself is actually several gigabytes of data of similarity and friend data and things like that.

Okay. Let me give you one last example with a little bit more detail in it so we can go a little bit deeper about it. This is combining information extraction and some deduplication techniques. So the motivating query for this example is we want to find papers in VLDB 2008 written by authors at the University of Washington. Now, one way to answer this query is to build some extraction engine and go extract these papers from people's Web pages.

So imagine we were able to do that part perfectly. What we would get out of the information extraction is a nice structured table that would tell us, for example, you know, the title of each paper, the author, and the conference and the year. So we get some nice, rich, structured information. Now, if you're an expert in information extraction you know that obtaining this table itself is very, very difficult and can also be modeled as imprecise data. But I'm going to ignore that for the moment because I want to focus on something different.

So if we were to take this query and evaluate it on this data, we'd face a very well-known problem. Even though our query talks about VLDB and the entity VLDB is clearly inside of our data, it's actually represented in two slightly different ways, by different strings, right, actually in three different ways.

So if we posed our query on this table naively with a standard relational engine, we would get no answers. So what do we do here? We do the standard thing. We compute a fuzzy join. So the output of a fuzzy join, the way we -- what happens is we look at each pair of strings and we compute the extent to which they're similar. And the output of this fuzzy join we can model as a simple table here, a simple relation. And what we'll do is we'll model that score as a probability, and I'll come to that a little bit later.
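
As a rough illustration of that fuzzy-join step, here is a toy sketch that compares conference strings with a trigram Jaccard measure and records the score as the probability attribute of the match table. The similarity function and the example strings are my own stand-ins, not the ones used in the actual system.

```python
def trigrams(s):
    s = s.lower()
    return {s[i:i + 3] for i in range(max(1, len(s) - 2))}

def similarity(a, b):
    """Jaccard similarity on character trigrams, a stand-in for whatever
    similarity function the fuzzy join actually uses."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

query_string = "VLDB 2008"
extracted = ["VLDB '08", "Proc. VLDB 2008", "SIGMOD 2008"]

# The output of the fuzzy join: one row per pair of strings, with the score
# interpreted as the (marginal) probability that the pair refers to the same entity.
match_table = [(query_string, e, round(similarity(query_string, e), 2)) for e in extracted]
for row in match_table:
    print(row)
```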

But given this data, you can imagine very easily how you would combine that uncertain or imprecise information of similarity to compute the answer to your query. You could probably imagine writing some ad hoc program that's going to combine scores in some way and compute an answer.

The problem you face is if you want to write much richer queries, right? Each time you write these queries, if you don't have a principled semantic, you have to write some new ad hoc program to do it and it's not clear how you're going to be able to transfer from one domain to the next. So we want to support rich structured queries like SQL on top of this kind of imprecise or uncertain data.

And, again, the goal is going to be that we want to support these queries efficiently. And we're going to consider datasets that are large in the sense that they would be something you would stick inside a standard relational engine, and now we have to do all this extra overhead of processing this uncertainty. So that's going to be the challenge I'm going to talk about today.

Okay. So what I want you to take away from this application, I want you to take away three things, but two are on the slide. The first is that there are a growing number of these large imprecise datasets. I showed you three examples so far. The second is that in many of these datasets there's value in rich structured SQL-like querying on top of them, or at least rich structured querying.

The third takeaway is that the challenge we're going to have to deal with is processing these things efficiently. And we'll get into a little bit more detail of why this is actually a very serious challenge, but this is the challenge we're going to address in the talk. Please.

>>: [inaudible] let me ask why you think doing this stuff in a relational database is right. Let me give you an example.

>> Christopher Re: Sure.

>>: As I understand your examples, you seem to be putting marginals into your database, the probability of being in Room 422 given everything else, the probability that this person likes this music [inaudible] and everything else. Well, in order to do the structured queries correctly, you have to go back to the joint that the marginals were computed from and then do sort of inference, or essentially do summing in different ways. It seems like you've destroyed a lot of the information you might need to compute [inaudible] exactly -- it's kind of like the same problem when you compute query plans in a database query itself: you only have certain marginal counts of the individual columns, so you don't have the relationships between them. Why is [inaudible]?

>> Christopher Re: [inaudible] it's a great point. So the way I would state your point is I haven't shown you the model yet, the probabilistic model yet, but you are correct that it's a very simplified model compared to something that you would find in the graphical model literature or the statistics literature that can model the kind of rich correlations you're talking about.

The spin that we're putting on this, or what we want to do is we want to model relatively simple correlations and focus on the query processing. So the query processing will be correct. That will be the contract. The question you ask is how well does the model, for example, information extraction model, match the model of the contract I'm going to provide for you in a few slides.

And I'll show you that model in a little bit and we can have a more technical discussion then, but the point that I want to make is that there are a lot of applications where you can do very interesting things inside the simplified model. So you're absolutely right, there's some information loss going from the rich information extractor model, CRF model, into the simplified model I'm going to tell you about with just marginals.

But it's unclear to me that that -- modeling those extra correlations will buy you all that much for the kinds of queries we're looking at. So once I get to the end and talk about this project and also talk about another project, I'd be more than happy to have a debate about where the richness-of-model question falls in these systems. But it's an excellent point and it's something we're very much aware of.

But, yeah, so what I'm going to focus on for the rest of the -- for at least the next little few slides is what a generic system would look like and the fundamentals that we have to do to do the processing, assuming you've gotten the data into the model that I'll tell you about in maybe five slides.

So let me just give you -- to be a little bit more concrete, let me just give you a picture of what a generic system would look like. So we'll go back to that information extraction example where we have the probabilities and we have the original data. Maybe this data itself is actually imprecise. Now, the way you would use this with our system, a system like MystiQ, is that you would take the data and load it into a standard database.

Now, the problem with doing this loading is that the database is not aware of the semantics, it's not aware of that special P column over there, this probability column.

So what we're going to propose is that you'll use MystiQ, our system, to query the data, to interpret and understand this probability column and process the queries correctly.

So here's a screen shot of MystiQ. What it gives you is an interface where you can write SQL queries and get back some results. Now, the query that's actually written here is exactly that query that we've been informally discussing so far in the talk: find papers in VLDB written by authors at the University of Washington. The point is that you can imagine writing almost arbitrary SQL here and getting back answers. That's the kind of interface we want to provide to this dataset.

Now, how do we use those probabilities, those marginal probabilities? We use them to rank the answers, the answers that come back from the database. So each tuple now is not precise, is not simply in the answer or not in the answer. It's only in the answer with some score, some probability: the marginal probability that it appears in the answer. And we're going to pick to give back to the user the most important of those marginal probabilities, or the highest-valued probabilities.

And in this way we've traded the fact that we didn't get any answers, we've increased the recall at the expense of some precision here. So that's how we're getting back more answers now.

Now, this would also point out why ranking is such a fundamental thing, because we're always going to use these things to rank them, and so we'll see in a little bit about how to compute the top ones efficiently. Okay.

All right. So now I've gone through the motivation, now I'm going to talk a little bit about actually relational probabilistic data. This is the kind of data that I've been showing you so far in the talk, and the fundamental techniques you need to scale up to gigabytes of data.

So the model that we use is a very simple model. It goes back to at least 1992, Barbará [phonetic] et al. And the way I read this database is that it's a database that contains two entities. There's an entity, PDBs, that was for sure published in 2008, but we're uncertain about whether it was published in VLDB or whether it was published in SIGMOD. It was published in one of the two, but we don't know which.

The extent to which we believe that it was published in VLDB is exactly 0.62. This is the -- since it's a high score, we think it's more likely that it's published in VLDB than it was published in SIGMOD. Independently from that, there's a second entity in the data, uncertain DBs, which was published in 1992. And you see here that there are three alternatives for this database. And above and below the line, we're going to assume these are independent, for example, if they're independent extractions. So this is a very simplified model of the database. Please.

>>: [inaudible]

>> Christopher Re: Yes. So I'll get back to that. So we don't insist that the probabilities add up to one; they add up to at most one. And I'll show you on the next slide. Essentially when they don't add up to one, the remainder is the probability, you can think of it, that it was a bogus extraction. It's always just a simplified thing in the model.

So that P is a special thing that's just basically these marginal probabilities, but now I'm going to show you what it actually means.

So what it means is the standard thing. It's a distribution over possible worlds, or just a discrete probability distribution. So in the entity, as Raga [phonetic] correctly pointed out, we allow the possibility that, for example, PDBs was not present at all, so there are three alternatives for PDBs. Similarly, there are four alternatives for uncertain DBs, and since all combinations are possible, there are 12 possible worlds here corresponding to all the choices we could make.

So each one of those choices we'll refer to as a possible world, and here's one possible world: PDBs, we chose that it's in VLDB, and uncertain DBs, we chose that it's in ICDE.

The weight of that world we compute because we've assumed independence, we simply multiply the probabilities, the marginal probabilities. Now, in this way, we can get a distribution over all 12 such possible worlds. So every choice.

For example, there was a world I just skipped over (click twice) where, if no alternative is present, we use the one-minus complement. Okay. So the point of this is that very easily we can think about this database as representing exactly this distribution. So it's written down exponentially more succinctly, but this is really what the data means; it's this collection of possible worlds with some associated marginal probabilities.

Now, the nice thing about this semantic is it's well founded, so it immediately gives us a semantic for any kind of query, any kind of Boolean query. And the semantic for a Boolean query, or, more concretely, that score that comes back with a tuple in MystiQ, is exactly the marginal probability that something is true.

So, for example, if we wanted to process the query that PDBs was in VLDB and uncertain databases was not in PODS, we would go back to those 12 worlds, semantically anyway, and check on each one; does it satisfy the query, is the query true on this world.

So on the first world the query is true, so we're going to keep it. Similarly, on the second world it's true, so we'll keep that as well. On the third world it's not true because the second database, uncertain DBs, is in PODS, and we'll throw it away.

So in this way we can, again, semantically, anyway, go through all 12 possible worlds and compute exactly the marginal probability that this query is true. So it gives us a nice well-founded semantic.

Now, if you're clever and you looked at this, you could say, well, you told me it was independent, you could just compute them directly from those marginal probabilities over there, multiply P1 times 1 minus P5 and be done. And what this points out is that we don't always have to go through this exponential blowup to compute those probabilities, and of course that's what we're going to want to do. Okay. Whenever possible, we're not going to want to go through this expensive process of going through all the exponentially many worlds to compute a query answer.
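
To spell out this semantics, here is a small sketch that enumerates the possible worlds of a toy block-independent database and computes a query's marginal probability both by summing over worlds and by the direct shortcut of multiplying marginals. The 0.62 is the number from the slide; the other probabilities are made up for illustration.

```python
from itertools import product

# Each block is one entity: a list of (alternative, probability) pairs.  If the
# probabilities sum to less than 1, the remainder is the chance that no
# alternative is present (the "bogus extraction" case).
pdbs         = [("VLDB", 0.62), ("SIGMOD", 0.30)]                 # remaining 0.08: absent
uncertain_db = [("ICDE", 0.40), ("PODS", 0.30), ("VLDB", 0.20)]   # remaining 0.10: absent

def alternatives(block):
    return block + [(None, 1.0 - sum(p for _, p in block))]

total = 0.0
for (v1, p1), (v2, p2) in product(alternatives(pdbs), alternatives(uncertain_db)):
    # Boolean query: PDBs is in VLDB AND uncertain DBs is not in PODS.
    if v1 == "VLDB" and v2 != "PODS":
        total += p1 * p2                  # weight of this possible world

print(total)                              # marginal probability of the query

# Because the two blocks are independent, the same number (up to rounding)
# comes out of the direct computation on the marginals, with no enumeration:
print(0.62 * (1 - 0.30))
```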

So I'll go through there. Okay. So at this point you may be thinking, well, maybe the queries are simple and we'll always be able to compute it by just multiplying probabilities because, you know, the model is so simple and maybe you think SQL queries are so simple. But in fact it turns out that it's #P-hard or NP-hard to compute these queries. So even though the model is very simple, for very small sizes of queries, actually it turns out to be intractable, theoretically anyway, to compute queries on this.

So for the experts in the room, what I mean is that for conjunctive queries (when I say SQL, I mean conjunctive queries) the data complexity is #P-hard for these queries. So they are theoretically quite difficult to deal with, even in this already simplified model.

Now, you may be thinking, well, we're sunk at this point because it's just as hard as inference, it's NP-hard or #P-hard, what have we really bought by this model, but it turns out that actually the queries that we want to evaluate can be well approximated by sampling. Okay.

So for the experts in the room, what I mean is that there's an FPTRAS, a fully polynomial-time randomized approximation scheme, which can approximate those scores to any desired accuracy. So whenever a tuple comes back, there's some sampling routine that's guaranteed, in theory, to be efficient to compute those scores.
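
The actual FPTRAS guarantee comes from a Karp-Luby-style estimator on the lineage (DNF) formula of an answer; purely to fix ideas, here is a naive Monte Carlo version with the same interface, run on a hypothetical lineage formula over independent tuples with made-up probabilities.

```python
import random

# Marginal probabilities of the base tuples this answer depends on (made up).
prob = {"p1": 0.62, "q1": 0.5, "q2": 0.7}

def lineage(world):
    """Lineage of one answer tuple: p1 AND (q1 OR q2)."""
    return world["p1"] and (world["q1"] or world["q2"])

def estimate(n_samples=100_000):
    """Naive Monte Carlo: sample worlds tuple by tuple (independently) and
    count how often the lineage formula comes out true."""
    hits = 0
    for _ in range(n_samples):
        world = {t: random.random() < p for t, p in prob.items()}
        hits += lineage(world)
    return hits / n_samples

print(estimate())        # roughly 0.62 * (1 - 0.5 * 0.3) = 0.527
```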

Now, it's efficient in theory, but actually in practice it's about one to two orders of magnitude slower than standard SQL processing. And this is one of the things that we're going to try and address in this talk. We're going to talk about how to mitigate the expense of the sampling so that we can run almost as fast as standard relational processing. That would be the gold standard; that's what we're after. But we're up against something hard here, right? Theoretically these queries are much harder to evaluate than standard relational queries.

Okay. So now I can talk about my dissertation contributions. Essentially the MystiQ system that I just showed you is the first large-scale probabilistic relational database. And prior art was dealing with hundreds or tens of thousands of tuples with limited queries. But in this work what I'm going to show you is how we can handle rich structured queries on billions of tuples. And the number is meant to be shocking, because even SQL databases will have trouble with billions of tuples. Right? So the fact that we can deal with this on probabilistic databases means that we're doing something interesting here. Please.

>>: [inaudible] tuples, do they rely on top-k processing for a small number of K?

>> Christopher Re: So we're going to see in a second, when I give you the technical contributions, that we're going to have basically two main technical things that I'll talk about to get there. One is going to be top-k processing. So if your query is only looking for those top five or ten answers, we're going to leverage that to not run the simulation to compute all those probabilities. And I'll get into that in some detail.

The second thing we're going to worry about is precomputation, to scale up to really these billion-sized databases. And what's going to happen there is that we're going to use materialized views. And those are the two contributions. Another thing that we can use to scale up to these large datasets is compression.

So because it's a job talk I have to hit you over the head with something. What I'm going to hit you over the head with is the theme of my work. So what I'm going to talk about and I'm going to highlight with these little reddish and bluish colors here are that we're going to do some formal work for practical results.

And so the red box will show whenever there's a theory result and the blue box will show whenever there's a practical result, so it will be impossible to miss because I'll mention it again when it happens. So that should be fun for everyone to look for.

Okay. So let's get into those technical contributions I was talking about. Essentially for relational data there are four main parts of my dissertation for relational data. So for relational data, as Christian mentioned, there's this top-k processing that I'm going to get into some detail because it has nice pictures. Then I'm going to talk a little bit about materialized views, less nice pictures, but a little bit theoretically deeper.

And then there are these other two which are approximate lineage and aggregate processing, and due to time constraints and attention spans there's no way I'm going to go into them. But if you want to ask me about them, I'm more than happy to talk about them. So I'll try and give one-sentence summaries at the end.

So the first one I want to go into is this top-k processing. Let me motivate this problem for you. Remember we relaxed the database, right, we allowed more matches to our queries and we were going to score them. But when we allowed more matches, we allowed a lot more matches. And we only want to show the user, for example, the top five.

This is very natural. Think about Google. You enter in a keyword query, there are billions of hits or billions of related documents, and you really only want that top five or ten to look at. So top-k is a natural thing to study here.

So what we're going to do is try and find that top-k efficiently. Because we have this very, very expensive sampling routine that we want to optimize.

So to state the result, what I want to do is give you a very simplified model of sampling that actually holds up well in practice, okay, so we can -- how I want you to think about sampling.

So the way to think about sampling is that we want to compute for each one of those papers that comes back that marginal probability, that score. And that score, as I'm denoting with P here, sits somewhere between zero and 1. Now, we don't know where P is. If we knew where P was, we'd be done; we could just sort them and rank them by probability. The game is going to be to try and find out where P lives.

Now, initially, P lives somewhere between zero and 1. And how I'm going to illustrate that graphically is an interval that spans the entire distance between zero and 1. Now, the way I want you to think about sampling is that we're going to run iterations, sampling iterations, and our uncertainty is going to shrink. So as we run an increasing number of simulations, that sampling algorithm I was telling you about, our uncertainty about where P lives is getting smaller.

One other thing about the way we're going to model sampling is that in fact the intervals will be nested. Now, if you're an expert in Monte Carlo sampling, you know this doesn't actually hold for general Monte Carlo sampling. The intervals are not nested like this. But we show that in fact you can view them as being nested with just a little bit of slop. So the mental model I want you to have is I run more simulations and these intervals shrink in a nested way. They always contain the true probability.
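
Here is a minimal sketch of that mental model: run more Monte Carlo iterations and maintain a Hoeffding-style confidence interval around the unknown probability, intersecting with the previous interval so the sequence is nested (a simplification of the "little bit of slop" argument from the talk). The true probability, the bound, and the confidence level are my choices purely for illustration.

```python
import math
import random

TRUE_P = 0.42      # unknown in practice; used here only to drive the sampler
DELTA = 0.01       # failure probability for the Hoeffding bound

def simulate(n_so_far, hits_so_far, extra):
    """Run `extra` more samples and return the updated (n, hits, interval)."""
    hits = hits_so_far + sum(random.random() < TRUE_P for _ in range(extra))
    n = n_so_far + extra
    eps = math.sqrt(math.log(2 / DELTA) / (2 * n))      # Hoeffding half-width
    return n, hits, (max(0.0, hits / n - eps), min(1.0, hits / n + eps))

n, hits, (lo, hi) = 0, 0, (0.0, 1.0)
for _ in range(6):
    n, hits, (new_lo, new_hi) = simulate(n, hits, extra=200)
    lo, hi = max(lo, new_lo), min(hi, new_hi)           # force the intervals to be nested
    print(f"after {n:5d} samples: [{lo:.3f}, {hi:.3f}]")
```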

So let's consider: what's a naive method for top-k? Well, let's say I want to compute the top two papers that were in VLDB 2008 by authors at the University of Washington. So here are some paper titles that would be returned by sort of the standard relational answers, and what I want to compute are their scores, those P values.

So initially when the database returns, I have no idea where P lives, so every tuple has an interval that spans all of zero and 1. No idea where P is for any tuple. Now, naively what I could do is simulate each interval until it's small, right, spend a bunch of simulations until they're all small, and then it becomes very easy to rank them because they become very small and I can get the top 2. But notice that I wasted a bunch of effort computing precisely the probabilities of 4 and 3, really reducing their intervals when all I was after were the top two.

So the question that I'm going to ask now is can we do better. And of course this is a job talk and it would make for quite an awkward job talk if the answer were no. So I'll spoil the surprise; the answer is actually yes.

So what does better look like? The picture for better looks like this. We'll simulate the top one until it's small again, but we won't waste much time simulating the second one and the fourth one. We'll let them be large. And we'll let them be just small enough that we can prove that we found, for example, the top two. And here we will have saved lots of work, and we saved lots of work because we didn't reduce either of those two intervals.

So this is the kind of mental model that we want for the multisimulation algorithm.

So now I'm going to show you an algorithm that can achieve -- can try and achieve a picture that looks more like this. Now, to get there, what we need is this notion that we call the critical region. The idea is that we want to have a region, so actually we'll just make this a dry definition. So the critical region for K equals 2 is graphically illustrated on the screen. It's the region between the Kth highest left endpoint and the K plus first highest right endpoint. So there's the right endpoint, the third highest right endpoint, and the second highest left endpoint. So this is just the critical region. Very dry, boring definition.

Now, the reason that we care about the critical region is because if we were to run those simulations, what it encodes is effectively whether or not we're done. Okay. So if we -- as you can see, if we run a bunch of simulations and in fact we've separated the top two from the rest, then it's always the case that the second highest left endpoint will be higher than the third highest right endpoint. So the critical region is just basically formalizing this region that we care about. When it's empty, we're done, we've found the top two.

Now, what we're going to use it for is that I'm going to now show you a series of pictures that look like this that are a picture of the critical region, and basically the algorithm is going to say which interval do I simulate next. I'm going to pick that interval, run a simulation, it's going to get a little bit smaller, and then we're going to look at more pictures. So that's the way the algorithm's going to be phrased: look at a picture, which interval do I simulate next. It's going to be a very simple algorithm.

So the first rule that we need is what we call always pick a double-crosser. So if there's a double-crosser, here's a double-crosser, that means it spans all the way across the critical region, then we're always going to pick this one. And the reason we're going to pick it is because, well, you can see that there could only possibly be one interval that would be above it, and we're looking for the top two, and because of the nested assumption, there's no way to get out from underneath it.

It's actually very easy to see that in fact one of the intervals that we need to simulate must intersect this critical region. So it's got to be overlapping with that guy. So we'll always have to simulate this one.

So this is the first one that we picked. The second case is if they're all one-sided crosses, either upper or lower crossers. So in this case, they're all lower crossers. Now, in this case, we'll simply pick one. Okay. Pick a maximum one. Now, the one we pick is this maximal one here and the reason that we pick it is that all these other intervals are contained underneath it. So no amount of work on these intervals here will allow us to sort of escape from this top maximal one. So we would always have to pick this one as well.

The third simple rule of the three is when there are both upper and lower crossers. And in this case we're going to pick them both. Now, those are the three rules. It's a very, very simple algorithm. One thing to notice, though, about this very simple algorithm is that in each case except for this last one it seemed like we had to do what we were doing. It was the optimal thing to do. But in this example here, it's different, because an optimal algorithm may need to pick only one of these two.
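
Putting those three rules together, here is a sketch of how I understand the multisimulation loop: keep an interval per candidate answer, compute the critical region for k, stop when it is empty, and otherwise choose which intervals to refine next by the rules just described. This is a reconstruction for illustration only; the precise crosser definitions, tie handling, and the two-approximation analysis are in the paper, and `refine` stands in for one round of the sampling routine on that candidate.

```python
def critical_region(intervals, k):
    """(c, d): the k-th highest left endpoint and the (k+1)-st highest right
    endpoint.  When c >= d the region is empty and the top k are separated."""
    lefts = sorted((a for a, _ in intervals), reverse=True)
    rights = sorted((b for _, b in intervals), reverse=True)
    return lefts[k - 1], rights[k]

def pick_next(intervals, k):
    """My reconstruction of the three picking rules described in the talk."""
    c, d = critical_region(intervals, k)
    double = [i for i, (a, b) in enumerate(intervals) if a < c and b > d]
    if double:                                    # rule 1: always pick a double-crosser
        return [double[0]]
    lowers = [i for i, (a, b) in enumerate(intervals) if a < c < b]   # cross the left edge
    uppers = [i for i, (a, b) in enumerate(intervals) if a < d < b]   # cross the right edge
    if lowers and uppers:                         # rule 3: both kinds present, pick one of each
        return [lowers[0], uppers[0]]
    # rule 2: all crossers are one-sided; pick a maximal one.  If nothing crosses
    # an edge yet (e.g. every interval is still [0, 1]), fall back to the widest
    # interval overlapping the region.
    candidates = lowers or uppers or [i for i, (a, b) in enumerate(intervals)
                                      if b >= c and a <= d]
    return [max(candidates, key=lambda i: intervals[i][1] - intervals[i][0])]

def multisimulation_topk(intervals, k, refine):
    """`refine(i, interval)` runs one more batch of sampling for candidate i and
    returns a smaller, nested interval around its true probability."""
    while True:
        c, d = critical_region(intervals, k)
        if c >= d:                                # critical region empty: we are done
            break
        for i in pick_next(intervals, k):
            intervals[i] = refine(i, intervals[i])
    # rank by the (now well-separated) lower endpoints and return the top k indices
    return sorted(range(len(intervals)), key=lambda i: intervals[i][0], reverse=True)[:k]
```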

So what I'm getting at is that actually even for this very simple algorithm we can say something interesting about its performance, theoretically and practically. So in theory that algorithm I just showed you, under some technical assumptions, on any input, those three rules give a two approximation, meaning that the algorithm does at most twice as many iterations as any other algorithm for the problem. So it's something interesting we can say.

The second point is that actually no deterministic algorithm can always do better on every instance. So in some sense that very simple rules are actually an optimal set of rules for this problem.

Now, theory's great, but there's the blue color practice, and in practice remember that we wanted to get about two orders of magnitude improvement so we could get close from, in the marketing speak, inference speed to SQL speed.

So now I want to show you a graph that will explain sort of how this thing is so wonderful and is able to go from very large running times to very small running times.

So here's the practical performance. So what you see on your screen here is a graph where we're comparing the naive method against the multisimulation method. On the Y axis is time, on a log scale.

Now, the experiment we ran was actually kind of fun. We went and got -- downloaded the IMDB database when it was still public, and then we went to the Web and downloaded a bunch of movie reviews. Now, we took these movie reviews and we categorized them as positive or negative. We did some sentiment analysis on them. We extracted the titles from the movie reviews as well, and then we joined them back to the IMDB database using a fuzzy join. Then we got to execute nice, rich, structured queries over this data.

So one example of a rich structured query would be: tell me years in which Anthony Hopkins was in a highly rated movie. And you can see that we would have to join to get the highly rated movie, we would have to touch both the fuzzy join and the sentiment analysis. And also do some relational, some structured processing on this query.

Now, the point to take away from here is that you see the query execution time. It's illustrated by the blue part. It's relatively insignificant [inaudible] in our approach as the naive approach. The real difference comes in the red bar. That's the sampling time. And you see that the sampling time for the naive method is on the order of a thousand seconds while we're on the order of about ten seconds. So we have substantial orders of magnitude improvement by just looking for that top ten or top -- it was top ten in this example. Please.

>>: [inaudible] of the problem is that certain tuples are more interesting than uncertain tuples. Is that always the case?

>> Christopher Re: No, it's certainly not always the case. You can imagine sort of debugging applications where you're looking for rare events; especially in aggregation queries, you may be interested in things that are very infrequent.

The model here, though, is that we want to get as close to the certain answers as possible, because we want to get as close to the standard relational semantics. Ideally in this system, we should get out of the way. So if your database were certain, then it should return exactly the same tuples, and motivated by that desiderata, ranking them by the most certain is the most natural thing to do. But it's certainly not an exhaustive set of semantics. Like you could certainly imagine looking for the rare events for something like this as well.

>>: Question.

>> Christopher Re: Please.

>>: So you mentioned [inaudible].

>> Christopher Re: Right.

>>: [inaudible] sum average, that sort of optimization may not [inaudible].

>> Christopher Re: For sum and average queries, there's actually a very interesting technical difference between them and the queries we're looking at. Here we're looking at these Boolean queries essentially. For each year did Anthony Hopkins have a highly rated movie. That's a single tuple. And then each year is its own individual tuple. That's the kind of queries we're running here. So admittedly they're not the OLAP-style count-sum-average queries.

One contribution that I won't have time to talk about is classifying exactly those kinds of queries, which we call HAVING queries. And it turns out that they have a much more interesting landscape than the queries that I'm presenting here. In fact, this sampling algorithm doesn't always exist for them, and it requires nontrivial effort to even get that sampling algorithm to run. But that is definitely an open area and it's listed on my future work slides as some of the ways to attack it. I'm happy to tell you my ideas, but we haven't done anything yet. Yes.

>>: [inaudible] nice to have some kind of index [inaudible].

>> Christopher Re: So we leverage actually here -- so I didn't mention this, but since our tuples are stored in a standard relational database, we have access to all the persistence, durability, and indexing that comes with the standard database.

>>: [inaudible]

>> Christopher Re: It's certainly true that we will have a lot more answers than over a standard relational database. But we do benefit from indexing and the actual relational part of the processing. What I think you're proposing is a more interesting scheme where you've also indexed some probability information and you know that you don't have to retrieve, for example, all the possible answers because you can put on some upper bound.

In this work we didn't actually consider that, but for the streaming databases, those RFID databases, we actually have an upcoming ICDE paper that has a very similar observation for top-k processing there. But we never applied it to relational processing.

That's a good point. Other questions? Okay.

So we saw here that there was about a two order of magnitude difference between our approach and the naive approach. And the point I want to emphasize of course what was hiding behind that white space is that it can be much, much worse. So here's a much richer query on the same data, tell me directors with highly rated movies in the '80s, and there are a lot more directors than there are years that Anthony Hopkins was in a movie, so our approach is correspondingly much better.

There are a lot more tuples to sort through. The naive approach has to sort through all of them until they're small, but we can focus in on that top very quickly. And in fact in this example the naive didn't terminate. And this is actually not uncommon that we basically go from an inference procedure that can't terminate to something that can terminate as long as we have this top-k constraint on them.

So at least for top-k queries, we're in pretty good shape, we can process these about as fast as relational queries. I mean, clearly not as fast, comparing the blue bar to the red bar, but, you know, it's in the same ballpark.

So I've told you about top-k processing. What I'm now going to tell you about is another aggressive strategy that really allows us to scale up to huge datasets that we call materialized views.

So I'm sure everyone knows this, but materialized views are just the database way that people -- the way that database people think about caching. So let me put my spin on it.

So what we want to do while caching things is we want to cache or compute something up front and then use that work that we've done up front to optimize, for example, later queries or to save ourselves later work.

Now, the particularly nice spin that this takes in databases is that we know precisely how that data is cached. The materialized view gives us a view description, right, which is just a query that tells us, for example, exactly how this database was generated. So here we say conference with papers from U-Dub, and it generates exactly this table. So we're going to leverage that logical view description.

And what we'll do is we'll then, you know, use this in a standard relational engine, we'll use this to optimize future queries.

Now, in probabilistic databases, things get a little bit more interesting. And the reason they get more interesting is that these tuples have a richer semantics than standard deterministic tuples. They're more akin to things like random variables. So there's some probability that the tuple is present.

And what that means is that, as everyone knows, random variables can be correlated. So these tuples may be correlated. For example, the VLDB tuples may be more likely to co-occur together or less likely to co-occur together than simply independent tuples, for example. So there's some correlations.

Now, the way that these correlations have been tracked, you know, going back to the C tables and now with the lineage approach promoted by Trio, is with these formulas -- these lineage formulas which are written here, the lambdas. The way I want you to think about lambda, these lineage formulas, is that they're Boolean formulas that are true precisely when this tuple is true. So, for example, VLDB 2008 is present precisely when some tuple P1 is present and either Q1 or Q2 is present.

And if you want to know how to compute these, I won't go into semantics, but it basically records all derivations of tuples in the view.
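
To make the lineage idea concrete, here is a toy sketch (my own schematic relations, not the ones on the slide) that records, for each view tuple produced by a join, the Boolean formula over base tuples that makes it appear, and then checks a view tuple against one possible world.

```python
from itertools import product

# Base tuples, each with an identifier we can refer to in lineage formulas.
authors = [("p1", "Dan", "UW"), ("p2", "Chris", "UW")]
papers  = [("q1", "Dan", "VLDB 2008"), ("q2", "Chris", "VLDB 2008"), ("q3", "Dan", "SIGMOD 2008")]

# View: conferences with a paper by a UW author.  The lineage of a view tuple is
# an OR over all of its derivations, each derivation an AND of the base tuples used.
view = {}
for (aid, name, inst), (pid, author, conf) in product(authors, papers):
    if inst == "UW" and name == author:
        view.setdefault(conf, []).append((aid, pid))

for conf, derivations in view.items():
    formula = " OR ".join(f"({a} AND {p})" for a, p in derivations)
    print(conf, "->", formula)
# VLDB 2008   -> (p1 AND q1) OR (p2 AND q2)
# SIGMOD 2008 -> (p1 AND q3)      (correlated with VLDB 2008 via the shared p1)

def present(derivations, world):
    """Is the view tuple present in this possible world (a set of base tuple ids)?"""
    return any(a in world and p in world for a, p in derivations)

print(present(view["VLDB 2008"], world={"p1", "q1", "q3"}))   # True
```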

Now, observe here also, I just want to make one last point, observe that in fact the VLDB tuples are correlated, and the way they're correlated is because both of them have this Q2; they both rely on the same tuple underneath, which means that they're not independent. So there are some correlations in the view.

Now, what we want to do for probabilistic materialized views is compute something up front and again use it for later query processing. So one naive approach would be to compute exactly this table here and then use that to optimize future queries. Do all the inference, use all the lineage formulas and so on.

But this doesn't solve the bottleneck. The bottleneck in probabilistic query processing is not actually computing this formula, it's actually doing the inference. And this does nothing to help us to compute the inference. And let me illustrate this dramatically with a graph in a second.

So here's the graph. So again we have -- we have time in log scale on the Y axis, and what happened here is we took the TPC data and we injected it with probabilities in all kinds of different ways. And the results are actually pretty robust depending on how you inject the probabilities, but we made the TPC dataset probabilistic.

On the X axis we tried different scale factors. You can see that the scale factor 1 is about approximately 1 gig. It's actually closer to 2 gigs after you inject all these probabilities.

Then what we did is we took a query out of this dataset and we tried to evaluate it in three different ways. The first way we tried to evaluate it was with no view at all. This is basically just running the MystiQ algorithm I told you about before and having it return. And you see that here, that's the blue bar on the side. And this scales with the size of the data.

The second approach in red is the naive approach. So here what we did is we took that and we just computed that table that had the lineage formulas, but we didn't do anything else aggressive.

The point to take away is that it, too, grows with the size of the data. It's slightly smaller than the blue line, blue bar, but this is eaten up by the log scale. It is a little bit faster, but not too much faster.

What I'm going to tell you about next in the next few slides is our approach, which is at the bottom, which performs considerably better. You see that its response time is almost constant across all the different scale sizes, and it's because we're cheating, we're actually doing the same thing that materialized views would be doing. And that's what I'm going to tell you about for the next couple of slides.

Okay. So the idea behind representable views or these materialized views is kind of a crazy one. What we're going to do is we're going to toss away those lineage formulas. And to get an intuition about why this is so crazy, these things are essentially C tables from the '80s.

So since the '80s when people were worried about C tables and query processing, to the '90s when Fuhr and Rölleke were doing their probabilistic IR, to 2000 when Trio and MystiQ were computing these lineage formulas, everyone's been pathologically keeping these formulas around. And what we're going to do is we're just going to toss them out the window.

And when I say we're going to toss them out the window, I mean we're going to compute marginal probabilities and we're going to pretend that the tuples in the database actually aren't correlated. So we're going to try and answer queries using the table on the right as opposed to the table on the left.

Now, of course there's a catch. Those lineage formulas were doing something. And what they were doing was telling us about correlations. So what we have to prove, the analysis I talked about, is that evaluating a query on the table on the left is the same as evaluating it on the table on the right. And the idea is that that table is much smaller, much simpler, and so we will be able to get the kind of performance we saw in the last graph.

Now, you may be asking at this point how can this possibly be, right? How is it possible that there are some independencies that hold in this table, right, that we can take advantage of? And let me just show you that in this cooked-up example you see that in fact there are no tuples that are shared between these VLDB tuples and the SIGMOD tuples.

So what I mean is, for the experts, that we want a static analysis on that view description that I said was so important earlier to prove that no matter what happens in the output, these large-scale independencies will hold. And the intuition is that if a query then comes along and only touches one tuple from each of these bins or these independent blocks, then its value will actually be correct because it's only computing with independent tuples, and that's what we're assuming on the right-hand side.
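
The real test is a static analysis on the view definition, but the property it guarantees can be illustrated at the data level. Here is a toy check (hypothetical, and not the actual Pi2P test) that looks at the lineage of a materialized view instance and verifies that tuples in different blocks share no base tuples, which is what would let us drop the lineage and keep only the marginals.

```python
def blocks_independent(view_lineage, block_of):
    """view_lineage: view tuple -> set of base-tuple ids it depends on.
    block_of:       view tuple -> the block (entity key) it belongs to.
    Returns True if no base tuple is shared across two different blocks, so
    treating the blocks as independent is safe for this particular instance."""
    seen = {}                                     # base tuple id -> block that used it
    for t, base_ids in view_lineage.items():
        for b in base_ids:
            if b in seen and seen[b] != block_of[t]:
                return False                      # a base tuple is shared across blocks
            seen[b] = block_of[t]
    return True

lineage = {
    ("VLDB", "paperA"):   {"p1", "q1"},
    ("VLDB", "paperB"):   {"p1", "q2"},           # shares p1 with paperA: same block, fine
    ("SIGMOD", "paperC"): {"p2", "q3"},
}
block = {t: t[0] for t in lineage}                # block the view tuples by conference, say
print(blocks_independent(lineage, block))         # True: VLDB and SIGMOD blocks are independent
```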

So we want to do exactly the static analysis of views. Now, just a little bit of terminology. If this holds for any conjunctive query Q, any query can be used in this way on top of it, I'm going to call that view representable. And the naming comes from the fact that that means that essentially this tuple on the -- this table on the right-hand side looks exactly like that simplified representation I showed you before. That's representable inside this BID system.

Okay. So this gets us to the key theoretical question we need to answer, which is: given a view description V, is its output going to be representable for any input database? And for the experts in the room, what this is is actually a closure test for C tables.

Now, depending on your persuasion, this problem is either wildly intractable or totally trivial. If you're theoretically minded, you may look at this question and say it's a property of an infinite number of databases, an unbounded number of databases. It's not even clear to me that it's decidable, because we're just looking at the property of the view description.

Now, it turns out since V is a conjunctive query it has what's called a small model property, and so in fact it is decidable. But it is wildly intractable. It is complete for this class Pi2P, which is NP-hard. So the theoretician's right in some respects. This is very difficult.

But the practitioner says on the other hand all these things about query containment, you told me these things were very, very difficult and very hard. But in practice actually we have heuristics that seem to work very, very well and we don't ever have to resort to the complicated theory.

So what we did here is come up with also an approximate test which is sound but not complete. And actually we can prove that in most practical cases it actually is complete.

So the practitioner is right as well. Actually in practice, we can solve this question using a simple test based on unification. Very similar to how containment is solved.

Now, a second thing I want to tell you is that a representable view is one for which every query gets the right value, but it also makes sense to consider the pair, the query and the view together.

And what we call this is a partially representable view, when only some queries are safe, if you like, on top of it, or easy to evaluate. And we also have a complete theory of partially representable views, which is obviously a superset of the theory I told you about before.

And for the experts in the room, the key idea here is that we're generalizing critical tuples -- what can be thought of as Boolean derivatives -- and that generalization gives us essentially both of these tests. And we have a full theory here of the complexity and of polynomial-time approximations. Please.
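For a rough sense of the starting notion, here is a tiny Python sketch of the standard definition of a critical tuple: a base tuple is critical for a Boolean query on an instance if deleting it flips the query's answer. The query and instance below are made up for illustration, and the sketch does not attempt the generalization described in the talk.

    def is_critical(query, instance, tuple_):
        """query: a function from a set of tuples to True/False (a Boolean query).
        A tuple is critical if deleting it changes the query's answer."""
        return query(instance) != query(instance - {tuple_})

    # Made-up example: Boolean query "is there an author with a VLDB paper?"
    instance = {("author", "Smith", "p1"), ("paper", "p1", "VLDB"),
                ("paper", "p2", "SIGMOD")}
    q = lambda I: any(a[2] == p[1] and p[2] == "VLDB"
                      for a in I if a[0] == "author"
                      for p in I if p[0] == "paper")

    print(is_critical(q, instance, ("paper", "p1", "VLDB")))    # True
    print(is_critical(q, instance, ("paper", "p2", "SIGMOD")))  # False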

>>: Seems that what is important is not only to find out if a particular view is representable or not; it's also to find out how often a random view is [inaudible]. If most of the views are not representable, then there's no [inaudible] defining [inaudible].

>> Christopher Re: That's an excellent point. So one of the highlights of the partially representable theory is that every view is partially representable. It's not true that all views are representable, and I didn't include the graph here, but it's in the paper.

Actually it turns out that most real-life views are. For example, I took the views that the company iLike was computing. They weren't using probabilistic databases, but you could very easily see where the imprecision was in their views; they were combining similarities and so on. I think something like 80 percent of their views were actually representable, or could benefit from these techniques, so there was some amount of work that could be saved.

But it is true that it's not like materialized views for standard relational databases, where you can materialize anything you want and the worst case is that it's not updatable or not easy to maintain. Here you can't actually materialize every view; it's a little bit stricter test, because those correlations are messy to deal with. When you can, though, it's extremely efficient: you just toss the lineage out the window.

So that's a good point. So, back to the practical side -- we've gone down a little bit of a theory hole here -- oh, please.

>>: [inaudible]

>> Christopher Re: We have considered updates. I was just mentioning that that's where the materialized view stuff starts to get very interesting -- selection and updating. But we faced an even more basic problem here for probabilistic views; that's what I was trying to emphasize.

So we went down a little bit of a theory hole here -- a long digression about Pi2P hardness and all these things -- but remember, we were motivated by a very practical goal. We wanted to get that graph I showed at the beginning where everything was flat and constant. And that's exactly what's going on here: that view was representable, and the queries on top of it could then be processed much more efficiently. So that was the idea here. Please.

>>: Is there some connection between the present [inaudible]?

>> Christopher Re: So they're actually orthogonal. There are examples in all the categories you can think of: safe but not representable, representable but not safe, safe and representable, not safe and not representable. So they really are orthogonal concepts.

One interesting connection with safety, since you've brought it up, is that you can have a query which is unsafe, but after you materialize some subquery as a view, the remaining part of the query -- answering it on top of that view -- is safe.

Now, why is that so helpful? Well, you could imagine computing that materialized view offline, doing all the hard work offline, and then your online query processing is very cheap. So they are complementary techniques that can be combined to really build up large-scale performance. That's an excellent point.

Okay. So let me summarize the technical contributions I've told you about and give one-sentence summaries of the other two. I told you about top-k processing. Here we had the theory -- the red and blue are back again, with the hammer on top in case the red and blue were understated. The red says that there is a 2-approximation, right?

So there was a theoretical 2-approximation that we showed for the multisimulation.

The practical impact was that we were about two orders of magnitude faster whenever we could process these top-k queries. So for that special case, we were able to get close to relational speeds.
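Here is a minimal sketch, in Python, of the flavor of interval-based multisimulation; it is my own simplification, not the system's algorithm. Each candidate answer keeps a Monte Carlo confidence interval, and simulation effort goes only to the candidates whose intervals still straddle the boundary between the top k and the rest. The Hoeffding-style interval width, the batch size, and the example samplers are arbitrary choices for the sketch.

    import math, random

    def multisimulation_topk(candidates, k, batch=100, delta=0.05, max_rounds=200):
        """candidates: dict name -> sampler(), where sampler() returns True/False
        (one Monte Carlo trial of that answer's event). Returns an approximate top-k."""
        stats = {c: [0, 0] for c in candidates}          # successes, trials

        def interval(c):
            s, n = stats[c]
            if n == 0:
                return 0.0, 1.0
            eps = math.sqrt(math.log(2.0 / delta) / (2 * n))   # Hoeffding half-width
            p = s / n
            return max(0.0, p - eps), min(1.0, p + eps)

        for _ in range(max_rounds):
            ivs = {c: interval(c) for c in candidates}
            order = sorted(candidates, key=lambda c: ivs[c][0], reverse=True)
            top, rest = order[:k], order[k:]
            # Done when every top candidate's lower bound beats every other's upper bound.
            if all(ivs[t][0] >= ivs[r][1] for t in top for r in rest):
                return top
            # Otherwise refine the "critical" candidates: weakest in top, strongest in rest.
            crit = [min(top, key=lambda c: ivs[c][0]),
                    max(rest, key=lambda c: ivs[c][1])] if rest else top
            for c in crit:
                for _ in range(batch):
                    stats[c][0] += candidates[c]()
                    stats[c][1] += 1
        return sorted(candidates, key=lambda c: interval(c)[0], reverse=True)[:k]

    # Made-up usage: each candidate's event is a coin with an unknown bias.
    true_p = {"a": 0.9, "b": 0.5, "c": 0.45}
    samplers = {c: (lambda p=p: random.random() < p) for c, p in true_p.items()}
    print(multisimulation_topk(samplers, k=2))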

For the materialized view problem, we solved this closure problem. In practice what this allowed us to do was materialize these views so we could scale to gigabytes and gigabytes of data. Please.

>>: [inaudible] conjunctive queries top-k [inaudible].

>> Christopher Re: Yeah. So I'm addicted to them. Conjunctive queries will make an appearance everywhere. Yeah. Exactly right.

The third contribution, which I didn't have time to tell you about but I'll summarize in one line, is this idea of approximate lineage. The idea here is that rather than tossing away all that lineage like we would for a materialized view, let's keep some of it around.

And the art of this work was that we're going to keep around just the important parts.

And what we proved is that there is always a small important set we can keep around that gets us epsilon-close -- that is always a good approximation.
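As a toy illustration of that epsilon-close idea -- not the dissertation's algorithm -- suppose a tuple's lineage is an OR of terms over disjoint independent variables (a simplifying assumption). Dropping a set of terms can change the tuple's probability by at most the sum of the dropped terms' probabilities, so we can greedily drop the lightest terms while that sum stays below epsilon. The lineage values below are made up.

    def sufficient_lineage(terms, eps):
        """terms: list of (term_id, probability) for a lineage that is an OR of
        terms over disjoint independent variables.  Drop the lightest terms
        while the dropped probability mass (a union bound on the error)
        stays at most eps."""
        kept = sorted(terms, key=lambda t: t[1], reverse=True)
        dropped_mass = 0.0
        while kept and dropped_mass + kept[-1][1] <= eps:
            dropped_mass += kept.pop()[1]
        return kept

    def prob_of_or(terms):
        """P(at least one term is true), terms independent of each other."""
        p_none = 1.0
        for _, p in terms:
            p_none *= (1.0 - p)
        return 1.0 - p_none

    lineage = [("t1", 0.60), ("t2", 0.30), ("t3", 0.02), ("t4", 0.01)]
    small = sufficient_lineage(lineage, eps=0.05)
    print(prob_of_or(lineage), prob_of_or(small))   # the two values differ by < 0.05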

And what this allowed in practice was orders-of-magnitude compression -- hundredfold compression -- which lets us scale up and is orthogonal to the materialized view approaches. Please.

>>: [inaudible] will you show the graph query [inaudible]? Is that [inaudible]?

>> Christopher Re: No -- so yeah. We did something that's standard in the area but looks pretty abusive outside it: we knocked out all of the aggregation. We took the shape of the query and converted all the aggregations into exists. So anytime you saw a count or a sum, we replaced it with an exists, or a distinct if you like, to generate queries, because there aren't real workloads available -- and the ones we know about that are more credible we can't publish. So it's an excellent technical point. Yeah.

The last contribution was going back to the earlier point about aggregation processing.

And here we have a series of optimal algorithms. I discussed a little the fact that for aggregation queries -- sum, count, and average, real SQL aggregates -- we have something akin to safe plans for those types of queries.

And for those of course the practical benefit is that you'll get SQL-speed processing whenever those things apply.
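For the easy piece of that story, here is a short sketch over a made-up tuple-independent table: the expected value of SUM and COUNT follows from linearity of expectation, and the variance of SUM decomposes because the tuples are independent. The hard cases in the talk, such as AVG and aggregates over joins, are exactly what this sketch does not cover.

    # Tuple-independent table: (value, probability that the tuple is present).
    rows = [(100.0, 0.9), (250.0, 0.5), (40.0, 0.2)]

    expected_sum   = sum(v * p for v, p in rows)                   # linearity of expectation
    expected_count = sum(p for _, p in rows)
    var_sum        = sum((v ** 2) * p * (1 - p) for v, p in rows)  # independence across rows

    print(expected_sum, expected_count, var_sum)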

And as I mentioned, it turns out that approximating them is theoretically hard as well. So there's a more interesting theoretical landscape for aggregate queries than there is for standard Boolean queries. Okay. How much time do I have? So, well, okay, perfect.

All right. So I will actually spend one or two slides on other systems I built. Okay. So another system we built started from the following observation: not everything fits into the very simple model we talked about earlier. What we want to model are things like sequential correlations, and what motivated us here was exactly that RFID example. Your location is very strongly correlated over time, and it's important to model that in the database to get the correct answers.

It turns out that this kind of sequential probabilistic data also pops up in areas like text parsing and audio-to-text: when you're converting audio and trying to transcribe it, you get very similar-looking probabilistic models.

So in the Lahar project what we want to do is manage that kind of data. And the idea was to be able to process regular-expression-like queries both in real time -- so, you know, alert me exactly when someone entered the office -- and also historically, doing mining-style queries over this data. An example would be: alert me when anyone enters the lecture hall after the coffee room.

So these are sequential queries that look like regular expressions, but there's an "anyone" in there, so there's some kind of variable or templating like in Datalog -- it combines the two.
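As a toy version of that kind of sequential query -- emphatically not Lahar's algorithm -- the sketch below assumes each timestep's location distribution is independent, whereas the real data is Markovian and handled with particle filtering. It computes the probability that someone was in the coffee room at some step and in the lecture hall at a strictly later step, using a three-state automaton for the pattern. The stream values are made up.

    def prob_coffee_then_lecture(stream):
        """stream: list of dicts mapping room -> probability at that timestep
        (assumed independent across timesteps for this toy example).
        States: 0 = nothing yet, 1 = saw coffee, 2 = matched."""
        s = [1.0, 0.0, 0.0]                               # start in state 0
        for dist in stream:
            p_coffee  = dist.get("coffee", 0.0)
            p_lecture = dist.get("lecture", 0.0)
            s = [
                s[0] * (1 - p_coffee),                          # still waiting for coffee
                s[0] * p_coffee + s[1] * (1 - p_lecture),       # saw coffee, no lecture yet
                s[2] + s[1] * p_lecture,                        # matched now or earlier
            ]
        return s[2]

    stream = [
        {"coffee": 0.7, "office": 0.3},
        {"hallway": 0.6, "coffee": 0.4},
        {"lecture": 0.8, "hallway": 0.2},
    ]
    print(prob_coffee_then_lecture(stream))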

So the highlights of this are that the challenge is really handling these massive datasets, and again we use the same kinds of techniques: static analysis of the queries, sometimes improved sampling routines, and the indexing I briefly mentioned.

And the contribution of this work was the first system to support these rich structured queries on streaming data. For more information, please see the Lahar Web site. We have much better videos than the one I showed you here. They don't illustrate particle filtering, but they show you what's actually going on in the real deployment, and there's lots of fun stuff on there as well.

Two other things -- I'll put the other two up. There's the Dedupalog stuff with Arvind.

One other project I worked on was at IBM, which was much more systems oriented.

It was building an XML compiler for this thing called Galax, which was used, for example, inside cell phone towers by AT&T to check the consistency of call records. My contributions there were building the algebraic compiler inside of it, recognizing joins inside XQuery queries, things like that.

I also worked on disk layout and distributed query processing to do data integration with some life scientists at the University of Washington. That was my first project when I came in.

The second project, which I'm happy to talk about a little offline, is this declarative language for deduplication with constraints that we did here last spring. The theory, just to hammer again on the red and the blue, was a constant-factor approximation for a large fragment of the Dedupalog language -- we did static analysis on the language to prove that it's always within some fixed bound on quality -- and the practical result was that we could cluster the citations here in around two minutes.

And if you know this literature, this is actually quite impressive: if you were to extrapolate another well-known approach to the size of our dataset, it would take on the order of seven CPU-years. So going down to two minutes is quite an improvement. And this will be at the upcoming ICDE in Shanghai.
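For a flavor of why constant-factor guarantees are possible in this setting, here is a sketch of the classic randomized pivot algorithm for correlation clustering, which is known to be a 3-approximation in expectation (Ailon, Charikar, and Newman). Dedupalog itself does more -- in particular it handles the hard constraints coming from the Datalog-style rules -- which this sketch ignores; the similarity function and names below are made up.

    import random

    def pivot_cluster(nodes, similar):
        """Pivot-based correlation clustering.
        nodes: iterable of items; similar(a, b): True if the pair is marked
        'probably the same entity'.  Repeatedly pick a random pivot and put
        every still-unclustered node similar to it into the pivot's cluster."""
        remaining = list(nodes)
        random.shuffle(remaining)
        clusters = []
        while remaining:
            pivot = remaining.pop(0)
            cluster = [pivot] + [n for n in remaining if similar(pivot, n)]
            remaining = [n for n in remaining if not similar(pivot, n)]
            clusters.append(cluster)
        return clusters

    # Made-up example: cluster author strings by a crude last-token similarity.
    names = ["C. Re", "Chris Re", "Christopher Re", "A. Arasu", "Arvind Arasu"]
    sim = lambda a, b: a.split()[-1] == b.split()[-1]
    print(pivot_cluster(names, sim))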

Okay. Now I'm going to tell you about my future work and my vision. I have till 45, right? Is that correct? Is that not correct? Yeah. Yeah. So, I mean, it's fine. I only have a couple more slides. Okay.

So let me tell you about my short-term and medium-term work, which is pushing more on this idea of approximation, which I told you about briefly with approximate lineage, and also on that idea of compression. The goal here is to manage the exabytes of data that are being produced. And the observation is that most of the data that's going to be produced over the next five years is data we can only understand imprecisely -- things like calls, videos, images. Just because of the sheer volume of the data, or because of the complexity of its semantics, we can only extract imprecise information from it.

So since this data is already imprecise, and sophisticated querying on it is very difficult, perhaps we can trade a little more imprecision in the data for a lot of query processing performance. And we've already seen one example of this.

For example, in the Lahar system we did approximate indexing and compression, and we were able to get orders-of-magnitude improvement in performance without too much degradation in precision and recall for querying. This is a pretty obvious tradeoff, but it's a fundamental one.

A second thing -- I mentioned this; I just want to come back to it -- is that we talked about approximation for aggregate queries, these very data-warehousing-style queries, which are obviously critically important to OLAP-style applications. And the high-level point I mentioned is that these queries are actually provably hard to approximate.

So that nice sampling algorithm I showed you before doesn't actually exist for these queries. Now, these queries are very important; we need to do something about them. So an interesting direction I've talked to some people about is combining the kind of static analysis on the query we've done with heuristics from statistics, which has a very rich toolbox of heuristics for processing very similar-looking queries. I think this is an interesting opportunity for future work.

I'll skip this one in the interest of time and just tell you about a long-term thing. The general feeling I have is that the future of data -- as I mentioned, these large-scale collections of data -- is going to be considerably less precise than the data we're dealing with now. And as a result, the data are going to be more difficult to understand and more difficult to debug.

Now, what I would like to provide is a database that's able to give clear, concise explanations of data products. And I think a bunch of projects in the area are already going in this direction.

One example of this is a recent paper that was talking about how we translate structured queries, or the results of structured queries, into simple text -- a simple text description of queries. And people are starting to make progress on giving a [inaudible] or a very small, simple description of what a structured query is computing.

The second reason to study these explanations is that it sort of closes the loop with respect to the provenance and lineage work. There are a lot of scientific databases right now where people are tracking where the provenance of the data is coming from and what the lineage of the information is. But they don't offer tools to explain it to you; they don't offer a theoretical foundation to tell you this is a good explanation for why this item is present. This is an idea we've found quite powerful in other areas such as data mining, and in debugging too.

So this may seem very vague and it's very far-reaching, there's no question about that, but there is some foundational work. Just as probabilities gave us a way to abstract away the details of all these imprecise sources and get to the fundamental data management tasks like querying, there's fundamental work on causality and explanation that has grown up over the last ten years, particularly by Pearl and Halpern, that will allow us to treat all these explanations in a uniform way.

And the interesting part -- the competitive advantage I have -- is that these techniques are themselves imprecise: the techniques that deal with causality and explanation rely on large-scale imprecision processing, which is essentially what my dissertation is about. So it's a natural extension.

So the last thing I want to tell you about is some related work. To be blunt, the message here is that probabilistic databases are important and there are a lot of people studying them. So here's the list.

In the discrete relational probabilistic case, in recent memory there have been three systems coming out. There's our system, MystiQ, which was the first; the Trio system, which did a lot of great work on how imprecision should be modeled and also identified lineage as an important first-class concept; and a follow-on project, the MayBMS system by Christoph Koch, where he has in some ways extended our results, for example on safe plans, to scale up to really large datasets as well. That's at Cornell and Oxford, with Dan Olteanu and some of their students.

Now, in this talk I talked only about discrete representations -- the paper was in the conference or it wasn't. But for things like temperature, you want continuous representations, and people have looked at that as well. Full credit goes to the Orion project at Purdue, who are looking at this. Around the same time was the very well-known BBQ project, which was looking at acquisitional query processing for sensor networks and temperatures.

One very interesting project is MCDB, which is taking place at Rice, Florida, and IBM. The idea here is: forget about all this static analysis for the moment -- trying to figure out which queries are safe, which views can be materialized -- and just use sampling to answer every query, but do the sampling very, very efficiently.

These guys are statisticians and they're making some very interesting progress doing this, and Peter Haas and I have had some conversations about perhaps combining static analysis with the statistical techniques. It's a very interesting and wide-open area to do this stuff at really large scales, because the problem they have there is performance: sampling is just very, very difficult to scale to real SQL queries and real datasets.
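As a sketch of that sampling style of evaluation -- my own toy version, nothing from the MCDB codebase -- sample many possible worlds by flipping each uncertain tuple in or out with its probability, run the ordinary deterministic query in each world, and average the answers. The table below is made up.

    import random

    def monte_carlo_query(prob_table, query, n_worlds=10000):
        """prob_table: list of (tuple, probability) pairs (tuple-independent).
        query: an ordinary deterministic function from a list of tuples to a number.
        Returns the Monte Carlo estimate of the query's expected answer."""
        total = 0.0
        for _ in range(n_worlds):
            world = [t for t, p in prob_table if random.random() < p]
            total += query(world)
        return total / n_worlds

    # Made-up example: expected total sales amount.
    sales = [(("kindle2", 120.0), 0.9), (("kindle2", 80.0), 0.4), (("nook", 150.0), 0.7)]
    q = lambda world: sum(amount for _, amount in world)
    print(monte_carlo_query(sales, q))   # roughly 120*0.9 + 80*0.4 + 150*0.7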

There is also a series of approaches based on graphical models that have come out over the last little while, mostly from the AI community into the database community, where they treat the database query answering problem as an inference problem. This is certainly a well-established and great area of work.

Another thing I wanted to emphasize: we talked about top-k processing, and I made the bold claim that top-k processing is fundamental to probabilities. To back that up, there are a bunch of people working on top-k plus ranking; I've listed some there. Every time you pick up a proceedings there are at least one or two papers that take your favorite ranking, your favorite semantics, and your favorite top-k and put them together.

And the reason is it turns out that it's a surprisingly interesting paradigm to support a lot of nice, fun applications. So people are looking at that.

And full credit should go to Ihab Ilyas at Waterloo and his student Mohamed Soliman, who actually proposed adding ranking functions -- not just looking for the most probable tuples, as was mentioned here, but looking for, sort of, the best tuples under an arbitrary ranking. It's a surprisingly subtle problem in its most general formulation. There's also work on streaming probabilistic databases.

One I want to mention is the UMass project with Yanlei Diao. The idea there is that they want to process radar databases, which are very, very cool, but the amount of data is orders of magnitude greater than the kind of data we deal with in Lahar. So the techniques around particle filtering and all those things don't directly apply.

So there's an entire world of big data out there.

There are also very cool techniques from AI. We borrow their models extensively and we're trying to import some of their techniques, notably differential inference and lifted inference, which have close analogs to the kind of stuff we're doing and are both very awesome.

Okay. So to conclude. What did I want you to take away from this talk if you took away nothing else? The first thing is that there are a growing number of large imprecise datasets. There are a lot of these applications out there and they're profitable to manage.

The second thing is that if you're going to model imprecision, a promising approach is to use probabilities. This allows us to strip away the differences between these applications and really get to their fundamental parts.

The third, pragmatic thing I want you to take away is that it's possible to actually build these applications on really large-scale databases. We saw here that dealing with gigabytes of data is difficult for probabilistic processing, but we were able to do it. The key challenge to get there was performance.

And my contribution was really about scale. I told you about materialized views, top-k processing, and those things. The theme of my work was always doing some amount of theory, some amount of formal work, to get practical results, and here the practical result was always being able to process gigabytes and gigabytes of data.

So thank you very much for your time. I appreciate it. You're a great audience.

[applause]
