>> Paul Larson: So we are pleased to welcome today Peter Triantafillou from the University of Patras. I've known Peter for quite some time. I don't know exactly how long. >> Peter Triantafillou: Don't want to say. >> Paul Larson: No. Peter has been at the University of Patras for about ten years, and he was at the Technical University of Crete before that and before that Simon Fraser. And before that University of [inaudible]. >> Peter Triantafillou: Right. >> Paul Larson: Peter has worked on a lot of things during his career, but mostly I would say it's been things that involve distributed systems and data management, file storage, that kind of stuff. Managing data in a distributed context in various ways. Plus other things. Today he's going to talk about how to extract more value out of key-value stores. Welcome. >> Peter Triantafillou: Thank you. Thank you very much. Well, thank you for the invitation. I'm really happy to be here and present some of the stuff we've been working on over the last year. When I say we, I don't want to forget to mention my colleague Nikos Ntarmos, a former Ph.D. student of mine, and Ioannis and George, who are basically master's students working with me at the University of Patras. So the title of the talk is a bit iffy: complex queries over key-value stores. So what we'll try to do is, first I'm going to present the overall framework, what the philosophy is behind what we're trying to do and what shapes our approach, and then go into a bit more length on two particular types of queries that I think are interesting and see how we can process them efficiently over key-value stores on the cloud. And I will end the talk by presenting four or five slides on the major conclusions and the interesting things we think we've learned from this, beyond "here is another solution that's good for this particular type of query and it's fast" and so on and so forth. So the general framework is that we're talking about data management services on the cloud. And they come in the form of queries of various complexities, and we typically have this well known trade-off between storage cost and query processing times. The idea here is that we would be willing to pay a bit more in terms of storage cost if that could save us money in the long run. And talking about money, we think the value to the enterprise will come from query execution. So the more queries the better, the more money that would be made, and so the idea would be that either the provider or the client comes up with smart ways to invest in storage, to build up interesting indices that would help them expedite the query processing. A few things about the overall driving philosophy of the work: we want to do some work towards real-time queries on the cloud. The state of the art falls short of this. We've seen in the last four or five years papers published in the major database conferences trying to do queries such as joins and other things that basically use MapReduce in one way or the other to accomplish the task. A good way to describe the talk would be "no MapReduce for NoSQL query processing." I don't quite believe that, but towards the end I will mention what I mean by that and to what extent I believe it. So the basic driving philosophy is that rather than running MapReduce jobs we want to build indices; the question is how to build the right indices, and one of the key design decisions for us is to have simplicity.
Simplicity in the design of the index, simplicity in the way you process the index to come up with the answer to the query. And, of course, we always realize that what we're going to do is going to be plugged into a big system. There's a lot of smart folks out there coming up with valuable things in terms of infrastructure. So we can use that. We can basically piggyback on that in order to come up with a better system. And, of course, efficiency and scalability always play a big role, and I'll talk about that later in more detail. So the first part is a bit more detail about interval indexing and querying. So there's a bunch of applications out there. I mention a few here. But basically they refer to temporal queries or interval queries in general, whether they refer to time or not. So there are different types of queries out there. I call some of them intersection queries, such as in a Web archiving system, where I'm interested to find pages that started or finished within a particular time interval, for analytics reasons. Or I'm looking for events that are completely contained: the events span a specific time interval, and this time interval is completely contained within another time interval. Or I'm looking at a security event. Say we have a terrorist attack, an event that spans over five hours, and I'm looking to find what activity went on from a week before the start to a week after the start. These are what we call containing queries. In general the query types look like this. The query is an interval with begin and end time points. So the first part are the containing queries; it's obvious what that means. These are the contained queries, the green intervals over there. The intersection queries are basically queries that are crossing the query interval either from the left or from the right; those are the purple intervals. And a particular interesting type of query is the so-called stabbing query, where you basically have a single point, the begin and end time points are the same, and you want to find out, for this particular time point, which are the intervals that are crossed by it. So what we will be doing is trying to come up with indices and query processing algorithms that use these indices to expedite answers to these queries. The basic infrastructure that we are assuming are key-value cloud stores; HBase is the reference. The basic idea here, I'm sure most of you are familiar with this: we have an HBase master, these are basically tables, or parts of tables if you wish, and all of these are data nodes, which are called region servers in HBase lingo, and inside one of these there's a MemStore. These key-value stores are optimized for high write throughput. All writes go through memory and eventually they're sorted into these particular file formats with particular indices and dumped into the Hadoop file system. That's the general infrastructure that we're going to be working with. So the first question is, now that we want to support these interval queries, is there native support provided by HBase to do this? Here's an example; I'll be using a running example throughout this, which is a Web archiving scenario. This is where most of our data comes from. And this is actually what's paying for this research, because this is a big European project worth about four million euros over three years. It's basically four partners throughout Europe.
And Patras is in charge of basically coming up with the indexing part and the query processing part on these indices for that. So here I'm having a region server, a data node, storing different regions or tables. The row key could be the URL of the site or whatever. There is a whole bunch of other things we don't care about for purposes of this talk. And we also have a begin and end time point. This could be whatever; this could represent different crawling times during which this page was alive: every time you get a new crawl of the same page, basically a new interval starts at that timestamp, and when the page is crawled again that's when the interval ends, and so on and so forth. And in general we have a begin and end time point associated with every row key, and the row key here is a URL. And similarly here. So if a query comes into the system that basically says give me all the interesting things that happened in this time period, then if you look at the times you'll see that it hits on both of these region servers. And what can HBase do with this? Basically you can run a filter. You can do a get operation specifying a filter, and the filter is a predicate that basically says: when you grab a row, look at the begin and end time points to see if they're interesting with respect to this query interval. Okay? Pretty simple stuff. Nothing great. The bad thing here is that obviously this is inefficient and costly. Why? Because it grabs every single row, it applies the predicate, if it's okay it keeps it, otherwise it goes on and on, for all the region servers in parallel, but still it has to touch all the rows. And especially for queries of low selectivity. Selectivity is an issue here; I'll touch on that later. And we have quite a few of those low-selectivity queries. The question is basically: can we expedite this? So there's a couple of ideas at play here. The first idea is what we call a time point index, an endpoint index. Here's my raw data. Again, I have a row key which could be the URL or whatever, it's how it's stored in HBase, and I have a begin and end point for each one of those different row keys. There could be other data here; we don't care about it, that's why I'm not showing it. So what we're doing basically is coming up with an additional table, or column family if you wish. So we grab every row -- this is done using MapReduce processes in parallel -- and for every different endpoint that we see, we create a new row. And the key here is this particular value. Right? So there's a row now in my table with value 1-1-2010, whatever. And also in this other column family that I've created here, I'm putting in the corresponding interval endpoint. Okay. Similarly, I'll grab this endpoint, I will create a new row for it, and in the other column family I'll put the left end point of the interval. Pretty simple stuff. So I will keep grabbing all the rows, creating all these new rows, and this will basically serve as my index. This segment of the table is the index for answering interval queries on my raw dataset. So let me give you an example of this. So this is my index that I built over the previous example. And suppose an interval query comes in that basically says I'm interested in this time interval, tell me what's interesting there. Okay. So there's a couple of things going on here underneath. The first thing is that, by design, HBase can do scans of rows really quickly.
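[Editor's note: a minimal sketch of the endpoint index idea just described, added for illustration. Plain Python dictionaries stand in for the HBase tables, and the data, names and scan helper are made up, not the actual system's code.]

```python
from collections import defaultdict

# Raw table: row key -> (begin, end). Stands in for the HBase raw data table.
raw = {
    "a.com": (1, 4),
    "b.com": (3, 5),
    "c.com": (2, 4),
    "d.com": (0, 10),   # spans the query interval below entirely
    "f.com": (4, 12),
}

# Endpoint index: one row per distinct time point; the "columns" of that row
# list the intervals that start or end there.
epi = defaultdict(list)
for key, (b, e) in raw.items():
    epi[b].append((key, b, e))
    epi[e].append((key, b, e))

def scan(index, lo, hi):
    """Emulates an HBase scan over the sorted row-key range [lo, hi]."""
    for t in sorted(index):
        if lo <= t <= hi:
            yield from index[t]

def interval_query(qb, qe):
    # Any interval with an endpoint inside [qb, qe] intersects or is contained.
    hits = {k: (b, e) for k, b, e in scan(epi, qb, qe)}
    # Intervals that strictly span [qb, qe] have no endpoint inside it, so they
    # need a second, potentially long, scan from the beginning of time.
    for k, b, e in scan(epi, float("-inf"), qb - 1):
        if e > qe:
            hits[k] = (b, e)
    return hits

print(interval_query(3, 5))   # d.com is found only by the second scan
```

For crossing and contained intervals the first scan is short; it is the spanning ("containing") intervals that force the long second scan, which is the gap the segment tree index described next is meant to fill.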
So by placing everything by row key here I can identify segments of interest really quickly and get all the rows really quickly, riding on what HBase provides me already. Okay. So if I look at the endpoints of the interval in the query, this basically gives me a segment of my endpoint index. And the idea here is that, looking at the results, I see there are different things going on. For example, I see there's this item B that both starts and finishes within the segment of the index; there is this item A that's basically a crossing interval, crossing from the left, because it started over there and it finishes here; there is also this thing C that started before and finishes within this query interval; and there's a couple of things that start within this particular segment and finish outside of it. Okay? So the results that I get from looking in here are A, B, C and F. I don't get D. And D is also part of the result, in the sense that you're interested in the containing queries. Okay. So in order to get D as well I will have to look either to this side or to this side. The idea would be that by scanning my index from the beginning of time until the end of the segment specified by the query, I would be able to get all the interesting things that happened. Or similarly, by scanning my index from the beginning of the query interval until the end of my index, I would also be able to get everything. And I would rely on the fast scan operator of HBase, or any key-value store, to provide me with the answers. Note that for particular types of queries, like the crossing queries from left and right, or the queries whose intervals are completely contained in the query interval, this thing is really fast. Typically it gives me a small section of my endpoint index and I can get all the relevant data pretty fast. If I have to do a stabbing query I'll probably have to scan huge portions of this, and this is going to be huge. Okay. So that's the take-home message from here. So as a second help for this, we looked into the literature as to what data structures and indexing methods there are to use for interval queries. We selected one that's one of the prototypical structures for doing this, and it's called the segment tree. And what we've done is we have provided a key-value encoding, a key-value representation, of the segment tree in a table. And we're using MapReduce phases, which we actually optimized in ways I hope I have time to point out, in order to take the raw data table that's given to us, stored somewhere in HBase, run these MapReduce processes, and come up with a key-value representation of the segment tree, which we call the MRST index. So in terms of computing the elementary intervals, which are needed in order to build the segment tree, it's pretty much -- it's not particularly difficult. What we're doing is we're splitting the space and giving the data items to two different mappers in this example. These mappers are basically producing all the different endpoints from the begin and end times, and some stuff is going on here so everything is sorted. And in the end what comes out, and what is stored into HBase, is all the endpoints, basically the elementary intervals.
And based on that we have a second -- oh, by the way, here we also realized that we have all the data that we need to build the endpoint index, so we build the endpoint index by writing it as a separate table in HBase. So now that we have the elementary intervals, we feed them into a second MapReduce phase with different mappers, and each mapper will basically create its own binary tree on top of the basic atomic or elementary intervals. Again, this is not particularly difficult. And then we join all these separate trees that have been produced under one heading, under one common root. To give you an idea what it looks like, this is what the MRST index looks like in HBase. So let's go through it. So what we've done is we have this encoding of the tree in a table, and we basically faithfully follow the standard algorithms for using a segment tree to answer interval queries. Suppose we have an interval query, a stabbing query at this point. We know what the root of the tree is; the root is here in this table. This table basically stores a row key, and the row key is the time point with which every particular node in the tree is identified. When you build a segment tree, every node in the tree is identified with a specific time point. For the root this time point is the median of all endpoints if they're sorted. For every subtree, the root of the subtree is the median of all the endpoints covered by that subtree. That's the basic idea. And also every node in the tree is associated with an interval, and that interval is basically the union of the intervals associated with the node's children; it's recursively defined. So we have these two pieces of information. The row key is this unique time point. And we also have other information: every node in my segment tree can have a list of different intervals that are placed on that node. So we have that. And we also have pointers to the right and left child of every node. Okay. Note here that, unlike traditional index pointers, these pointers are just keys, different keys into my table. So when this query comes in, I'm comparing this date here with this date. It falls to the right of it. First of all, I'll grab whatever is in that node. It's a nice characteristic of segment trees that to answer a stabbing query, basically what I'll do is follow a route from the root of the tree down to a particular leaf, and the segments that are stored with these nodes I'll collect; they all belong to the result. So I will compare this date here with this date, and I will decide whether to go left or right. This date here falls to the right of this, so we'll go to the right side. First I will grab the items stored there, and I will keep them in my result list. Then I will grab the right child, which basically sends me over to this row, and I will do exactly the same thing: compare this with that, and this falls to the left child. There's nothing to grab there; nothing's stored there. So then I would go to this one and keep continuing the same process. This node here actually has this particular interval stored, so I will keep the interval. Then I will go to the right child, and that basically refers to a leaf node, its elementary interval, and there is also something stored there, so I will grab that, and this is my answer. >>: Is that guaranteed to be a unique [inaudible] tree? >> Peter Triantafillou: Yes, because these are different endpoints, right?
There's only one endpoint. And I actually need that, because I define them as HBase keys. All right. So what's going on here is basically a tree traversal. Every node in the tree is a row in my table. And the cool thing here is: how many traversals am I going to do? Logarithmic with respect to the number of things I put there. So that gives me a predictable bound on my latency. Okay. So interval queries go pretty much the same way. I'm showing this example to illustrate another characteristic of segment trees, which is the bad side of segment trees. Again I have an interval here. I go to my root and I compare this interval, and I see that all of it falls, if I remember correctly, to the right of the time point associated with the root. So once I grab D for the result I'll go to the right child. Then going there I will continue; there's nothing there to grab. Then I see that this interval now spans both the left and the right children, so I have to descend down both of them. Okay. So then I grab those, and the process follows; I have to keep track so I don't miss anything. So I collect all the data. The point here is that if I have interval queries of somewhat bigger length, depending on how my segment tree is built, I may have to descend to too many nodes in the tree, and that will hurt me. Every descent I do, every node I visit, that is a get in HBase, and that costs. For a stabbing query this is great, because I do, I don't know, 20 or 30 gets and I'm done and I've grabbed everything. All right. But when I have to grab a whole portion of millions of nodes, doing millions of gets for a particular query, this will kill me, and we will see that in the performance results. So now I have both indices. Okay. Which one is better? And it turns out -- this was actually by design, after spending some time thinking about these data structures -- that we can get the best of both indices. What we're after here is to realize which index is good for which query type. When a query comes in, route the query to the appropriate index. And if the query is demanding, is complex and wants both containing and contained and crossing intervals like we said, then decide which index to use for which part of it. Okay. So this is what's being played out here. Basically for an interval query [a, b] I do a quick scan on [a, b] in the endpoint index. Typically this would be small; it identifies a small section of my endpoint index. But as we pointed out in the example earlier, I'm still missing something. So to get what I'm missing, I'm basically doing a stabbing query on MRST. By doing the stabbing query there, I will get the missing parts. I'll get some overlap, but I can filter those out every time I visit a node. So if I map this back to all these intervals, then the purple intervals are the ones I'm getting from the endpoint index and the green intervals are the ones I'm getting from MRST. Okay. So this is the idea. So the next logical question is: what happens now with updates? And this is a big problem, right? So what we're trying to do here is basically piggyback on the basic [inaudible] the key-value store's write throughput. Basically, in all of the work that I'm going to talk about, I'm indexing a value and I'm having things added to it. So adding things to associate with a particular value is just adding another column to it. Okay. I have write throughput for that.
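[Editor's note: a minimal sketch of a stabbing query over a segment tree laid out as a key-value table, added for illustration. The dict plays the role of the MRST HBase table (row key = the node's time point, columns = stored intervals plus child row keys); the tree contents and names are made up and not the actual MRST layout.]

```python
# Segment tree as a key-value table: row key = the node's time point; the
# "columns" hold the intervals registered at that node and the row keys of
# its children -- pointers are just other row keys, fetched with one Get each.
tree = {
    5: {"intervals": ["D"],      "left": 3, "right": 8},
    3: {"intervals": [],         "left": 2, "right": 4},
    8: {"intervals": ["F"],      "left": 7, "right": 9},
    2: {"intervals": ["A"],      "left": None, "right": None},
    4: {"intervals": ["B", "C"], "left": None, "right": None},
    7: {"intervals": [],         "left": None, "right": None},
    9: {"intervals": [],         "left": None, "right": None},
}
ROOT = 5

def get(row_key):
    """Stands in for one HBase Get on the MRST table."""
    return tree[row_key]

def stabbing_query(t):
    result, node_key, num_gets = [], ROOT, 0
    while node_key is not None:
        node = get(node_key)      # one Get per level -> O(log n) Gets in total
        num_gets += 1
        result.extend(node["intervals"])
        node_key = node["left"] if t <= node_key else node["right"]
    return result, num_gets

print(stabbing_query(4))   # -> (['D', 'B', 'C'], 3): root, node 3, then leaf 4
```

An interval query uses the same table but may have to descend into both children of a node, issuing many more Gets, which is the weak side discussed above.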
That write throughput is given to me. I don't have to worry about disk nodes being filled up, or blocks, or whatever. Again, we'll get back to that. So what we're proposing to handle the updates is to have this updates index. The updates index is functionally the same as an endpoint index; in other words, it's a table that I can scan in HBase. The key difference is that this thing is much smaller. Okay. So, whenever something comes in -- let me actually show this. This is my regular endpoint index, and this is my updates index. So when I have to insert a new record with row key A and an interval between 9 and 19, I'll go to 9 and add something here and I'll go to 19 and add something there, right? This is how the endpoint index was working, and I'll also do the same on the updates index. Basically the updates index is supposed to be something very small, because I don't want to be rebuilding the tree. The tree building is a costly process; segment trees, like most of the interval structures, are static, so I would have to rebuild them again from scratch. What I'm doing is dumping all the updates into this small index here, and when my query comes I'll run them in parallel: I'll get the old stuff, plus the new stuff that's been added, by doing quick scans of the small index here. That's the basic strategy. So if I'm to delete something, say I want to delete the record with key X, I'll go and delete it from the endpoint index and I will basically add tombstone records for that. So when I scan, in parallel, my updates index and I see a tombstone record, and I go to the tree and it gives me interval X, I'll use the tombstone record to filter that out. That's the basic idea. In terms of a stabbing query, how do we process a stabbing query or interval query now that we have this updates index? We run the query on MRST. Why? Because it's very fast; I'll do a logarithmic number of gets. And at the same time I'm scanning the updates index from its beginning until the query time point. This is small, the scan is fast, so I'm not going to pay a lot. So then I'm looking to add things that I missed from the tree, or things that have to be removed from the tree's answer. And if I have an interval query, then I can run my query on the tree, or I can run it on the endpoint index, or both, and in parallel I go and grab what's new in the updates index. Recall the updates index is small, so it doesn't hurt performance, and we've tested that; I'll show you some results. So in terms of experiments, we have crawls of these domains. Not particularly big. We used three, five and nine node clusters, and we deployed all of our code there. And in terms of the algorithms that we implemented: our algorithms for building the indices and for query processing, the native support that HBase provides, and Hive running on HDFS. We also ran Hive over HBase, but that was way too slow, even slower than Hive on HDFS. And here are some results. I mentioned that we have these MapReduce phases with which we're building the structures, and we have a simple version and an optimized version. Whether we look at the simple or the optimized version, the take-away message is that we see scalable performance: as we increase the number of nodes, going from two to four to eight data nodes, we see things scaling nicely. We see big improvements when we go to the optimized instead of the unoptimized version.
So these are fairly big improvements, and again we see nice scaling when we throw more money at the problem with bigger clusters. In terms of stabbing queries, this is basically what's going on. This is the HBase filter. This is what we get with Hive, which runs basically huge MapReduce jobs behind it, again touching, in parallel though, every row in the system. Here's what we get from the endpoint index only, here's what we get if we execute the query on the segment tree only, and if we use both we're not going to get any improvement over that. So again we see that for stabbing queries we get a factor of two or three better performance from running it on the tree. Here, running one-week-long intersection queries, what we did is we looked into the dataset and we randomly picked 100 queries, I think -- we picked 100 random one-week intervals. And because the dataset is very dense, this basically covers a lot of intervals in the particular dataset, and immediately you see what's going on with the tree. This is the effect I mentioned earlier, right? Going down the tree you have to visit basically a big portion of the nodes and you have to grab huge numbers of intervals from each one of them. Again, the endpoint index stays consistently below a certain bound, and if we do both -- that is, on the segment tree I don't do the whole interval query, I only do the stabbing query, and the main work is done on the endpoint index -- again things are improved significantly. >>: What's the database size here? >> Peter Triantafillou: We're talking about a few million, from two and a half to six million intervals. >>: And how big is each record? >> Peter Triantafillou: I don't remember, to be honest. It's a good question, because we have different versions of this. But I think what we did was we just put into the tables in HBase only the attributes we care about, so we don't have all the records there from the original dataset. If you were to put that in, it would be huge, because they have the text associated with them. So, again, this is the one-week-long query for the DMOZ dataset, and we see the same scenario: the tree gives really good performance for stabbing. For DMOZ, I forgot to mention, we created synthetic queries to be able to play with the selectivity of the query. So we designed queries that we knew were going to give us 25 percent of the whole dataset, to see what's going on, and we see again the trends we have seen before, with "both" as a clear winner: run the big interval query on both indices and get the best of both. And running at 75 percent selectivity in this particular dataset, again we see pretty much the same behavior. Okay. So the last set of experiments is when I'm basically running queries in the face of updates. Now the new player on the scene is the so-called updates index. So what we've done is we assumed that, by the time the queries start playing, a certain percentage of what was originally in my indices is now stored in the updates index. And this percentage was varied from five to 30 percent. So 95 percent here basically means that 95 percent of my data is in the tree and 5 percent of the data is new data that's not in the tree yet, it's in the updates index. So when a query comes in, remember, it goes in parallel: it goes on the tree or on the endpoint index, and it also has to go in parallel to the updates index to get the new stuff.
So the updates index was varied up to 30 percent of the original size of the tree, or of the original number of intervals. And the thing to note here is that it's pretty stable. We don't get any big deterioration in terms of time, because we're accessing the updates index in parallel. So it's pretty stable; we see this behavior everywhere. It looks like there are differences, but they're really within statistical noise; they're not really big differences if you look at the Y axis. Okay. So in terms of the conclusions for this part: this is a first crack at doing interval queries in key-value stores. These queries are becoming more and more popular, basically driven by the need for temporal analytics and all kinds of things like that. So we came up with a few indices that can help us solve the problem. We build them with MapReduce jobs. We have query processing algorithms utilizing the indices, and an index maintenance scheme that does not lock out queries, so the queries see the new things that have been put there. And we've seen big performance gains compared to the native support or to running MapReduce jobs. >>: So you're reporting latency but you're not reporting cost, and not reporting throughput. >> Peter Triantafillou: Right. >>: What do you say about that? >> Peter Triantafillou: Okay. So the -- >>: Because pile latency is kind of -- pile latencies tend to be rather large. [inaudible] is kind of a -- >> Peter Triantafillou: You're right. For example, in the other part of the work, we also explicitly report bandwidth, which can be an indication of the throughput that could be achieved. Here, of course, nothing else was running on the clusters, right? The only thing that was running on the cluster was basically the hundred queries. So I have a good feeling that if I were to report throughput I would basically be getting the same behavior, because of this. So if a particular index or a particular strategy is very resource hungry, it would not be showing here, because there's not a lot of things going on concurrently. But you're right, if I were to set it up in a real cluster with a lot of things going on, something that basically wastes bandwidth would kill my throughput. So there would not be -- there's no serializing point like that in the performance here. Any other questions? Yes? >>: Each query in a database would have a transaction context; do you provide any transaction context here? >> Peter Triantafillou: Okay. So this is a big discussion. So the question is what kind of consistency is being provided here. In terms of the regular queries, it's whatever the cloud store gives me; basically what HBase gives you is read consistency. In terms of our updates, we can also see that we can provide read-committed consistency. So that's about it. Now, there is a lot of work going on that is basically trying to provide snapshot isolation consistency semantics, or even pure [inaudible] semantics, within HBase, and that would be easy to use. This is part of why we did this, right? If anybody underneath in the infrastructure gives us a way to define transactions and helps, for example, serialize our index updates with the raw data, fine for us; we just pick that up, because everything is implemented in HBase lingo, using the APIs, doing everything that way. Okay. So part two -- I've got about 25 minutes. Part two refers to rank join queries. This crowd probably knows all about it. This is a typical [inaudible] template for this. Typically we're talking about an N-way join. There are different models.
Most models assume that one of the attributes in your table is being used to define the score attribute for a particular record. Okay? And the key point is that when you join tuples, you have to compute an aggregated score that comes from the two relations that are being joined, and this is typically a monotonic aggregation, like summation; we'll be using summation without loss of generality here. So what we have are, again, two different techniques to do this. First, as an aside, there's been some work on centralized rank join algorithms. The most frustrating thing when I was reading those works was the fact that they don't do the simplest thing, the almost straightforward way of doing this in a distributed environment. All of them do very complicated sampling approaches, with serious mathematics to show that the sample is correct with the right types of guarantees, but I bet good money that they would lose against the simple approach. So let me tell you what the simple approach is, and then what the more sophisticated bloom filter histograms are, which is the statistical structure for rank joins that we're coming up with. What is the basic idea? The basic idea is I have my raw data: some records, some row keys, here's my join attribute value, and here's my score, plus other things that I don't care about. So the basic thing here is the record key, whatever that was designed to be. So I'm building an inverted score list. It's basically this same data, but keyed by score. So my row key here is a score; this is a rehash of this. That's the inverted score list. The basic idea is we're building this inverted index based on score, and we build it using MapReduce. Then we start fetching batches of rows from this inverted index list. I have two relations I want to join, and I want to compute the top-K join. I'm going to go to the ISL of each relation and start bringing in batches of rows. And when I bring in these batches of rows, I can run your favorite algorithm for centrally producing a top-K result; in this case a threshold-style rank join algorithm, which is pretty much the standard in the area. So when I bring in a new batch of rows, I will check every row there against all the rows I brought previously to see if there's a join. And if there is a join, I will compute the aggregated score for the join, and then I will check to see if there's a chance that any records that I have not brought yet can make it into the top-K result. Okay. If so, I will continue fetching batches; if not, I will stop. This is basically the threshold algorithm, à la Fagin, applied to the traditional top-K query. >>: What needs to be true in order for you to be able to stop? What needs to be true about the aggregation -- >> Peter Triantafillou: It has to be monotonic. You're right. I wrote that but I did not mention it. So here is a typical example of how this works. I'm assuming here that the batches are just one row at a time. So I'm going now -- for the first rows I'm looking at the join attribute values; there is no match there. So I'm going to bring in the second row and I'm going to try to match it with both of the rows there. I see a match on the join attribute value, so there's a join, and I compute the score by adding the different scores there.
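[Editor's note: a compact sketch of the inverted-score-list rank join with a Fagin-style threshold, added for illustration. It assumes summation as the aggregation function and uses made-up toy data; the names, batch size of one, and stopping rule shown are one standard way to realize the scheme, not necessarily the exact algorithm of the talk.]

```python
import heapq

# Each relation as an inverted score list (ISL): rows sorted by descending
# score, each row = (score, join_attribute_value, tuple_id). Toy data.
isl_R = [(0.99, "x", "r1"), (0.95, "y", "r2"), (0.80, "z", "r3"), (0.40, "x", "r4")]
isl_S = [(0.98, "y", "s1"), (0.90, "x", "s2"), (0.60, "w", "s3"), (0.10, "z", "s4")]

def rank_join(R, S, k):
    seen_R, seen_S, results = [], [], []
    i = j = 0
    while i < len(R) or j < len(S):
        if i < len(R):                       # fetch the next row from R's ISL
            s, v, t = R[i]; i += 1
            results += [(s + s2, t, t2) for s2, v2, t2 in seen_S if v2 == v]
            seen_R.append((s, v, t))
        if j < len(S):                       # fetch the next row from S's ISL
            s, v, t = S[j]; j += 1
            results += [(s2 + s, t2, t) for s2, v2, t2 in seen_R if v2 == v]
            seen_S.append((s, v, t))
        # Threshold: best total score any join involving an *unseen* row could
        # still achieve (valid because summation is monotonic).
        next_r = R[i][0] if i < len(R) else 0.0
        next_s = S[j][0] if j < len(S) else 0.0
        threshold = max(R[0][0] + next_s, next_r + S[0][0])
        top = heapq.nlargest(k, results)
        if len(top) == k and top[-1][0] >= threshold:
            return top                       # nothing unseen can enter the top-k
    return heapq.nlargest(k, results)

print(rank_join(isl_R, isl_S, k=1))   # -> [(1.93, 'r2', 's1')] after two rounds
```

In the worst case this fetches many rows whose join attribute never matches anything on the other side, which is exactly the weakness the bloom filter histograms below are meant to address.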
Back to the example: so that's one result. Then I'm also going to join this one with the first row from the other relation, and I see a join on attribute value 12, and I also aggregate the scores there. So this one is obviously the highest one. So this is my top-K, my current top-1 join result. And if we look closely we'll see that nothing else that I can bring from this row and down can have a score higher than 183. That's why I stop. This is the threshold criterion. Okay. So this is pretty simple stuff, and we'll see that it works very nicely. And you can do this in any distributed system, even with DHTs; I've seen peer-to-peer systems do this. It's real easy to do: you take what other people have done for centralized processing, you bring data in with batches, and it works. So the bad thing with the previous algorithm is that you bring in tuples and you don't even know if they're going to join. Right? So depending on the distributions of the scores and the join attribute values, this may really kill you. Okay. So the goal is to try to bring in only those tuples for which you have a pretty good guarantee they're going to be in the join result. And here's where our statistical structures come into play. We use histograms where the buckets of the histograms reflect score ranges. Okay. And what we're going to be putting into those buckets are join attribute values; all of this will become more clear in a minute. And I don't want to just keep the frequencies in the histogram, because then I have to make assumptions about the distribution of the join attribute values within the score range, and I don't want to do that, because in practice those assumptions do not work very nicely. So what I want to do is keep every single join attribute value that went into a particular bucket. But because this can be huge, I'm going to use a bloom filter to summarize this data. So here is what the structure looks like. I'm having two relations here, and I'm showing only the join attribute value and the score value. So if I were to build a bloom filter histogram matrix, as we call it, for R1, what are we doing? We're processing one row at a time. So we see that the join attribute value is A and the score is one. I'm having a bucket for every score range, every tenth of a score, say; scores are normalized in this case. Okay. So this here is a bloom filter representing the contents of the bucket. So I'm going to hash A; it's going to land on this bit position in the bloom filter and set that bit. In this particular example I'm using counting bloom filters. We see from the score that this one refers to the first bucket, so I'm going to go to this bloom filter and set the bit that corresponds to the hash of C. Okay. So I keep doing that for all of these. Here .82 falls into this bucket: I hash B, I go here and set the bit in the corresponding bloom filter. This is the basic idea. I keep doing this. I thought I had skipped that animation, so please bear with me for a minute. Okay. And I've also built, similarly, the equivalent buckets using bloom filters for the other relation. Now, what's going to be going on during query processing is I'm going to be fetching a bucket at a time, starting from the high-end buckets. First I'm going to fetch the buckets that refer to everything that has a score between .9 and 1 from this relation and from that relation. Looking at those bloom filters, if I do a bitwise AND I know whether there's a join. Here there's no join.
I have something that hashes here and something that hashes here, but no match, so I don't have to bring anything. Now, when I go and fetch this bucket here, I will compare it with the contents of this one and do a bitwise AND again, and I will see that at this position I have two tuples of the second relation that have a join attribute value hashing to this position, and one tuple from the first relation that had a join attribute value hashing there, so I have a join result here. Actually I have two join results: I have a tuple from this one with value B in this case and one tuple from that one with value B, so I have a join, because B is the join attribute value. And I have a tuple from here with join attribute value B and two tuples from there with join attribute value B. >>: Why? Because these are not exact? >> Peter Triantafillou: I have a counting bloom filter, but you're right, there are false positives. So it could be a bit more; it's 2.02 or something. I'm on the safe side: I may bring something more, but I'm not going to miss anything because of the false positives. Let's set the false positives aside, because that's actually a big discussion here. Here I have a structure that can tell me when I have a join result. What I need now is that I can also associate with these two join results a high score and a low score, using basically the score ranges of the buckets, and I use those for my threshold criterion to decide if I need to bring more buckets. Okay. So this is how it works. The idea is we create a bloom filter for each one of the score ranges that we care about, and we store it as one column in a row; this is a blob of bits. Actually, there's a big discussion here about whether you're going to have counting bloom filters or plain bloom filters, how many hash functions you're going to use, and which compression algorithm you're going to use here. So we've done, believe me, a lot of work here, and we're actually going with plain bloom filters, for which we use a compressed representation. I won't go into many details; my colleague Nikos worked out all of these details, I designed the structure and the algorithm. So the other idea, then, is that we need a reverse mapping. Let me just explain what that is. So here I have my relation. I'm hashing these values and putting them into my bloom filter, at these different positions. So, for example, D hashes into position 7, and I'm keeping track that I have two things that hashed into here. But I also maintain what we call the reverse mapping. In other words, when the query node that collects these bloom filters sees that at position k I have a join result, there has to be a mechanism to go back and say: give me whatever hashed into position k. So here basically I have a table where the keys are the position indices, the nonzero positions in my bloom filter. So if somebody comes to this node and says give me whatever hashed into position 100 of the bloom filter, because I have a match, it can go here and grab all the relevant tuple information, such as the tuple ID, the join attribute value and the exact score. Okay. Again, this is a nice table in HBase: I get quick gets, multi-gets and all of that. So the algorithm works like this. We fetch a bucket from each of the tables, starting from the high end of the score range.
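[Editor's note: a minimal sketch of the bloom filter histogram matrix and the reverse mapping just described, added for illustration. One bloom filter per score bucket is kept as an integer bitmask, with a single hash function and toy sizes for brevity; the real structure uses multiple hash functions, counting or compressed filters, and per-bucket score bounds for the threshold test, none of which are shown here.]

```python
import hashlib

M = 64            # bits per bloom filter (toy size); one hash function only
NUM_BUCKETS = 10  # one bucket per tenth of the normalized score range

def position(value):
    """Bit position a join-attribute value hashes to."""
    return int(hashlib.sha1(str(value).encode()).hexdigest(), 16) % M

def bucket_of(score):
    return min(int(score * NUM_BUCKETS), NUM_BUCKETS - 1)

def build_bfhm(rows):
    """One bloom filter per score bucket, plus a reverse map: bit -> tuples."""
    filters = [0] * NUM_BUCKETS
    reverse = [{} for _ in range(NUM_BUCKETS)]
    for tid, join_val, score in rows:
        b, p = bucket_of(score), position(join_val)
        filters[b] |= 1 << p
        reverse[b].setdefault(p, []).append((tid, join_val, score))
    return filters, reverse

R = [("r1", "a", 0.95), ("r2", "c", 0.97), ("r3", "b", 0.82)]
S = [("s1", "b", 0.99), ("s2", "b", 0.91), ("s3", "d", 0.85)]
bf_R, rev_R = build_bfhm(R)
bf_S, rev_S = build_bfhm(S)

# Probe bucket pairs from the high end of the score range downwards. A nonzero
# bitwise AND says some join value (modulo false positives) occurs on both
# sides; the reverse maps then tell us which tuples to actually go and fetch.
for bR in range(NUM_BUCKETS - 1, -1, -1):
    for bS in range(NUM_BUCKETS - 1, -1, -1):
        common = bf_R[bR] & bf_S[bS]
        for p in rev_R[bR]:
            if (common >> p) & 1 and p in rev_S[bS]:
                print(rev_R[bR][p], "may join", rev_S[bS][p])
```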
We compute basically a bitwise AND of all the different bloom filters, and we use the scores, applying the threshold algorithm, to figure out whether we need to bring more buckets in. If so, we repeat the process; else we stop. If there is no join yet, we basically just do the bitwise AND and see whether or not we have enough results for the join; when we do have enough results for the join, we go back to the table and say: give me all the tuples, the tuple IDs, that hashed into position 100, like I was saying earlier. Okay. So again, for the data, for the performance evaluation, we took from TPC-H two relations, lineitem and [inaudible], and created a synthetic one for reasons I'll explain shortly. We have pretty much the same setup, and we're doing top-K joins where K is ten up to 10,000. We also have Hive on MapReduce implemented; too slow, I'm not going to bother showing it. And the simple idea, the inverted score list rank join algorithm, uses batches referring to 1 percent, 5 percent and 10 percent of the complete score mass. Okay. So here is what the results look like. The green bars refer to the bloom filter histograms approach, and here we see that the simple idea, the ISL idea, works nicely, in the sense that there's always at least one configuration, in this case the 1 percent score mass configuration, that beats the bloom filter histograms in terms of query times. Okay. In terms of bandwidth, though -- and here's back to the point that you made earlier -- this is not the case. Of course, bloom filters are compressed summaries. Anyway, we see from 10 to 30 times improvement. And so this would tell me something about throughput, and it would definitely tell me something about cost, which you also brought up, David, because most of the charging models charge you for whatever operations you're doing. So take, for example, the new DynamoDB charging model: if I'm doing an operation, I'm provisioning for a specific read throughput capacity, as they call it. So for every read of one kilobyte I do, I pay. The more kilobytes you have to read per query, the more this translates into bucks. So it's not exactly what you talked about in terms of throughput and exact dollar values, but it gives some indication towards that. >>: So why wouldn't this kind of algorithm apply simply to a normal cluster-based distributed database system? >> Peter Triantafillou: I think that it would. I think this particular statistical structure would work, yes. It just happened that the environment we looked at was cloud key-value stores, but I think it would work with regular SQL databases. So now an interesting thing is looking at the score distribution: this is actually the different scores and how many tuples from the original relations fell there. So the reason why the simple idea beat the bloom filter histograms was because it could stop somewhere around there, and it did not bring a lot of stuff before it stopped. So what we did was we reversed it: we flipped this synthetic part of the relation around, and indeed we see that this guy, the bloom filter histogram, is always better, even in time. In terms of bandwidth we show even bigger, up to 50 times, savings. So, conclusions. First crack at rank joins on key-value stores. We think, as David pointed out, that the statistical structures can be used in other environments as well.
The reason why we particularly like the ISL is that it's a simple idea; everybody in CS is comfortable with building inverted indices, and it seems to be doing a good job. There are some configuration issues as to how much to bring with every batch, but then again there is no system that doesn't have this tuning-parameter problem. With respect to the bloom filter histogram matrix, we have bandwidth savings that translate into cost, and the query times vary depending on whether bringing in the huge filters will actually pay off. If the distributions are such that you have to go deep, you may have to bring in the whole filter, and that may actually not pay off. So we're actually looking to find really bad distributions and negative correlations between the join attribute values and the scores, in order to see when this will not be better compared to the baseline approach. So for the third part of the talk, for the five, six minutes that I still have, I want to basically go through a number of things that are beyond the traditional conclusions that one sees in this type of research -- things that we've learned working with key-value stores and trying to build indices for them. So what we do is we have a key-value representation of indices; that means whatever the index is, we put it in a table. Everything for us, everything in the index, is an HBase table. We use the API provided by HBase, or whatever key-value store, to get at it. So what does an index need, especially for a key-value environment, for this type of application? You need fast additions to the posting list of a value: you have a value, and something else associated with that value, and something else, and something else. This is easily done by HBase, and by most key-value stores: you can easily add columns. You can do accesses either for exact match, with simple get operations and multi-gets, or with scans. And again, all of these are actually optimized; key-value stores exist for these types of workloads. So they're great for building your indices. And this is actually a big departure from related work. Most of the related work doesn't bother building indices: you have a MapReduce job doing joins with different optimizations. Or, when they try to build some index, there have been a couple of works that basically go too low, into the block level of the disk, and then you have problems with updates, or complete indexes disappearing, or if you want to delete an index, how do you delete an index? You're left with big holes in your actual physical storage of the data. >>: Isn't the knock about building indexes that it's so costly to build the indexes at the start that you really have a hard time amortizing that cost over the -- >> Peter Triantafillou: You're right. This is a classical problem: should I build an index for this or should I not build an index for this. One of the things we're looking at is whether, if for some reason you have a MapReduce job that runs so long, it would be faster to build the index first and use it even for that single query. I don't have any hard numbers on this, but it could definitely be the case. You can easily imagine these huge long-running MapReduce jobs: build the index for 20, 30 minutes, an hour, and run your query, which would take a few seconds; it would be better than 15 hours of running the MapReduce job. So that's one possible answer. But what we really need here, and I'm trying to get to that later, is some kind of an optimizer. Right? Should I build the index? If I build it, should I use it or not? And I'll get to that at a later point.
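[Editor's note: a tiny sketch of the "index as a table with growable columns" point, added for illustration. The dict-of-dicts stands in for a wide-row HBase table and the function names are made up; adding a posting is just adding one more column qualifier to the indexed value's row.]

```python
from collections import defaultdict

# Index "table": row key = indexed value; each posting is simply another
# column on that row (column qualifier = item id). Growing the posting list
# is a cheap column addition -- no node splits or fill factors to manage.
index_table = defaultdict(dict)

def add_posting(value, item_id, payload=None):
    """Like a Put that adds one more column qualifier to the value's row."""
    index_table[value][item_id] = payload

def get_postings(value):
    """Like a single Get: returns every column (posting) of the row at once."""
    return dict(index_table.get(value, {}))

add_posting("A", "item-17", {"score": 0.9})
add_posting("A", "item-42", {"score": 0.4})
add_posting("B", "item-17", {"score": 0.7})
print(sorted(get_postings("A")))   # -> ['item-17', 'item-42']
```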
So the other thing, which is actually closer to my heart now, is that there was this old divide between main memory and disk indices. There's a lot of smart people who work on computational geometry problems, CS theorists, and they come up with nice data structures. The database community never looked at those because they were designed for main memory applications. When main memory databases got a big lift back in the '90s, some people started looking at them. But there was this old divide. So my point is that if you're using key-value representations of these, most of those problems go away. There is no divide there anymore. Why? Because my index pointer just points to a row, and this row can have as many columns as you want representing the values that fit this particular row. So I don't have to worry about node splits and merging and fill factors for my block on disk, any of that. HBase takes care of that, and it does a good job of taking care of that. >>: So wouldn't the details have some impact on your potential performance, or is everything washed out given the cloud latencies? >> Peter Triantafillou: Partly. But, okay, the potential performance is: when I add stuff, I add columns, and this is really done fast, because key-value stores do this. When I retrieve stuff I just do gets on columns, which again the physical indices can do really fast, or I do scans. So I'm not really paying a lot. The whole point of building these indices is to surgically go and access specific rows, or segments of rows together using scans, and the key-value stores are fast for these things. >>: So I have trouble fully buying what you're saying, because if that were true, then that would seem to apply to normal database systems as well. >> Peter Triantafillou: No, because here, you see, I have a table where I can just add columns to the table; it's a different data model. See, here I can just add columns. Okay, so say I want to find out which items attribute value A is associated with. As new items come in, I just add new columns for that in my key-value store. This is really fast, just a main memory access. And when I do a get, I just go and get this value and I will get all of these new columns that have been added. I cannot do the same with SQL tables. So this way, I think, is something interesting that comes out of this, and I can make divides like that go away. And the point is that there are a lot of smart people out there, like I said, who really came up with these structures, but they're dismissed by the database community as being old school or main-memory only. So we can use these structures -- and in fact we are using some of these structures -- for massive datasets as well; not in the normal database sense, but at least on tables in key-value stores. Now, on big data indices and MapReduce: MapReduce will trump your index if you have high query selectivities, in other words if the result is a big portion of your data anyway, and it will do it in parallel; and if you're careful about how you're doing it, so there's no copying and reading back between the mappers and reducers and all that, MapReduce will be a winner there. So, to speak to the cute title at the beginning, "no MapReduce for NoSQL queries": it's mostly about complementing MapReduce, and in fact our goal is -- and we'd welcome tackling this problem together with any of you who are interested -- to work towards an optimizer.
A new query comes in: what statistics do I need, and how do I decide whether I run a MapReduce job or build the index, as I said earlier, and run the query on that? And we need a cost model for that, which is not an easy thing to do, to the extent that these things change a lot. But there are also throughput costs, because MapReduce may win with respect to response time, since you have these massive, embarrassingly parallelizable mappers and reducers running across thousands of machines. They might actually do a very good job on response time, but it might hurt your pocket. Again, if the charging model charges for whatever you read, and you have to touch every single record in the data store, you're going to have to pay for it. So your win here might actually hurt your pocket. Okay. So another thing is that key-value stores are really read challenged. That means two things. One thing is that they're not read optimized; they were basically write optimized. They were designed for write-intensive applications, for extremely high write throughput. I said two things: the first thing is we need indices to deal with that. If we're talking about queries and we're going to access something that's read challenged, as I call it, then you'd better have indices to go and get things fast. The other thing is that with works like this we are stressing the key-value stores with the indexing task, in the sense that we do a lot of gets. And gets are reads, right? This is how you read from a key-value store, and those are not optimized. So using trees, for example, with logarithmic depth might help you, especially with respect to predictability: I'll do 30 gets to go down from the root to the leaf for a billion nodes, for a billion-size dataset. But still, if we talk about interval queries, like we saw in the segment tree representation in the key-value store, we still have to visit a big number of the nodes, doing a get for each one of them, and that will still hurt performance. So I think there is money to be made, research-wise, looking at alternative key-value representations where you basically store your trees, or whatever, in a way that you can get at them quickly with scan operations: how you play around with, how you define, your keys so that they will be stored together and a quick scan over them will get the results without having to do specific gets sequentially. Okay. So I've got to stop. Thank you for inviting me, and I'd be glad to answer any questions you have. [applause]. >> Paul Larson: Questions? No more? All right. >>: I just have one, which is about the segment index. I assume one advantage of the segment index is you can map it onto the ordinary indexes that you see inside key-value stores. >> Peter Triantafillou: Uh-huh, yes. I could do the same with the interval tree. With, yeah, interval trees. Those are similar because they still have the same atomic elementary intervals. We just chose segment trees for reasons that have to do with efficiency, even though they are more storage hungry than interval trees. And there are other things, right? What's being indexed in this work is a one-dimensional object, an interval. So if, like work you've done, for example, you have the interval and something else -- so you combine, say, a B-tree on some key attribute and you want an interval associated with that as well -- it would be interesting to see how that would map into a key-value representation as well.
>>: Did you consider indexing both start and end points and doing intersections to find the results? >> Peter Triantafillou: Well, the endpoint index is close to that. For every interval I'm basically having a row for the start point and another row for the end point, so it's very close to that. And this is similar to what was done back when temporal databases were at their height; the time index, I think one of them was called. Basically the idea was the same: it created the different endpoints, but then built something like a B+ tree over that. Here I don't have to do that, because in essence, if I have this endpoint index, HBase does the binary search on it for me; it's like I have a binary tree over my endpoints, so I'm getting that for free. So they're very similar ideas that have been played with for some time. And this is a very deep field, because just going into the history of temporal databases, with both valid time and transaction time, there is a humongous literature there. But all of them come down to the same thing -- to similar things, rather. >>: I'm not familiar with the actual implementation of HBase, so I was wondering why you claim that it's read challenged. What -- >> Peter Triantafillou: All of the key-value stores are optimized for writes. What does that mean? It means that when a write comes in, it just goes into a MemStore, in memory, somewhere. When this thing fills up, what happens is it's sorted and written out to disk as a block, an HFile, with an additional index per file. So as you have updates coming in all the time, you get several of these so-called HFiles, and these can be spread all over the disk. So when a get comes in, you may have to look at all of these HFiles to figure out which one has the most recent version of the row you want to get. >>: So you're using log-structured techniques to handle -- >> Peter Triantafillou: In essence, right; the log-structured file system, for example, is something that actually permeates all of this design philosophy here. >>: But HBase is based on Bigtable, which is log-structured merge trees, so those partition files should be merged back together. >> Peter Triantafillou: Eventually. Eventually. >>: So it's only if you're reading stuff that's been recently written that you have this fragmentation problem, where you've got to pull stuff from the recent log as well as from the older ones. >> Peter Triantafillou: You don't always have that problem. The idea is, the actual implementation is, they use bloom filters. So you're asking for a particular row key, and they use bloom filters to decide which of the different HFiles you have created -- which are the blobs of data that were written to disk -- actually have this row in them. Okay. And if these HFiles have been compacted back into one big HFile, you just grab the index of that and you go and get the row. If it's not compacted, you have to bring in more of them. So then you have tuning decisions on how often the major compaction runs versus the minor compaction, and so on and so forth. And eventually you get hiccups: when you do a lot of reads over this, you see good performance, and all of a sudden it stalls, and what's happening in the background is that HBase is basically trying to reconcile things. Once it's reconciled, everything is fast again. >>: Do you have control over that in HBase, or is that just -- >> Peter Triantafillou: I think you do. But I wouldn't bet my money on it. >>: A configuration parameter. >>: It's a big parameter.
>>: And specify how frequently that compaction occurs? >> Peter Triantafillou: But you don't know; that's the point. You can't control something you don't know. How the heck would you know what the right time is? This is a classical problem for most systems: we have all these parameters and we don't know how to tune them. HBase falls into that category. That's why I call it a challenge. >>: It's hard to solve, so we give you a parameter. >> Peter Triantafillou: Yes. If you fine-tune it, the algorithms I showed you do great, yes. >>: You talked about the updates index as part of your talk. Did you perform any experiments on what the scalability is, what kind of write rates you can handle? >> Peter Triantafillou: No. This is actually on the to-do list. It makes absolute sense. But it's basically writing whatever HBase can give you; it makes perfect sense to be able to know that. One of the reasons why we're actually doing it this way, in other words staying at a high level, is because there's a lot of work going on here, and it could be that three months from now, if we run the same experiment you mentioned, we'll get different results, because it's a huge community, the community is contributing stuff, and a lot of smart people are working here. >> Paul Larson: All right. Let's thank the speaker, and I think we're done. >> Peter Triantafillou: Thank you. [applause]