>> Paul Larson: So we are pleased to welcome today Peter Triantafillou from
the University of Patras. I've known Peter for quite some time. I don't know
exactly how long.
>> Peter Triantafillou: Don't want to say.
>> Paul Larson: No. Peter has been at the University of Patras for about ten
years, and he was at the Technical University of Crete before that and before
that Simon Fraser. And before that the University of [inaudible].
>> Peter Triantafillou: Right.
>> Paul Larson: Peter has worked on a lot of things during his career, but mostly
I would say it's been things that involve distributed systems and data
management, file storage, that kind of stuff. Managing data in a distributed
context in various ways. Plus other things.
Today he's going to talk about how to extract more value out of key value stores.
Welcome.
>> Peter Triantafillou: Thank you. Thank you very much. Well, thank you for the
invitation. I'm really happy to be here and present some of the stuff we've been
working on for the last year. When I say we, I don't want to forget to mention my
colleague Nikos Ntarmos, a former Ph.D. student of mine, and Ioannis and
George, who are basically master's students working with me at the University of
Patras.
So the title of the talk is a bit iffy, complex queries over key value stores. So what we'll
try to do is first I'm going to present the overall framework, what the philosophy is
behind what we're trying to do and what shapes our approach, and then go at a
bit more length into two particular types of queries that I think are interesting
and see how we can process them efficiently over key value stores on the cloud.
And I will end the talk by presenting four or five slides referring to the major
conclusions and the interesting things we think we've learned from
this, beyond the usual conclusions that we have a solution that's good for this
particular type of query, that it's fast, and so on and so forth. So the general framework is that we're talking about
data management services on the cloud. And they come in the form of queries
of various complexities, and we typically have this well known trade-off between
storage cost and query processing times, and the idea here is that we would be
willing to pay a bit more in terms of storage cost if that could save us money in
the long run, and talking about money, we think the value to the enterprise will
come from query execution. So the more queries the better, the more money
that would be made, and so the idea would be that either the provider or the
client comes up with smart ways to invest in storage, to build up interesting
indices that would help them expedite the query processing.
A few things about the overall driving philosophy of the work is we want to do
some work towards real time queries on the cloud. The state of the art falls short
of this. We've seen, over the last four or five years, papers published in the major
database conferences trying to do queries such as joins and other things, that
basically use MapReduce in one way or the other to accomplish the task. A
good way to describe the talk would be no MapReduce for NoSQL
query processing. I don't quite believe that, but towards the end I will mention
what I mean by that and to what extent I believe that.
So the basic driving philosophy is that rather than running MapReduce for the queries
we want to build indices. The question is how to build indices; we want to build the
right indices and the key design, one of the key design decisions for us is to have
simplicity. Simplicity in the design of the index, simplicity in the way you process the
index to come up with the answer to the query. And, of course, we always
realize that what we're going to do is going to be plugged into a big system.
There's a lot of smart folks out there coming up with valuable things in terms of
infrastructure. So we can use that. We can basically piggyback on that in order
to come up with a better system.
And, of course, efficiency and scalability always play a big role, and I'll
talk about that later in more detail. So the first part is a bit more details about
interval indexing and querying. So there's a bunch of applications out there.
I mention a few here. But they basically refer to temporal queries or interval queries in
general, whether it refers to time or not, independently of that.
So there are different types of queries out there. I call them intersection queries
such as in a Web archiving system, I'm interested to find out pages that started
or finished within a particular time interval, for analytics reasons. Or I'm looking for
events that are completely contained: the events span a specific time interval, and
this time interval is completely contained within another time interval. Or I'm
looking for a security event. Say we have a terrorist attack, an event that spans
over five hours, and I'm looking to find what activity went on from a week before the
start to a week after the start.
So these are what we call containing queries. In general the query types look
like this. It's an interval with a begin and an end time point. So the first part are the
containment, the containing queries. It's obvious what it means. The
contained queries are these green -- is this on? It's over there -- the green
intervals. The intersection queries are basically queries whose intervals cross the query
interval either from the left or from the right. Those are the purple intervals. And
a particularly interesting type of query is the so-called stabbing query, where you
basically have a single point -- the begin and end time points are the same -- and
you want to find out, for this particular time point, which are the intervals that are
crossed by it.
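As an illustrative aside, not part of the talk: these four query types can be written down as plain predicates over begin/end pairs. The Python below is a minimal sketch, assuming closed intervals and hypothetical time units.

def contains(iv, q):      # data interval fully contains the query interval
    return iv[0] <= q[0] and q[1] <= iv[1]

def contained(iv, q):     # data interval fully contained in the query interval
    return q[0] <= iv[0] and iv[1] <= q[1]

def intersects(iv, q):    # data interval crosses the query interval from either side
    return iv[0] <= q[1] and q[0] <= iv[1]

def stabbed_by(iv, t):    # stabbing query: a single time point t
    return iv[0] <= t <= iv[1]

# Example in the spirit of the security scenario: events overlapping the window
# from a week before the incident start to a week after (hours, hypothetical units).
incident_start = 1000
window = (incident_start - 7 * 24, incident_start + 7 * 24)
print(intersects((990, 995), window))   # True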
So what we will be doing is we're trying to come up with indices and query
processing algorithms that use these indices to expedite answers to these queries.
The basic infrastructure that we are assuming is key value cloud
stores; HBase is the reference. The basic idea here, I'm sure most of you are
familiar with this: we have an HBase master. These are basically tables or parts
of tables if you wish, and all of these are data nodes, which are called region servers
in HBase lingo, and inside each one of these there's a MemStore. These key value
stores are optimized for high write throughput. All writes go through memory and
eventually they're sorted into these particular file formats, with their own
indices, and dumped into the Hadoop file system. That's the general infrastructure
we're going to be working with. So the first question is, we want to support now
these interval queries. Is there native support provided by HBase to do this?
Here's an example I'll be using a running example through this which is a Web
archiving scenario. This is most of our data -- this is where most of our data
comes from.
And this is actually what's paying for this research, because this is a big
European project in terms of about four million Euros over three years. It's
basically four partners throughout Europe.
And Patras is in charge of basically coming up with the indexing part and the query
processing part on these indices.
So here I have a region server, a data node storing different regions or
tables. The row key could be the URL of the site or whatever. There is a
whole bunch of other things we don't care about for purposes of this talk.
And we also have a begin and end period. This could be whatever. This could
represent different crawling times where this thing was alive. Every time you get
a new crawl of the same page, basically a new interval starts at that timestamp, and
when the page is crawled again that's when the interval ends, and so on and so forth. And in general we have a
beginning and end time point associated with every row key, and the row key
here is a URL.
And similarly here, so if I have a query that comes into the system, that basically
says give me all the interesting things that happen in this time period, then if you
look into the times you'll see that it hits on both of these region servers, and what
can HBase do with this? Basically you can run a filter. You can do a get
operation specifying a filter, and basically the filter is a predicate that
says, when you grab a row, look at the begin and end time points to see if they're
interesting with respect to this query interval. Okay? Pretty simple stuff. Nothing
great.
The bad thing here is that obviously this is inefficient and costly. Why? Because
it grabs every single row. It applies the predicate. If it's okay it keeps it.
Otherwise it goes on and on for all the region servers in parallel but still it has to
touch all the rows.
And especially for queries with low selectivity. Selectivity is an issue here. I'll
touch on that later. And we have quite a few of those low selectivity queries.
The question is basically can we expedite this? So there's a couple of ideas here
at play. The first idea is what we call an endpoint index. Here's my
raw data. Again, I have a row key which could be the URL or whatever. It's how
it's stored in HBase and I have a beginning and endpoint for each one of those
different row keys. There could be other data here. We don't care about it.
That's why I'm not showing it.
So what we're doing basically is we're coming up with an additional table if you
wish. Call it a family. So we grab every row. This is done using MapReduce
processes in parallel and what you're doing is for every different endpoint that we
see, we create a new row.
And the key here is this particular value. Right? So there's a row now in my
table with value 1-1-1-2010, whatever. And also in this other column family that
I've created here, I'm putting in the corresponding endpoint. Okay. Similarly, I'll
grab this endpoint, I will create a new row for it, and in the other column
family I'll put the left endpoint of the interval.
Pretty simple stuff. So I will keep grabbing all the rows, creating all these new
rows and this will basically serve as my index. This segment of the table is the
index for answering interval queries on my raw dataset.
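To make this concrete, here is a minimal Python sketch of what the construction amounts to. The layout is an assumption, not the authors' exact schema, and the tables are modeled as plain dicts; in the real system a MapReduce job writes these rows into an HBase table.

raw = {                      # raw table: row key -> (begin, end)
    "url-A": (1, 9),
    "url-B": (3, 5),
    "url-C": (2, 6),
}

endpoint_index = {}          # index: time point -> list of (row key, other endpoint, which end)
for row_key, (b, e) in raw.items():
    endpoint_index.setdefault(b, []).append((row_key, e, "start"))
    endpoint_index.setdefault(e, []).append((row_key, b, "end"))

# Sorting by time point mirrors HBase's lexicographic row-key order,
# which is what makes range scans over this index cheap.
for t in sorted(endpoint_index):
    print(t, endpoint_index[t])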
So let me give you an example of this. So this is my index that I built over the
previous example. And suppose I have an interval query that comes in that
basically says I'm interested in this time interval, tell me what's interesting there.
Okay. So there's a couple of things going on here underneath, first thing is by
design HBase can do scans really quickly. Scans of rows. So by placing
everything by row ID here I could really identify segments of interest really
quickly and get all the rows really quickly, riding on what HBase provides me
already.
Okay. So if I look at the endpoints of this interval in the query. Then this
basically gives me a segment of my endpoints index. And the idea here is while
looking at the results, for example, I see there's different things going on here.
For example, when I see there's this item B both starts and finishes within the
segment of the index, there is this item A that's basically a crossing interval,
crossing the query interval from the left, because it started over there and it finishes here.
There is also this thing C that started before and finishes within this query
interval. There's a couple of things that start within this particular segment and
finish outside of it.
Okay? So the results that I get from looking in here are A, B, C and F. I don't get
D. And D is also part of the result, in the sense that you're interested in the
containment queries. Okay. So in order to get D as well I will have to look either
to this side or to this side. The idea would be by scanning my index from the
beginning of time until the end of the segment specified by the query, I would be
able to get all the interesting things that happened.
Or similarly, by scanning my index from the beginning of the query interval until
the end of my index, I will also be able to get everything. And I would rely on the
fast scan operator of HBase or any key value store to provide me with the
answers.
Note that for particular types of query, like the crossing queries from left and
right, or the queries where the intervals are completely contained in the
query interval, this thing is really fast. Typically it gives me a small section
of my endpoints index and I can get all the relevant data pretty
fast. If I have to do a stabbing query I'll probably have to scan huge portions of
this, and this is going to be huge.
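A minimal sketch of the query side, under the same assumed layout: the range scan over [qb, qe] is what HBase's fast scan provides, and it finds the contained and crossing intervals, while an interval that covers the whole window (the D case) is missed and needs either the long scan or the segment tree described next.

import bisect

# assumed endpoint index: time point -> list of (row key, other endpoint)
endpoint_index = {
    1: [("url-A", 9)], 9: [("url-A", 1)],
    3: [("url-B", 5)], 5: [("url-B", 3)],
    2: [("url-C", 6)], 6: [("url-C", 2)],
}
sorted_keys = sorted(endpoint_index)          # HBase keeps row keys sorted for us

def interval_query_epi(qb, qe):
    """Scan the index rows whose time point falls in [qb, qe]."""
    lo = bisect.bisect_left(sorted_keys, qb)
    hi = bisect.bisect_right(sorted_keys, qe)
    hits = set()
    for t in sorted_keys[lo:hi]:
        for row_key, _other in endpoint_index[t]:
            hits.add(row_key)
    return hits

print(interval_query_epi(4, 7))   # {'url-B', 'url-C'}; url-A spans the whole window and is missed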
Okay. So that's the take-home message from here. So as a second help for this,
we looked into the literature as to what data structures are out there, and there's
a lot of data structures, indexing methods to use for
interval queries. So we selected one that's one of the prototypical structures
for doing this, and it's called the segment tree. And what we've done is we have
provided a key value encoding, a key value representation of the segment tree in a
table.
And we're using MapReduce phases, which are actually optimized in ways I hope I
have time to point out, in order to create it: based on the raw data table that's
given to us, stored somewhere in HBase, we run these MapReduce processes and
come up with a key value representation of the segment tree, which we call the
MRST.
So in terms of computing the elementary intervals, which are needed in order to
build the segment tree, it's pretty much -- it's not particularly difficult.
What we're doing is we're splitting the space and we're giving the data items to
two different mappers in this example. These mappers basically produce
all the different endpoints, the begin and end times, and some stuff is going
on here so everything is sorted. And in the end what comes out and is stored into
HBase, once we sort the endpoints, is basically the elementary intervals.
And based on that we have a second phase -- oh, by the way, here we also
realized that we have all the data we need to build the endpoints index, so we
build the endpoint index as well, writing it as a separate table in HBase.
So now that we have the elementary intervals, we feed them into a second phase
in a MapReduce process in the different mappers, and each mapper will basically
create its own binary tree on top of the basic atomic or elementary intervals.
Again, this is not particularly difficult. And then we join all these different
separate trees that are produced under one common root.
To give you an idea what it looks like, this is what the MRST index looks like in
HBase. So let's go through it. So what we've done is we have this encoding of
the tree in a table and we basically faithfully follow the standard algorithms for
using a segment tree in order to answer interval queries. Suppose we have an
interval query, a stabbing query at this point. We know what the root of the tree is,
the root is here on this table.
This table basically stores a row key, and the row key is the time point
with which every particular node in the tree is identified. So when you build a
segment tree every node in the tree is identified with a specific time point. For
the root this time point is the median of all endpoints if they're sorted. For every
sub tree, the root of the sub tree is the median of all the endpoints covered by
that sub tree. That's the basic idea.
And also every node in the tree is associated
with an interval. And the interval is basically the union of the intervals associated
with the node's children. And this is recursively defined.
We have these two pieces of items. The row key is this unique time point. And
we basically have other information. Every node in my segment tree can have a
list of different intervals that are placed on that node. So we have that. And we
also have pointers to the right and left child of every node. Okay. Note here, in
contrast to traditional index pointers, these pointers are just keys, different keys into
my table. So when this query comes in, I'm comparing this date here with this
date. It falls to the right of it. First of all, I'll grab whatever is in that node. It's a
unique characteristic of segment trees that to answer a stabbing query basically
what I'll do is follow a path from the root of the tree to a particular leaf.
And the segments that are stored with these nodes, I'll collect them. They
all continue to be part of the result. So I will compare this date here with this
date, and I will decide whether to go left or right. This date here falls to the right
of this. So we'll go to the right side. First I will grab the item stored there, and I
will keep it in my result list. And then I will grab the right child, and that basically
sends me over to this row, and I will do exactly the same thing: compare this with
that, and this falls to the left child. There's nothing to grab there, nothing's
stored there. So then I would go to this and keep continuing the same
process. So this node here actually has this particular interval stored on it.
So I will keep the interval, then I will go to the right child, and that basically refers
to a leaf node, an elementary interval, and there is also something stored there, so I
will grab it and this is my answer.
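As a sketch of what this key value encoding and traversal look like -- node keys, ranges, and the interval placement here are hypothetical and hand-built for illustration; the real MRST comes out of the MapReduce construction -- the HBase table can be modeled as a dict:

mrst = {
    # row key (node time point): intervals stored on the node, plus child row keys
    5: {"intervals": [("A", 1, 9)], "left": 3, "right": 7},       # node range [1, 9)
    3: {"intervals": [],            "left": 2, "right": 4},       # node range [1, 5)
    7: {"intervals": [("C", 5, 9)], "left": 6, "right": 8},       # node range [5, 9)
    2: {"intervals": [("B", 1, 3)], "left": None, "right": None}, # leaf [1, 3)
    4: {"intervals": [],            "left": None, "right": None}, # leaf [3, 5)
    6: {"intervals": [],            "left": None, "right": None}, # leaf [5, 7)
    8: {"intervals": [("D", 7, 9)], "left": None, "right": None}, # leaf [7, 9)
}

def stabbing_query(table, root_key, t):
    """Collect intervals stored along the root-to-leaf path for time point t.
    Each loop iteration corresponds to one HBase get, so the number of gets
    is logarithmic in the number of indexed endpoints."""
    result, key = [], root_key
    while key is not None:
        node = table[key]                      # one get per node visited
        result.extend(node["intervals"])       # everything stored on the path is stabbed by t
        key = node["left"] if t < key else node["right"]
    return result

print(stabbing_query(mrst, 5, 7))   # [('A', 1, 9), ('C', 5, 9), ('D', 7, 9)]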
>>: Guarantee to be a unique [inaudible] tree?
>> Peter Triantafillou: Yes, because these are different endpoints, right?
There's only one such endpoint. And I actually need that in order to define them
as HBase keys. All right. So what's going on here is basically a
tree traversal. Every node in the tree is a row in my table.
And the cool thing here is how many gets I'm going to do: it's logarithmic with
respect to the number of things I put there. So that gives me a predictable bound
as to my latency. Okay. So interval queries pretty much go the same. I'm
showing this example to illustrate something which is another characteristic of
segment trees, which is the bad side of segment trees. Again I have an
interval here. I go to my root and I compare this interval, and I see that
all of it falls to the right, if I remember correctly, of the time point associated with
the root.
So once I grab D for the result I'll go to the right child. Then going there I will
continue. There's nothing there to grab. So I will basically see that this interval
now spans both the left and the right children. So I have to descend down
both of them. Okay. So then I grab those, and the process
follows.
I have to keep track so I don't miss anything. So I collect all the data. The point
here is that if I have interval queries of somewhat big length, depending on how my
segment tree is built, I may have to descend to too many nodes in the tree, and that
will hurt me. Every descent I do, every node I visit, is a get in HBase, and that
costs. Stabbing queries are great because I do, I don't know, 20 or 30 gets and I'm done
and I've grabbed everything. All right. But when I have to visit a whole portion of
millions of nodes, doing millions of gets for a particular query, this will kill me, and
we will see that in the performance results.
So now I have both indices. Okay. Which one is better? And it turns out this
was actually, didn't just turn out, it was by design after spending some time
thinking about these data structures, is that we can get the best of both
indices. But what we're after here is to realize which
index is good for which query type. When a query comes in,
route the query to the particular index. And if the query is too demanding, is
complex and wants both containing and contained and crossing like we said,
then decide which index to use for which part of it. Okay. So this is what's
being played out here. Okay. So basically for an interval query [A, B] I do a quick
scan on [A, B]. Typically this would be small; it identifies a small section of my endpoints
index.
But as we pointed out in the example earlier, I'm still missing something. So
to get what I'm missing, basically I'm doing a stabbing query on the MRST. By doing
the stabbing query there, I will get the missing parts. I'll get some overlap, but I
can filter those out every time I visit a node.
So if I go back to all these intervals, the purple intervals are the ones I'm
getting from the endpoint index and the green intervals are the ones I'm getting from the MRST.
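A compact sketch of this routing, with assumed toy data; the tree stabbing below is a linear stand-in for the logarithmic MRST traversal, so only the shape of the hybrid plan is illustrated.

import bisect

intervals = {"A": (1, 9), "B": (4, 6), "C": (2, 5), "D": (1, 12)}

# endpoint index: sorted list of (time point, interval name)
epi = sorted((t, name) for name, (b, e) in intervals.items() for t in (b, e))

def epi_scan(qb, qe):
    """Cheap range scan: intervals with an endpoint inside the query window."""
    lo = bisect.bisect_left(epi, (qb, ""))
    hi = bisect.bisect_right(epi, (qe, chr(0x10FFFF)))
    return {name for _, name in epi[lo:hi]}

def tree_stab(t):
    """Stand-in for the MRST stabbing query: intervals containing time point t."""
    return {name for name, (b, e) in intervals.items() if b <= t <= e}

def hybrid_query(qb, qe):
    return epi_scan(qb, qe) | tree_stab(qb)   # the union filters out the overlap

print(hybrid_query(3, 7))   # {'A', 'B', 'C', 'D'}; A and D only come from the stabbing query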
Okay. So this is the idea. So the next logical question is what happens now with
updates? And this is a big problem, right? So what we're trying to do here is
basically piggyback on the key value store's [inaudible] write throughput.
Basically in all of the work that I'm going to talk about, I'm
indexing a value and I'm having things added to it. So adding something associated
with a particular value is just adding another column to that row.
Okay. I have write throughput for that. That's given to me. I don't have to
worry about disk nodes being filled up, or blocks, or whatever. Again we'll get
back to that. So what we're proposing to handle the updates. We're proposing
to have this updates index. The updates index is functionally the same as an
end point index. In other words, it's a table that I can scan in HBase.
The key difference is that this thing is much smaller. Okay. So everything,
whenever somebody comes in, let me actually show this. So this is my regular
endpoint index. This is my updates index. So when I have to insert a
new record with row key A and an interval between 9 and 19, I'll go to 9 and add something
here, and I'll go to 19 and add something there, right? This is how the endpoint index
was working, and I'll also do the same on the updates index.
Basically the updates index is supposed to be something very small. The point is that
I don't want to be rebuilding the tree; tree building is a costly
process. Segment trees, like most of the interval structures, are static, so I would have to
rebuild them again from scratch. What I'm doing instead is dumping all the updates into
the small index here, and when my query comes I'll run in parallel: I'll get the old
stuff plus the new stuff that's been added by doing quick scans of the small index
here. That's the basic strategy. So if I'm to delete something, I want to delete
record with key X, so I'll go and delete it from the endpoint index and I will
basically add tombstone records for that.
So when I scan in parallel my updates index and I see a tombstone
record, and the tree gives me interval X, I'll use the tombstone record
to filter that out. That's the basic idea. In terms of stabbing queries, what's going
on, how do we process a stabbing query or interval query now that we have this
updates index? We run the query on the MRST. Why? Because it's very fast, I'll do a
logarithmic number of gets. And at the same time I'm scanning the updates index from its
beginning until the query time point. This is small. The scan is
fast. So I'm not going to pay a lot.
So then I'm looking to add things that are missing from the tree or remove things that have
to be removed from the tree. And if I have an interval query, then I can run my
query on the tree, or on the endpoint index, or both, and in parallel I go and grab
what's new in the updates index.
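A sketch of this maintenance strategy under assumed data structures (the names are hypothetical): the updates index just records new endpoints plus tombstones, and gets merged into the answer at query time.

updates_endpoints = []   # (time point, interval name, begin, end) for newly inserted intervals
tombstones = set()       # row keys deleted since the last tree rebuild

def insert(name, b, e):
    updates_endpoints.append((b, name, b, e))
    updates_endpoints.append((e, name, b, e))

def delete(name):
    tombstones.add(name)

def query_with_updates(base_result, qb, qe):
    """base_result: intervals found via the endpoint index / MRST; merge in the updates."""
    fresh = {name for t, name, b, e in updates_endpoints
             if qb <= t <= qe or b <= qb <= e}        # new intervals hitting the window
    return (set(base_result) | fresh) - tombstones

insert("X", 9, 19)
delete("C")
print(query_with_updates({"A", "C"}, 8, 12))   # {'A', 'X'}; C is filtered by its tombstone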
Recall the updates index is small. So it doesn't hurt performance and we've
tested that. I'll show you some results. So in terms of some experiments, we
have crawls of these domains. Not particularly big. We used three, five and
nine node clusters, and we ran all of our code there. In terms of the
algorithms that we implemented: our algorithms for building the indices and for
query processing, the native support that HBase
provides, and Hive running on HDFS. We also ran Hive on HBase,
but that was way too slow, even slower than Hive on HDFS.
And here are some results. I mentioned that we're building the structures with these
MapReduce phases, and we have a simple version
and an optimized version. If we look at the optimized version, the take-away
message is we see scalable performance. As we increase the number of nodes,
going from two to four to eight data nodes, we see things scaling nicely. We see big
improvements when we go to the optimized instead of the unoptimized version. So
these are fairly big improvements, and again we see nice scaling when we
throw more money at the problem with bigger clusters.
In terms of stabbing queries, this is basically what's going on. This is the HBase
filter. This is what we get with Hive; it runs basically huge MapReduce jobs
behind it, again touching, in parallel though, every row in the system. Here's
what we get from the endpoint index only, from the segment tree only, and if we use
both -- we're not going to get any improvement from that for stabbing queries.
So again we see that for stabbing queries we get a factor of two or three better
performance from running it on the tree.
Here, running one-week-long intersection queries, what we did is we looked into the
dataset and we picked 100 random one-week intervals, I think. And because the dataset
is very dense, this basically means a lot of intervals. In this particular dataset you
immediately see what's going on with the tree. This is the effect I mentioned earlier,
right? Going down the tree you have to visit basically a big portion of the nodes and
you have to grab huge numbers of intervals from each one of them.
Again, the endpoint index stays consistently below a certain bound, and if we do both --
that is, on the segment tree I only do the stabbing query, not the whole thing, and the
rest I do on the endpoint index -- again things improve significantly.
>>: What's the database here?
>> Peter Triantafillou: We're talking about a few million, from two and a half to
six million intervals.
>>: Up against each record?
>> Peter Triantafillou: I don't remember, to be honest. The question is, because we
have different versions of this. But I think what we did was we just put onto the
tables in HBase only the attributes we care about. So we don't have all the
records there from the original dataset. If you were to put those in, because they
have the text associated with them, it would be huge.
So, again, pretty much -- this is the one-week queries for the DMOZ dataset, and again
we see the same scenario. The tree really gives good performance for stabbing. If
you go to -- so here what we did for DMOZ, I forgot to mention, we created
synthetic queries to be able to play with the selectivity of the
query. So we designed queries that we knew were going to give us
25 percent of the whole dataset to see what's going on, and we see again the
trends we have seen before: both is a clear winner if we run the big
interval query on both indices and get the best of both. And going to
75 percent selectivity in this particular dataset, again we see pretty much the
same behavior.
Okay. So the last set of experiments is when I'm basically running queries in the face
of updates. Now the new player on the scene is the so-called updates
index.
So what we've done is we assumed that when the queries start playing, a
certain percentage of what was originally in my indices is now
stored in the updates index. And this percentage was varied from five to 30 percent. So
95 percent here basically means that 95 percent of my data is in the tree and
5 percent of the data is new data that's not in the tree yet; it's in the updates index.
So when a query comes in, remember, it goes in parallel: it goes to the tree and
to the endpoint index, and it also has to go in parallel to the updates index to get the
new stuff. So the updates index was varied up to 30 percent of the original size of the
tree, of the original number of intervals. And, again, the thing to note here is it's
pretty stable. So we don't get any big deterioration going on in terms of time,
because we're accessing the updates index in parallel. So it's pretty stable. We see
this behavior everywhere. It looks like there are differences, but they're really in the
statistical range; they're not really big differences if you look at the Y axis.
Okay. So in terms of the conclusions for this part, this is a first crack at doing
interval queries in key value stores. These queries are becoming more and more
popular, basically driven by the need for temporal analytics and all kinds of
things like that. So we came up with a few indices that can help us solve the
problem. We build them with MapReduce jobs. We have
query processing algorithms utilizing the indices, and an index maintenance scheme that
does not lock out queries, so queries see the new things that have been put
there. And we've seen big performance gains compared to the native support and compared
to running MapReduce jobs.
>>: So you're reporting latency but you're not reporting cost, and not reporting
throughput.
>> Peter Triantafillou: Right.
>>: What do you say about that?
>> Peter Triantafillou: Okay. So the --
>>: Because pile latency is kind of -- pile latencies tend to be rather large.
[inaudible] is kind of a --
>> Peter Triantafillou: You're right. For example, in the other part of the work,
we are explicitly also reporting bandwidth, which can be an indication of the
throughput that could be achieved. Here, of course, nothing else was running
on the clusters, right? The only thing that was running on the cluster was basically
the hundred queries. So I have a good feeling that if I were to report throughput I
would basically be getting the same behavior, because of this. So if a particular
index or a particular strategy is very resource hungry, it would not be showing
here. But you're right, if I were to set it up in a real cluster with a lot of things
going on concurrently, something that wastes bandwidth
would basically kill my throughput.
So there's no serialization point limiting the performance here. Any
other questions? Yes?
>>: What's the -- each query in a database would have a transaction context. Do you
provide any transaction context here?
>> Peter Triantafillou: Okay. So this is a big discussion. So the question is what
kind of consistency is being shown here. In terms of the regular queries, it's
whatever the cloud gives me, basically whatever consistency HBase gives you. In terms of our
updates, you could say we can get read-committed
consistency. So that's about it. Now, there is a lot of work going on that basically
is trying to provide snapshot isolation consistency semantics, or even pure
[inaudible] semantics, within HBase, and that would be easy for us to use. This is part of
why we did this, right? If anybody underneath, in the infrastructure, gives us another
way to define transactions and helps, for example, serialize our index updates with
the raw data, fine for us. We would just use that, because everything is implemented
in HBase lingo, using the APIs, doing everything through them.
Okay. So for part two I've got about 25 minutes. Part two refers to rank join
queries. This crowd probably knows all about them. This is a typical [inaudible]
template for this. Typically we're talking about an N-way join. There are different
models. Most models assume that one of the attributes in your table is being
used to define the score attribute for a particular record.
Okay? And the key point is that when you join tuples, you have to compute
an aggregated score that comes from the two relations that are being joined,
and this is typically a monotonic aggregation, a summation or something like that;
we'll be using summation without loss of generality here.
So what we have are, again, two different techniques to do this. The first thing --
this is an aside -- there's been some work on
centralized rank join algorithms. The most
frustrating thing when I was reading those works was the fact that they don't do
the simplest thing for doing this in a decentralized environment. There is an almost
straightforward way to be able to do this in a distributed environment,
and all of them do very complicated sampling approaches with a lot of mathematics
to show that the sample is correct with the right guarantees, but I bet good
money that they would lose against the simple approach. So let me tell you what
the simple approach is, and what the more sophisticated bloom filter histogram is,
which is the novel statistical structure for rank joins that we're coming up with;
that's what those two are all about.
What is the basic idea? The basic idea is I have my raw data. Okay. Some
records, some row keys. Here's my join attribute value, and here's my score,
plus other things that I don't care about. So the basic thing here is the record
key, whatever that was designed to be. So I'm building an inverted score list:
it's basically the same thing, but keyed in terms of score.
So my row key here is a score. Okay. So this is a rehash of this; that's the
inverted score list. The basic idea is we're building an inverted index, right, based
on score -- a score index. We build this using MapReduce. Then we start
fetching batches of rows from this inverted score list. I have two relations I want to
join and compute the top K join.
I'm going to go to the ISL of each relation and I start bringing in batches of rows.
And when I bring these batches of rows I can perform your favorite algorithm for
centrally producing a top K result. In this case the algorithm will be
pretty much the standard rank join algorithm in the area.
So when I bring a new batch of rows, I will check every row there against all the
rows I brought previously to see if there's a join. And if there is a join I will
compute the aggregated score for the join, and then I will check to see if there's a
chance that any other records that I have not brought yet can make it into the top
K result. Okay. And if so I will continue fetching batches. If not, I will stop.
This is basically the Fagin-style threshold algorithm applied
to the traditional top K query.
>>: What needs to be true in order for you to be able to stop? What needs to be
true about the aggregation --
>> Peter Triantafillou: It has to be monotonic. You're right. I wrote that but I did
not mention it. So here is a typical example of how this works. I'm assuming
here that the batches are just -- I'm bringing in a row at a time. So I'm going now --
for the first rows I'm looking at the join attribute values. There is no match there.
So I'm going to bring the second row and I'm going to try to match this with both
of these rows there. I see equal join attribute values, so there's a join. I
compute the score by adding the different scores there.
So that's one result. Then I'm also going to join this one with the first one from the
other relation; I see a join on attribute value 12, and I'm also
aggregating the scores there. So this one is obviously the highest one. So this
is my top K, my current top-one join result. And if we look closely we'll see that
nothing else that I can bring from this row down can have a score higher
than 183. That's why I stop. This is the threshold criterion. Okay. So this is
pretty simple stuff. And we'll see that it works very nicely. And you can do
this in any distributed system, even with DHTs -- I've seen peer-to-peer systems do
this. It's real easy to do. You take what other people have done for
centralized processing, you bring the data in with batches, and it works.
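To make the batched strategy concrete, here is a minimal Python sketch; the details are assumptions, not the authors' code, and the stopping rule shown is the usual rank join corner bound, which may differ slightly from the exact criterion used in the talk.

import heapq

def rank_join_topk(isl_r, isl_s, k, batch=2):
    """isl_r, isl_s: lists of (score, join_value, row_key), sorted by score descending,
    standing in for the two inverted score lists stored in HBase."""
    seen_r, seen_s, topk = [], [], []            # topk: min-heap of (aggregate score, pair)
    i = j = 0
    while i < len(isl_r) or j < len(isl_s):
        new_r, new_s = isl_r[i:i + batch], isl_s[j:j + batch]
        i, j = i + len(new_r), j + len(new_s)
        # join the newly fetched rows against everything seen so far on the other side
        for sr, vr, kr in new_r:
            for ss, vs, ks in seen_s + new_s:
                if vr == vs:
                    heapq.heappush(topk, (sr + ss, (kr, ks)))
        for ss, vs, ks in new_s:
            for sr, vr, kr in seen_r:
                if vr == vs:
                    heapq.heappush(topk, (sr + ss, (kr, ks)))
        seen_r += new_r
        seen_s += new_s
        while len(topk) > k:
            heapq.heappop(topk)
        # best aggregate any not-yet-fetched pair could still reach (sum is monotonic)
        last_r = seen_r[-1][0] if i < len(isl_r) else float("-inf")
        last_s = seen_s[-1][0] if j < len(isl_s) else float("-inf")
        top_r = isl_r[0][0] if isl_r else float("-inf")
        top_s = isl_s[0][0] if isl_s else float("-inf")
        if len(topk) == k and topk[0][0] >= max(last_r + top_s, top_r + last_s):
            break                                 # nothing unseen can beat the current top k
    return sorted(topk, reverse=True)

R = [(95, "x", "r1"), (90, "y", "r2"), (40, "x", "r3"), (10, "z", "r4")]
S = [(93, "y", "s1"), (88, "x", "s2"), (30, "z", "s3"), (5, "y", "s4")]
print(rank_join_topk(R, S, k=1))   # [(183, ('r2', 's1'))] -- stops after the first batches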
So the bad thing with the previous algorithm is that you bring in tuples and you
don't even know if they're going to join. Right? So depending on the
distributions of the scores and the join attribute values, this may really kill you.
Okay. So the goal is to try to bring in only those tuples for which you have a pretty
good guarantee they're going to be in the join result. And here's where our
statistical structures come into play. We use histograms where the buckets of
the histograms reflect score ranges. Okay. And what we're going to be putting
into those histograms are join attribute values. All of it will become more clear in
a minute.
And I don't want to just keep the frequency in the histogram, because then I have
to make assumptions about the distribution of the join attribute values within the
score range, and I don't want to do that because in practice those do not work
very nicely. So what I'm going to do is keep every single attribute value
that went into a particular bucket. But
because this can be huge, I'm going to use a bloom filter to summarize this data.
So here is what the structure looks like. I have two relations here. I'm
showing only the join attribute value and the score value. So if I were to build a
bloom filter histogram matrix, as we call it, for R1, what are we doing? We're
processing one row at a time. So we see that the join attribute value is A,
and the score is one. I have a bucket for every score
range, every tenth of a score; say scores are normalized in this case. Okay.
So this here is a bloom filter representing the contents of the bucket.
So I'm going to hash A; it's going to hit this bit position in the bloom filter,
and I set that bit. In this particular example I'm using
counting bloom filters. We see from the score that this one refers to the first bucket; I'm
going to go to this bloom filter and set the bit that corresponds to the hash of C.
Okay. So I keep doing that for all of these. Here 0.82 falls into this bucket;
I hash B, I go here and set the bit in the corresponding bloom filter. This is the basic
idea. I keep doing this. I thought I had skipped that animation. So please bear
with me for a minute.
Okay. And I've also built, similarly, the equivalent buckets using bloom filters
for the other relation. Now, what's going to be going on during query processing
is I'm going to be fetching a bucket at a time, starting from the high-end buckets.
First I'm going to fetch the bucket that refers to everything that has a score
between .9 and 1 from this relation and from this relation. Looking at those bloom
filters, if I do a bitwise AND I know whether there's a join. Here there's no join: I have
something that passes here, something that passes here, but no common bit.
So I don't have to bring anything. Now, when I go
and fetch this bucket here, I will compare it with the contents of this one and do a
bitwise AND again, and I will see here that in this position I have two tuples of the
second relation that have a join attribute value hashing to this position, and one
tuple from the first relation whose join attribute value hashed there,
so I have a join result here. Actually I have two join results: I have a tuple
from this one with, in this case, value B, and one tuple from that one with
value B.
Okay. So I have a join output because B is the join attribute value. And I have a
tuple from here with join attribute value B and two tuples from there with join attribute value B.
>>: Why? Because these are not exact?
>> Peter Triantafillou: I have a counting bloom filter, but you're right, there are
false positives. So it could be a bit more, 2.02 or something. I'm on the safe
side. I may bring something more, but I'm not going to miss anything because of
the false positives.
Let's forget the false positives because that's actually a big discussion here.
Here I have a structure that can tell me when I have a join result. What I also have
now is I can associate with these two join results a high score and a low
score, using basically the score ranges of the buckets. And I use those for my
threshold criterion to decide if I need to bring more buckets. Okay. So this is how
it works. And so the idea is we create a bloom filter for each one of the score
ranges that we care about. And we store it as one column in a row. This is a
blob of bits.
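Here is a minimal sketch of that construction; the bit width, hash functions, and bucket granularity are hypothetical, and in the real system each bucket's filter would be stored as a blob column in an HBase row.

import hashlib

M = 64              # bits per bloom filter (hypothetical)
NUM_HASHES = 3      # hash functions per value (hypothetical)

def positions(value):
    """Map a join attribute value to NUM_HASHES bit positions."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(NUM_HASHES)]

def build_bfh(rows, num_buckets=10):
    """rows: (join value, score) pairs with scores normalized to [0, 1].
    Returns one bloom filter (as an int bitmap) per score range of width 0.1."""
    buckets = [0] * num_buckets
    for join_value, score in rows:
        b = min(int(score * num_buckets), num_buckets - 1)   # which score bucket
        for p in positions(join_value):
            buckets[b] |= 1 << p                             # set the bits for this value
    return buckets

r1 = [("A", 0.12), ("C", 0.07), ("B", 0.82)]
bfh_r1 = build_bfh(r1)
print(bin(bfh_r1[8]))   # the [0.8, 0.9) bucket holds the bits for join value B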
Actually, it's a big discussion here whether you're going to have counting bloom filters
or plain bloom filters, how many hash functions you're going to
use, and which compression algorithm you're going to use here. So we've
done, believe me, a lot of work here, and we're actually going with plain bloom
filters with a compressed representation of them. I don't know all the
details; my colleague Nikos figured all of this out. I designed the structure and the algorithm.
So the other idea, then, is we need a reverse mapping. Let me just explain what
that is. So here I have my relation. I'm hashing these values and I'm putting
them into my bloom filter, and here are the different positions. So D hashes into
position 7, for example, and I'm keeping track that I have two
things that hashed into here. So I also maintain what we call the reverse mapping.
In other words, when the query node that collects these bloom
filters sees that at position K I have a join result, there has to be a mechanism so
that you can go back and say, give me whatever hashed into position K.
So here basically I have a table where the keys are the position indices, the
nonzero positions in my bloom filter. So if somebody comes to this guy and says,
give me whatever hashed into position 100 of your bloom filter because I have a match,
he can go here and grab all the relevant tuple information, such as the tuple ID,
the join attribute value, and the exact score.
Okay. Again, this is a nice table in HBase: quick gets, multi-gets and all of
that. So the algorithm works like this. We fetch a bucket from each of the tables
starting from the high end of the range. We compute basically an AND of the
different bloom filters, and we use the scores to figure out, applying the
threshold algorithm, whether we need to bring more buckets in. If so, we repeat the process,
else we stop. For the join -- there is no join yet -- we basically do a bitwise
AND and we see whether or not we have results for the join, and where
we have results for the join we go back to the table and say, give me all the
tuples, the tuple IDs, that hash into position 100, like I was saying earlier.
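A sketch of the probe side for one pair of buckets; all names and layouts are assumed, and the threshold test that decides whether to fetch lower buckets at all is omitted here.

import hashlib

M, NUM_HASHES = 64, 3

def positions(value):
    digest = hashlib.sha256(str(value).encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(NUM_HASHES)]

def add(bloom, value):
    for p in positions(value):
        bloom |= 1 << p
    return bloom

# tuples in the [0.8, 0.9) bucket of each relation: (tuple id, join value, score)
bucket_r = [("r7", "B", 0.82), ("r9", "E", 0.85)]
bucket_s = [("s2", "B", 0.88), ("s5", "B", 0.81)]

bloom_r = bloom_s = 0
reverse_r, reverse_s = {}, {}           # bit position -> tuples that hashed there
for tid, v, sc in bucket_r:
    bloom_r = add(bloom_r, v)
    for p in positions(v):
        reverse_r.setdefault(p, []).append((tid, v, sc))
for tid, v, sc in bucket_s:
    bloom_s = add(bloom_s, v)
    for p in positions(v):
        reverse_s.setdefault(p, []).append((tid, v, sc))

common = bloom_r & bloom_s              # bitwise AND: nonzero means possible joins
results = set()
for p in (i for i in range(M) if common >> i & 1):
    for tid_r, v_r, s_r in reverse_r.get(p, []):
        for tid_s, v_s, s_s in reverse_s.get(p, []):
            if v_r == v_s:              # drop bloom filter false positives
                results.add((round(s_r + s_s, 2), tid_r, tid_s))

print(sorted(results, reverse=True))    # [(1.7, 'r7', 's2'), (1.63, 'r7', 's5')]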
Okay. So again for the data, for the performance, we took from TPC-H
two relations, the lineitem one and the [inaudible], and created this synthetic one for
reasons I'll explain shortly. We have pretty much the same setup, and we're
doing top K joins where K is ten up to 10,000. We also have Hive on MapReduce
implemented; it was too slow, I'm not going to bother showing it. And the simple idea
with the inverted score list, the rank join algorithm, uses batches referring to 1 percent,
5 percent and 10 percent of the complete score mass. Okay. So here is what the
results look like. The green bars refer to the bloom filter histogram approach,
and here we see that the simple idea, the ISL idea, works nicely, in the sense
that there's always at least one configuration, in this case the 1 percent score
mass configuration, that always beats the bloom filter histogram in terms of query times.
Okay. In terms of bandwidth, though -- and here's back to the point that you made
earlier -- this is not the case. The compression, of course:
bloom filters are summary structures. Anyway, we see 10 to 30 times
improvement. And so this would tell me something about throughput, and it
would definitely tell me something about cost, which you also brought up, David,
because most of the charge models charge you for whatever operations you're
doing. For example, with the new DynamoDB charge model, if I'm
doing an operation, I'm provisioning for a specific read throughput capacity, as they
call it. So for every read of one kilobyte I do, I pay. The more kilobytes you have
to read for a query, the more this translates into bucks.
So this is not exactly what you asked about in terms of throughput and
exact dollar values, but it gives some indication towards that.
>>: So isn't this kind of an algorithm -- why wouldn't this kind of algorithm apply
simply to a normal cluster-based distributed database system?
>> Peter Triantafillou: I think that it would. I think this particular statistical
structure would work, yes. It just happened that the environment we looked at
was cloud key value stores, but I think it would work for regular SQL
databases as well.
So now an interesting thing is looking at the score distribution: this is actually the
different scores and how many tuples from the original relations fell there. So
the reason why the simple idea beat the bloom filter histograms was that it
could stop somewhere around there, and it did not bring a lot of stuff before it
stopped.
So what we did was we flipped this around -- that's the synthetic
part of the relation -- and indeed we see that this guy is now always
better, even in time. In terms of bandwidth we show even bigger, up to 50
times, savings.
So, conclusions. A first crack at rank joins over key value stores. We think, as David
pointed out, that the statistical structures can be used in other environments as
well. The reason why we particularly like the ISL is that it's a simple idea:
everybody in CS is comfortable with building inverted indices, and it seems to be
doing a good job. There are some configuration issues as to how much to bring
with every batch, but again there is no system that doesn't have this tuning
parameter problem. With respect to the bloom filter histogram matrix, we
have bandwidth savings that translate into cost, and the query times vary
depending on whether bringing in the big filters actually pays off.
So if the distributions are such that you have to go deep, you may have to bring
the whole filter, and that may actually not pay off. So actually we're looking to
find really bad distributions and negative correlations between the join attribute values
and scores, in order to see when this will not be better compared to the baseline
approach.
So for the third part of the talk, for the five, six minutes that I still have, I want to
basically go through a number of things that are beyond the traditional
conclusions that one sees in this type of research, things that we've learned
working with key value stores and trying to build indices for them.
So what we do is we have a key value representation of indices; that means, whatever
the index is, we put it in a table. Everything for us, every index, is an HBase
table. We use the API provided by HBase, or whatever key value store, to get at it.
So what does an index need, especially for a key value environment, for this type
of application? You need fast additions to a list: you have a value, and something
else associated with that value, and something else associated with that value.
This is easily done by HBase, right, by most key value stores. You can easily add columns.
You can do accesses, either for exact match, with simple get operations or
multi-get operations, or with scans. And again all of these are actually optimized;
key value stores exist for these types of workloads. So they're great for building your
indices. And this is actually a big departure from related work. Most of the
related work doesn't bother building indices: you have a MapReduce job doing
joins with different optimizations. Or, when they try to build some index, there have
been a couple of works that basically go too low, into the block level of the disk.
And then you have problems with updates, or complete indexes disappearing, or if
you want to delete an index, how do you delete an index?
You're left with big holes in your actual physical storage of the data.
>>: Isn't this an argument against going with an index -- it's so costly to build the indexes
at the start that you really have a hard time amortizing that cost over the queries --
>> Peter Triantafillou: You're right. This is a classical problem. Should I build an
index for this or should I not build an index for this.
One of the things we're looking at is whether, if for some reason you were to do a MapReduce
job that is so long, it would be faster to build the index first and use it for that query.
I don't have any hard numbers on this, but it could definitely be the case. You
can easily imagine these huge long-running MapReduce jobs: build the index for
20, 30 minutes, an hour, and run your query, which would take a few seconds. It
would be better than 15 hours of running the MapReduce job.
So that's one possible answer. But the idea is, what we really need here, and I'm
trying to get to that later, is some kind of an optimizer. Right? Should I build the
index? If I build it, should I use it or not? And I'll get to that at a later point.
So this other thing, which is actually closer to my heart now, is that there was this old
divide between main memory and disk indices. There's a lot of smart people
that work on computational geometry problems, CS theorists, and they come
up with nice data structures. The database community never looked at those because they
were meant for main memory applications. When main memory databases got a big lift back
in the '90s, some people started looking at them.
There was this old divide. So my point is, if you're using key value representations
of these, most of those problems go away. There is no divide there anymore.
Why? Because my index pointer just points to a row, and this row can have as
many columns as you want representing the values that fall into this particular row.
So I don't have to worry about node splits and merging and fill factors for my block
on disk, any of that. HBase takes care of that, and it does a good job of taking
care of that.
So --
>>: So wouldn't that have some -- wouldn't the details have some impact on your potential
performance, or is everything washed out given the cloud latencies?
>> Peter Triantafillou: Partly. But okay, so the potential performance is -- the
potential performance is, when I add stuff, I add columns. This is done really fast,
because key value stores do this. When I retrieve stuff, I just do gets on columns, for which
again there are physical indices that can do this really fast, or I do scans. So I'm
not really paying a lot. So the whole point of building these indices is to surgically
go and access specific rows, or segments of rows together using scans, and the
key value stores are fast at these things.
>>: So I have trouble fully buying what you're saying, because if that were true,
then that would seem to apply to normal database systems as well.
>> Peter Triantafillou: No, because there, you see, if I have a table I can't just add
columns to the table. It's a different data model. Here I can just add
columns. Okay, so say I want to find out which items the attribute value A is
associated with. As new items come in, I just add new columns for that row in my
key value store.
This is really fast, just a main memory access. And when I read it I just go and
get this value and I get all of these new columns that have been added. I
cannot do the same with SQL tables. So this is the new -- this, I think, is
something interesting that comes out of this. And I can make divides
like that go away.
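A tiny sketch of that point, with hypothetical names; in HBase this would be a Put adding a new column qualifier to the row keyed by the attribute value, riding on the store's fast write path.

index = {}                              # row key (attribute value) -> {column: item}

def add_posting(value, item_id):
    index.setdefault(value, {})[item_id] = 1    # one new column, no node splits to manage

def lookup(value):
    return list(index.get(value, {}))           # one get returns all columns of the row

add_posting("A", "item-17")
add_posting("A", "item-42")
print(lookup("A"))   # ['item-17', 'item-42']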
And the point is that there are a lot of smart people out there, like I said, that really
came up with these structures, but they're dismissed by the database community as
being old school or main-memory-only. So we can use these structures, and actually
in other work we're using some of these structures for massive datasets as well,
not in the normal database sense but at least on tables in key value stores.
Now, on big data indices and MapReduce: MapReduce will trump your index
if you have high query selectivities, in other words if the result is a big portion of
your data anyway. It'll do it in an embarrassingly parallel way, and if you're careful about how
you're doing it, so there's no copying and reading back between the
MapReduce jobs and all that, MapReduce will be the winner there. So, so to speak,
remember the cute title in the beginning, no MapReduce for NoSQL queries: it's
mostly talking about complementing MapReduce, and in fact our goal is -- and you're
welcome to tackle this problem together with us, any of you who are interested -- to
work towards an optimizer. A new query comes in; what statistics do I
need, and how do I decide whether to run a MapReduce job or build the index, as I said
earlier, and run the query on that?
And with a cost model plugged in, which is not an easy thing to do to the extent that
these things change a lot.
But there's also a throughput cost, because MapReduce may win with respect to
response time -- you have these massively, embarrassingly parallelizable
mappers and reducers going on across thousands of machines, so they might
actually do a very good job in response time -- but it might hurt your pocket.
Again, if the charge model charges for whatever you read, and you have to touch
every single record in the data store, you're going to have to pay for it. So your
win here might actually hurt your pocket.
Okay. So another thing is that key value stores are read challenged. That
means two things. One thing is that they're not read optimized; they were
basically write optimized. They were designed for write intensive applications, for
extremely high write throughput. This may sound -- I said two things. The first thing is
we need indices because of that: if we talk about queries and we're going to access
something that's read challenged, as I call it, then you'd better have indices to go
and get it fast. The other thing is that with works like this we are stressing the key
value stores with the indexing task, in the sense that we do
a lot of gets. And gets are reads, right? This is how you read from
a key value store, and those are not optimized.
So using trees, for example, with logarithmic depths might help you, especially
with respect to predictability: I'll do 30 gets to go down from the root to the leaf
for a billion nodes, for a billion-size dataset.
But still, if we talk about interval queries, like we saw in the segment tree
representation in the key value store, we still have to visit a big number of the
nodes, doing a get for each one of them, and that will still hurt performance. So I
think there is money to be made research-wise looking at alternative key value
representations where you basically store your trees, or whatever, in a way that you
can get at them quickly with scan operations -- how you play around with, how you
define your keys so that they will be stored together and a quick scan over there
will get the results without having to do specific gets sequentially.
Okay. So I've gotta stop. And thank you for inviting me. And I'd be glad to
answer any questions you have. [applause].
>> Paul Larson: Questions? No more, all right.
>>: I just have one, which is the segment index, so I assume one advantage of
the segment index is you can map it into the ordinary indexes that you see inside
key value stores.
>> Peter Triantafillou: Uh-huh, yes. I could do the same with the interval tree.
With, yeah, interval trees. Those are similar because they still have the same
atomic elementary intervals. We just chose segment trees for reasons you have
to do with efficiency even though they are more storage hungry than interval
trees.
And there's other things, right? What's being indexed
here in that work is a one dimensional object, an interval. So if, like work you've done
for example, you have the interval and something else, so
you combine a B tree on some key attribute and you want to have an
interval associated with that as well,
it would be interesting to see how that would map into a key value
representation as well.
>>: Did you consider indexing both start and endpoints and doing intersections to
find the results?
>> Peter Triantafillou: Well, this is -- the endpoint index is close to that. For every
interval, basically, I have a row for its start and another row for its endpoint.
So it's very close to that. And this is similar to back when the temporal databases
were at their height. There was -- the time index, I think one of them
was called.
So basically the idea was the same. It created the different endpoints, but would
build something like a B plus tree over that. Here, I don't have to do that,
because in essence, if I have this endpoint index, HBase does binary search on it.
So it's like I have a binary tree over my endpoints; I'm getting that for free.
So they're very similar ideas that have been played around with for some time. And
this is a very deep field, because just going into the history of the
temporal databases, with both validity time and transaction time, there's humongous
literature there. But all of them come down to the same thing -- to similar things,
rather.
>>: I'm not familiar with the actual implementation of HBase, so I was wondering
why you claim that they are read challenged?
>> Peter Triantafillou: All of them, all the key value stores, are optimized for
writes. What does that mean? It means when a write comes in, it just goes into a
MemStore somewhere. When this thing fills up, what happens is
it's sorted and put into a block ready to be written to disk, with an additional
index over it.
So as you have updates coming in all the time, you get several of these
so-called HFiles, and these can be sent to disk anywhere. So when a get comes
in, you may have to check all of these HFiles to figure out which has the most
recent value for that get.
>>: So using log structured techniques to handle --
>> Peter Triantafillou: In essence -- right, the log structured file system, for
example, is something that actually permeates all of this design philosophy.
>>: But HBase -- HBase is based on Bigtable, which uses log structured merge
trees. So those partition files should be merged back together.
>> Peter Triantafillou: Eventually. Eventually.
>>: So it's only if you're reading stuff that's been recently written that you have
this fragmentation problem where you've got to pull stuff from the recent log as
well as from the older ones.
>> Peter Triantafillou: You don't always have that problem. The idea, in the actual
implementation, is that they use bloom filters. So you're asking for a particular
row key, and they use bloom filters to decide which of the different HFiles you
have created -- which are the blobs of data written to disk -- actually have this
row in them. Okay. And then, if these HFiles have not been compacted back
into one big HFile -- if it's compacted you just grab the index of that and you go and
get the row; if it's not compacted you have to bring in more. So then you have
tuning decisions about how often the major compaction versus the minor compaction runs,
and so on and so forth. And eventually you get hiccups: when you do a lot of reads
over this, you see good performance and all of a sudden you get -- pff -- and
what's happening in the background is HBase is basically trying to reconcile things.
Once it's reconciled, everything is fast again.
>>: Do you have control over that in HBase or is that just --
>> Peter Triantafillou: I think you do, but I wouldn't bet my money on it.
>>: Configuration parameter.
>>: It's a big parameter.
>>: Specify how frequently that compaction occurs.
>> Peter Triantafillou: But you don't know, that's the point. You can't control
something you don't know. How the heck would you know what the right time is?
This is a classical problem for most systems. We have all these parameters and
we don't know how to tune them. HBase falls into that category. That's why I
call it a challenge.
>>: It's hard to solve, so we give you a parameter.
>> Peter Triantafillou: Yes. If you fine tune it, the algorithms I showed you
will be doing great, yes.
>>: You talked about the updates index as part of your talk. Did you perform any
experiments on what the scalability is, what kind of write rates you can handle?
>> Peter Triantafillou: No. This is actually on the to-do list. It makes absolute
sense. But it's basically whatever write rate HBase can give you. It makes perfect
sense to be able to know that. One of the reasons why we're actually doing this, in
other words staying at a high level, is because there's a lot of work going on here,
and it could be that three months from now, if we run the same experiment
you mentioned, we'll get different results, because it's a huge community, the
community is contributing stuff, and a lot of smart people are working here.
>> Paul Larson: All right. Let's thank the speaker and I think we're done.
>> Peter Triantafillou: Thank you.
[applause]