>> Ken Eguro: Good afternoon. Today it's my
pleasure to introduce Jens Teubner. He is leading
the database information systems group at TU Dortmund
University in Germany. His research interests
include -- well, are really data processing on modern
hardware platforms. This includes FPGAs, multi-core
platforms and hardware accelerated networks. So with
that I'll keep the introduction short. Jens.
>> Jens Teubner: Yeah, thanks a lot, and thanks for
all coming. And, of course, feel free to interrupt
me.
I've given this talk a little bit of a catchy title:
Sort versus Hash. That's one of the longstanding
debates in the database community, it's like a
pendulum swinging back and forth what is to be
preferred as an approach, and I'm going to look at
this discussion in a main memory context, so I didn't
exactly know how deep the database knowledge is in
that audience, so basically for all of the core
database operations, be it aggregation grouping,
duplicate elimination, or joins, they basically all
boil down to very similar implementation which can be
based on sorting or hashing.
In this talk I'm going to focus on joins, so I'll,
for the most part of the talk, I'll talk about joins,
so we want to implement joins in main memory as
efficiently as possible.
And again, there's basically two approaches. One of
them is to use hashing, so a hash join. I just, as a
recap, try to visualize it. What you do is you have
two input tables and what you do is you scan one of
the two input tables. While scanning the input
table, you populate a hash table with all the tuples
from R, and then in the second phase of the
algorithm, you scan the other relation, and for each
tuple probe into that hash table.
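To make that concrete, here is a minimal, single-threaded sketch of the canonical hash join just described: build a hash table over R, then scan S and probe. The Tuple layout, std::unordered_multimap, and function names are illustrative assumptions, not the implementation discussed in this talk.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { uint32_t key; uint32_t payload; };

// Classic in-memory hash join: build on R, probe with S.
std::vector<std::pair<Tuple, Tuple>>
hashJoin(const std::vector<Tuple>& R, const std::vector<Tuple>& S) {
    std::unordered_multimap<uint32_t, Tuple> ht;
    ht.reserve(R.size());
    for (const Tuple& r : R)                 // phase 1: scan R, populate table
        ht.emplace(r.key, r);

    std::vector<std::pair<Tuple, Tuple>> result;
    for (const Tuple& s : S) {               // phase 2: scan S, probe table
        auto range = ht.equal_range(s.key);
        for (auto it = range.first; it != range.second; ++it)
            result.push_back({it->second, s});
    }
    return result;                           // expected O(|R| + |S|) work
}
```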
Hash join is popular because it has nice complexity
properties. I wrote here a little bit fuzzily O of
N, so it's linear in the sum of the sizes of the two
input tables and it's very nice to parallelize.
We'll see that in a bit.
Alternatively, and that's what I think databases that
run off disks prefer today, is the so-called
sort-merge join, so what you do is you have your two
input tables. They are in some order. So as a first
step of the algorithm, you sort both input tables and
once they are sorted, joining all of a sudden becomes
trivial, by just merging the two, so that's the
so-called sort-merge join.
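And, for contrast, a correspondingly minimal sketch of the sort-merge join: sort both inputs on the key, then join them in a single merge pass. It reuses the hypothetical Tuple struct from the previous sketch; duplicate keys are handled by joining equal-key blocks.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sort-merge join: sort both inputs, then merge.
std::vector<std::pair<Tuple, Tuple>>
sortMergeJoin(std::vector<Tuple> R, std::vector<Tuple> S) {
    auto byKey = [](const Tuple& a, const Tuple& b) { return a.key < b.key; };
    std::sort(R.begin(), R.end(), byKey);    // O(|R| log |R|)
    std::sort(S.begin(), S.end(), byKey);    // O(|S| log |S|)

    std::vector<std::pair<Tuple, Tuple>> result;
    size_t i = 0, j = 0;
    while (i < R.size() && j < S.size()) {
        if      (R[i].key < S[j].key) ++i;
        else if (R[i].key > S[j].key) ++j;
        else {                               // join the two equal-key blocks
            size_t iEnd = i, jEnd = j;
            while (iEnd < R.size() && R[iEnd].key == R[i].key) ++iEnd;
            while (jEnd < S.size() && S[jEnd].key == S[j].key) ++jEnd;
            for (size_t a = i; a < iEnd; ++a)
                for (size_t b = j; b < jEnd; ++b)
                    result.push_back({R[a], S[b]});
            i = iEnd;
            j = jEnd;
        }
    }
    return result;
}
```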
It's used a lot, I think, when you deal with external
memory because databases have techniques to deal
with -- to sort arbitrary amounts of data quite
efficiently. Strictly speaking, its complexity
properties are not quite as nice, but we'll see
actually in the end they all boil down to be very
similar.
Now, I said I'm going to look at both sides in a main
memory context, so the question is which of the two
strategies do you want to prefer in a main memory
context. Classically, hash join has always been the
preferred choice because -- well, if you take random
access memory in its literal meaning, then this
random access pattern of the hash doesn't hurt you
when you run stuff from main memory, and then you
prefer the simplicity and the nice complexity
properties.
But the question is how does this look like on modern
hardware, and that has, like in the past years, led
to quite a bit of discussion in the community which
of the two strategies should be preferred. I think
one of the quite well-known papers here is one by a
team with people from Intel and from Oracle: they
did a study where they wanted to compare the two
and they tried to make predictions of what's going to
happen in the years to come.
And what those guys found out was that hash join, in
their setup, was the method of choice. It was about
two times faster than sorting, but they predicted
that as hardware emerges or evolves, it's quite
likely that very soon sort-merge is going to overtake
hashing in performance.
And that is because of the increasing SIMD width for
the vector instruction capabilities of modern CPUs.
Those tend to be a very good fit for sorting, but not
so much for hashing, that favors sort-merge join.
And their projections from about four years ago were
that if you have an architecture with 256-bit SIMD
registers, then both should be about the same speed,
and after that SIMD vector lengths continue to
increase. Sort-merge is going to be faster.
Now, more recently a group from Wisconsin, they
looked at hash joins only, and they sort of did
conclusions only within the hash join world. We're
going to see what exactly this statement here means.
What they basically said is that a relatively naive
implementation of hash join is superior
over implementations that are highly specialized to
modern hardware, highly optimized to modern hardware.
I listed that also because even more recently there
was another paper that did compare sort-merge and
hashing. That's a group from TU Munich. They took
these results here, did a transitive conclusion, so
they found that their sort-merge join algorithm is
faster than this hash join implementation, and by
transitivity, this is also faster than this one,
which means sort-merge join is the method of choice.
That's kind of what they say.
>>: So this all assumes that we have similar-size
relations on both sides or -- because you can make
hash join look better if you have a tiny build
side, right?
>> Jens Teubner: We're going to see that indeed a
little bit, right? And so, in fact, that kind of
brings us to what I want to do in this talk. I want
to -- of course, the comparisons on that previous
slides, they were really apple-to-oranges
comparisons, right? And particularly these
transitive conclusions, right? You choose
different -- you prove something on one workload and
then the transitive conclusion on another workload,
so from that you can conclude anything. So --
>>: So the N log N, of course, will eventually, if
the N is large enough, dominate any constant factor
which you might get from the linear one, so depends
on how big the relations are overall what you're
going to get.
>> Jens Teubner: Sorry. I didn't --
>>: Sort-merge is N log N.
>> Jens Teubner: Right.
>>: That will eventually dominate any algorithm and
in terms of its total cost would be larger.
>> Jens Teubner: Yes, but it turns out that
efficient hash join implementations typically also
you do -- we'll see that you do multi-pass hash joins
and then you end up with a log N factor again. So
it's not really clear whether that complexity
argument really makes sense. Okay.
So what I'm going to do is I'll try to -- I'll do
apples-to-apples comparisons and I'll sketch a little
bit, like implementation details that we found out to
matter and we try to -- or I tried to in the end make
a little bit conclusions about where are we in terms
of the sort versus hash discussion.
So the agenda for today is I'll first look at hash
joins in more detail, then the sort-merge joins, and
then I'll compare the two, yeah.
Okay. So again, hash join is nice because it's so
nice to parallelize, so what we're going to look at
is, of course, we look at modern hardware, so that
includes multi-core parallelism. Of course, this is
just a slide to show that hash join is basically
trivial. To parallelize, you just chunk out both
input relations into partitions that you assign to
individual cores and then you can do the hash table
build and the probe in parallel. Of course, still
one after the other, but then you can use all your
cores to do a join build and a join probe.
This in principle works very well. You do need a
locking mechanism here, because in the build phase
the hash table is accessed concurrently, but you
benefit here from the -- I mean, the cardinalities
are easily in the millions, right? And then the
chances of getting a collision on that hash table
are basically negligible.
So this parallelizes very well. You won't see any
lock contention in practice, just because there's so
many hash tables, hash table buckets in practice.
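As a rough illustration of this shared, latched hash table (the "no partitioning" join), here is a simplified sketch: each thread inserts its chunk of R into a shared table whose buckets carry a tiny spinlock latch. The bucket layout, the modulo hash, and the thread chunking are simplifying assumptions (and the Tuple struct is the one assumed earlier); the point is only why contention is negligible when there are millions of buckets.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Shared hash table with one tiny "latch" (spinlock) per bucket.
struct Bucket {
    std::atomic<bool> latch{false};
    std::vector<Tuple> tuples;          // chained entries (simplified)
};

struct SharedHashTable {
    std::vector<Bucket> buckets;
    explicit SharedHashTable(size_t n) : buckets(n) {}
    size_t slot(uint32_t key) const { return key % buckets.size(); }

    void insert(const Tuple& t) {
        Bucket& b = buckets[slot(t.key)];
        while (b.latch.exchange(true, std::memory_order_acquire)) {}  // spin
        b.tuples.push_back(t);
        b.latch.store(false, std::memory_order_release);
    }
};

// Build phase: chunk R across threads; with millions of buckets, two threads
// almost never hit the same latch, so contention is negligible in practice.
void parallelBuild(SharedHashTable& ht, const std::vector<Tuple>& R,
                   unsigned nThreads) {
    std::vector<std::thread> workers;
    size_t chunk = (R.size() + nThreads - 1) / nThreads;
    for (unsigned t = 0; t < nThreads; ++t)
        workers.emplace_back([&, t] {
            size_t begin = t * chunk;
            size_t end   = std::min(R.size(), begin + chunk);
            for (size_t i = begin; i < end; ++i) ht.insert(R[i]);
        });
    for (auto& w : workers) w.join();
}
```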
So this is like the basic hash join implementation
that we're going to look at first, and this is what
this group from Wisconsin claims to be the preferred
one because it's fast and because it's simple.
Now, if we first look at -- so what we did was we
took their implementation, analyzed it carefully, and
tried to see whether the comparisons that they make
are actually valid, whether they make sense or
whether they're doing apples-and-oranges comparisons.
And for that what we first did was we tried to see
whether there's any kind of situations where we can
improve their code.
And I have one example here just to show you that you
have to be very careful to implement such an
algorithm, otherwise you're going to suffer quite
significantly from facts in modern hardware, and one
of these examples is -- so what I'm showing here is
the hash table implementation that the people from
Wisconsin used.
So in order to implement that hash table, you need
some locking mechanism, so you have to -- they
implemented that as a separate latch array, and then
the hash table consists of a hash directory that
points into individual buckets that are allocated
from a heap.
Now, when you do any operation on that hash table,
what it means in the end is you need to access three
different locations in memory. You need to first get
a latch, then you need to follow the pointer, and
then read out your data, which amounts to three
memory accesses per tuple; and, of course, you can
also do that with just a single memory access per
tuple by simply collapsing that into a single data
structure.
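A sketch of what such a collapsed bucket might look like: latch, fill count, overflow link, and a few 16-byte tuples packed into one cache-line-sized struct, so a build or probe touches a single memory location instead of three. The exact field sizes here are illustrative assumptions.

```cpp
#include <atomic>
#include <cstdint>

// One cache-line-sized bucket: latch, fill count, overflow link, and a few
// 16-byte tuples all live together, so build/probe touches one location.
// Sizes are illustrative; the static_assert holds on common x86-64 compilers.
struct alignas(64) CollapsedBucket {
    std::atomic<bool> latch{false};
    uint8_t  count = 0;                  // tuples used in this bucket
    uint32_t next  = 0;                  // index of an overflow bucket, 0 = none
    struct { uint64_t key; uint64_t payload; } tuples[3];
};
static_assert(sizeof(CollapsedBucket) == 64, "bucket should span one cache line");
```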
And why that is important you see on this slide here,
so when you start analyzing the performance of that
implementation on modern hardware, what you see is
that -- so the amount of work that you have to do per
input tuple is actually extremely small. So looking
at the hash table build phase here, what you need to
spend is per tuple about 34 assembly instructions,
and now it makes a real difference whether during
those 34 assembly instructions you see one or three
cache misses.
In practice we saw something around 1.5 cache misses.
That was because -- well, it depends on how you cache
line, align your data. You can also bring that down
to one if you want to, and we saw TLB misses quite a
lot.
>>: So these numbers are [indiscernible] modified --
>> Jens Teubner: Exactly, yeah. These are for
this -- so here you see three misses. Here you see
1.5. The 1.5 comes from -- so if an aligned address
starts here, then this is the first row, and the next
one will come right after here and it will span over
two cache lines, and then you get two misses for just
a single read.
And then you can do a tradeoff. We play around a
little bit with that. You can -- well, we wanted to
keep the number of tuples per bucket consistent;
otherwise, you introduce another parameter which, you
know, you can vary around arbitrarily. You can blow
up a little bit so you spend more memory, which
causes you more cache miss -- more cache occupation
on one hand side, but with aligned access, you can
reduce the amount of cache misses per tuple, but
performance-wise it didn't make a huge difference.
Okay. So either way what you end up with is you have
in the order of 30 CPU cycles work to do and that -- well, and at the same time you need to do -- well,
1.5 memory accesses plus a bunch of TLB misses that
you'll suffer. So here you're easily talking about
200 cycles or so that you need, right? So in the
end, you're going to be severely latency-bound,
right? All the costs you're measuring is those cache
misses.
And that's why in practice or we think that such an
implementation on modern hardware won't make sense.
People have realized that earlier. One way to avoid
these many cache misses is to choose a partitioned
hash join instead.
So the idea there is that before you build up your
hash table, you break up your problem into
sub-problems, so you use hash partitioning to break
up your input relation in cache-sized chunks. You do
that for both input tables, huh? And then you
basically need to only join corresponding partitions,
and if those are small enough, then you can keep all
the processing for one such pair of partitions inside
the cache and you won't suffer any -- you won't see
any cache misses anymore here, huh.
So once you do that, you can -- yeah, each of those
joins basically runs out of caches and you will see
basically no cache misses anymore.
Again, you have to be careful when you implement
that. So what the Wisconsin guys did is they took
this very literally. They partitioned, build all the
hash tables, and then probed into all the hash
tables, huh? What this means is that you build up
this hash table, you build up all those other hash
tables, and eventually you're done with all your hash
tables.
And then you go back up here, huh? And by the time
you want to access that hash table, it's, of course,
gone long since from your cache, right?
If instead you interleave the build and probe phases,
you can reach a situation where you build out that
hash table, it sits in the cache. Then you probe
from that side, huh? You use the data that's readily
in the cache, and then you move on to the next
partition. And again, you can save a lot of cache
misses from that.
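A simplified sketch of that interleaved order: for each pair of co-partitions (R_i, S_i), build the hash table and probe it immediately while it is still cache-resident. The partition container and the use of std::unordered_multimap are placeholders, and the Tuple struct is the one assumed earlier; in the real algorithm each partition pair can also be handled by a separate core without any locking.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Both relations already split into co-partitions: parts[i] of R only needs
// to be joined with parts[i] of S, and each parts[i] fits in the cache.
struct PartitionedInput {
    std::vector<std::vector<Tuple>> parts;
};

size_t partitionedJoin(const PartitionedInput& R, const PartitionedInput& S) {
    size_t matches = 0;
    for (size_t i = 0; i < R.parts.size(); ++i) {
        std::unordered_multimap<uint32_t, Tuple> ht;   // cache-sized table
        ht.reserve(R.parts[i].size());
        for (const Tuple& r : R.parts[i])              // build over R_i ...
            ht.emplace(r.key, r);
        for (const Tuple& s : S.parts[i])              // ... and probe with S_i
            matches += ht.count(s.key);                //     while it is still hot
    }
    return matches;
}
```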
>>: So is this [inaudible] so you have both
relations [indiscernible]?
>> Jens Teubner: Yes, yes, yes. I mean,
throughout -- I assume everything's in main memory,
yeah, yeah. Yeah, yeah. Yeah. I didn't say that
explicitly enough. So what I'm assuming is I have
everything in main memory. We also assume like a
pretty much column-oriented model, so we assume very
narrow tuples, so you saw that here. So we assume in
that case, in that picture, we assume 16-byte tuples
only, eight --
>>: But if one of the sides is really another join,
then do you get stemming there?
>> Jens Teubner: You get what?
>>: One of the sides -- you can have multiple joins
going on, right?
>> Jens Teubner: Right.
>>: You can have R join S join T.
>> Jens Teubner: Right.
>>: The R join S fitting into the top join, you have
to keep the result of that also in memory, right, is
what you're saying?
>> Jens Teubner: Well, but a join is blocking
anyway. I mean, you anyway have to --
>>: But what I'm saying is it's good enough to have
R and S in memory. Now you have R join S also in
memory. To compute R join S joined --
>> Jens Teubner: Yeah, but you need to -- I mean,
either way, you need to keep the -- either way you --
when you build up the hash table over R join S, you
need to keep that in memory anyway, yeah.
>>: But [inaudible] extent, you would only keep one
of the sides in memory?
>> Jens Teubner: Okay. Okay. Yeah.
>>: Okay. [inaudible] space --
>> Jens Teubner: Okay. Yeah. I mean, this is
really assuming, you know, memory is cheap, huh?
Just have everything in memory, huh? Yeah. Which I
think, you know, is increasingly reasonable, I mean,
huh?
I mean, you guys are working on in memory processing,
as well, right? So. Okay. Now, once we do that, we
indeed basically avoided almost all cache misses,
yeah? You still have some -- I mean, at some point
you still have to bring stuff in cache, so you do see
a bunch of cache misses, but not many.
And at the same time, the nice thing of this -- about
this partition hash join is that after partitioning,
you end up with a lot of independent work, so no need
for a shared hash table anymore. You can just do the
R1 join S1, R2 join S2, you just do those on separate
cores, and there's no need to synchronize through
locks anymore.
>>: Can you explain a little bit more why it's
[inaudible] -- do you sort them or something or,
like, how are partitions created?
>> Jens Teubner: Well, you can only find matches
between here and here, but not between here and here,
okay?
>>: What way?
>> Jens Teubner: Because you hash-partitioned here.
>>: [inaudible] like a regular --
>> Jens Teubner: So you use your key value as a
partitioning function, all tuples with the same key
will go into the same partition and you use the same
partitioning key on both sides, okay?
So you will only find matches between, you know --
how can I say that? -- neighboring partitions here.
And then -- yeah, you can run this join on one core
and this join on another core and they won't
interfere at all with each other. No need to protect
with locks or anything.
And that makes your code even simpler, so we end up
with 15 to 21 instructions per tuple for build and
probe phases. Almost no cache misses anymore, so
this join -- I mean, we've reduced quite a bit of the
cost, yeah.
>>: So can you go back to the [inaudible] -- I'm
still not really -- I don't know exactly what the
difference between the picture here and the picture
you displayed. Is that each of your, like rows is
now big enough so that I don't need to follow
[inaudible] into a bucket? Is that the main
difference, or I'm lost.
>> Jens Teubner: Well, you -- because you
partitioned this stuff here, these hash tables are
now hash tables over just a fraction of the input.
In practice you don't do just four partitions. You
create thousands of partitions, okay?
>>: How is each partition subject to [indiscernible]
if I put the partition as a [indiscernible] I would
be in the same [indiscernible] instruction.
>> Jens Teubner: Oh, this is just a part of your
table. It's just, you know --
>>: So R4 plus S4 is [indiscernible] --
>> Jens Teubner: Yes, it's actually enough -- it's
enough to have the hash table in cache, so R4 has to
fit in the cache, assuming that the hash table is
approximately the same size as the table, right? The
hash table has to fit into the cache. That's enough,
right?
So you break up this input relation into chunks that
are cache sized. Suppose you have -- I don't know --
400 megabytes here, you have a four-megabyte cache,
you create 100 partitions, and then for each of the
partitions the hash table will fit in the cache, and
then you can run the hash join from the cache.
>>: You don't get any misses, TLB misses anywhere,
so the next time you said there's 0.0 --
>> Jens Teubner: Well, it's basically -- it's, of
course, you always do get some, right? There's
always some compulsory misses, right? But these are,
you know --
>>: How do you avoid this in the probe?
>> Jens Teubner: During probe, everything's in the
cache.
>>: But the assumption -- I mean, after your hash,
basically you have pointers with random addressees,
right? After you hash, they say that these pointers
are sequential.
>> Jens Teubner: Yeah, yeah, but they're all -- you
know, this is small. This fits into the cache and
once stuff is in the cache, then that's what the
cache is for, right? Once you keep your -- all
your -- I mean, on both sides, once you keep that
contained in the cache, then you won't suffer any
cache misses anymore, and the TLB also is enough to
cover the entire hash table if the hash table is
small enough, yeah.
>>: I think it's [indiscernible] but the
partitioning [indiscernible] --
>> Jens Teubner: Exactly, yeah, yeah.
>>: [inaudible]
>> Jens Teubner: That's what's coming next. Yeah.
That's what's coming next, yeah. So what I did not
yet tell you is that partitioning, itself, introduces
a new cost, and that's where the TLB indeed comes in.
So if you do partitioning, then it depends on the
number of partitions that you want to create
concurrently. Partitions are roughly cache-sized, so
it's safe to assume that one partition is larger than
a memory page. A memory page is just four kilobytes,
so in the end, each partition will reside on a
separate memory page, so you need a TLB entry if you
want to do partitioning, you need one TLB entry for
each of the partitions in order to write into those
partitions.
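For reference, a single-pass, single-threaded partitioning sketch along these lines: a histogram pass, a prefix sum to compute partition offsets, and a scatter pass that keeps one open write cursor per partition, which is where the one-TLB-entry-per-partition pressure comes from. The Tuple struct is the one assumed earlier; the multiplicative hash constant is just an illustrative choice.

```cpp
#include <cstdint>
#include <vector>

// Single-pass hash partitioning: histogram, prefix sum, scatter.
void hashPartition(const std::vector<Tuple>& in,
                   std::vector<Tuple>& out,           // same size as `in`
                   std::vector<size_t>& partStart,    // gets 2^radixBits + 1 offsets
                   unsigned radixBits) {
    size_t fanout = size_t(1) << radixBits;
    auto part = [&](uint32_t key) {
        return (uint64_t(key) * 0x9E3779B97F4A7C15ULL) >> (64 - radixBits);
    };

    std::vector<size_t> hist(fanout, 0);
    for (const Tuple& t : in) ++hist[part(t.key)];     // pass 1: histogram

    partStart.assign(fanout + 1, 0);                   // prefix sum -> offsets
    for (size_t p = 0; p < fanout; ++p) partStart[p + 1] = partStart[p] + hist[p];

    std::vector<size_t> cursor(partStart.begin(), partStart.end() - 1);
    out.resize(in.size());
    for (const Tuple& t : in)                          // pass 2: scatter; one open
        out[cursor[part(t.key)]++] = t;                // write cursor per partition
}
```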
And we have some experiments for that, so when you
start -- so when you do partitioning, so these are
assuming just a single partitioning thread, now you
vary the number of partitions that you want to
create, so this is the logarithm of the number of
partitions, so you create two to the power of four up
to two to the power of 16 partitions.
And what you can see here is that for a small number
of partitions, partitioning can be done very, very
efficiently, about 100 million tuples per second.
That's because partitioning it's -- conceptually it's
very little work. You just compute a hash value over
the key, that tells you where to put the tuple and
write it there and then you're done. It's not much
work. But once you do --
>>: But it does have cache-miss properties too, of
course.
>> Jens Teubner: Yeah, the cache misses are not
really problematic because you're talking about here,
let's say, two to the power of 14 addresses where
you're concurrently writing to, so you need about two
to the power of 14 cache lines, right?
And that's not much data. That's -- it's -- yeah,
it's a few kilobytes, so because you write into each
partition sequentially, right? It's like having --
whatever. If you would do that in a file system --
>>: Why are you assuming that you write in the partitions
[indiscernible]? I see how you could write it into
the partition sequentially, but the actual process of
partitioning is not sequential; that is to say --
>> Jens Teubner: Yeah.
>>: -- as you process a tuple --
>> Jens Teubner: Yeah, yeah, but for each partition,
you only need to keep like the next position in
memory and those next partitions, those aren't many.
Those are -- that's not much data in total. So you
can get that in a cache, right?
Like for each partition, the slot for the next tuple
to insert, that's just two to the power of 14, say,
slots. That's a few kilobytes of space, right? You
know, you can -- like in the classical database, you
can create -- you can create partitions over
terabytes of data with just a very small buffer pool
because all you need to keep in the buffer pool is
just the next page to write out for that partition,
but you don't need to keep the full partition or
something in the cache. Does that make sense?
You're skeptical still?
>>: Well, the total number of misses is the total
size of however many [indiscernible] you have divided
by the number of entries that you've used to fit on a
single cache [inaudible] effectively.
>> Jens Teubner: Yeah. But -- yeah. But, you know,
the -- how can I say that? You're writing into each
partition sequentially, so very likely you like have
for each of those partitions, the area around your
current writing position, you have that in your
cache, so you don't need to -- you don't -- and then
writing there gets quite efficient because you just
need to write into the cache, and then if that cache
line is full, it's being written out to memory, but
you don't even have to wait for that. You can still
continue -- it's write only.
>>: But the total number of entries divided by
however many [inaudible] --
>> Jens Teubner: Right.
>>: Fill, fill, fill, fill, now it's full. Now I've
got to chunk them out. Okay.
>> Jens Teubner: Right. But that's different, like
when you -- during a hash table build or probe where
you randomly put individual tuples in there, then
typically you have to bring in the cache line, do the
update, and bring out -- write out the cache line
again, because you don't write those -- you don't
have this sequential pattern as you have it in
partitioning. I don't know if I'm making sense.
Okay. Now, the problem is, of course, we want to
create many partitions because we want to have those
partitions small so they fit into the cache. The
smaller the better so we can bring them in all tour
[phonetic] or something. In practice in our
experience what we would like to be -- for the data
sets we used, you would like to be around here, so to
the power of 14 partitions for the workload
configurations that we had, this was a good choice.
But, you know, in this picture it doesn't matter much
whether, you know --
Now, what can we do about this? And the classic way
of doing that is multi-pass partitioning or also
called radix partitioning. It's basically an idea
that's been done also in disk-based systems. You're
limited by the number of partitions that you can
write concurrently. So if you can write at most, say
100 partitions concurrently, you can just do multiple
passes and refine your partitions again and again,
and this way you never exceed your maximum number of
partitions per partitioning pass and still end up
with as many partitions as you want in the end.
And that's where the log N factor gets into hash
join, right, because now you have to start
partitioning in multiple passes and the number of
passes is logarithmic in the amount of data that you
have.
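A sketch of that multi-pass (radix) refinement, again under the assumptions of the earlier sketches: each pass partitions on a different group of key bits, so the per-pass fan-out stays below your TLB and buffer budget while the total number of partitions multiplies across passes. The recursion and the bit slicing are illustrative; a real implementation would hash first and run the passes in parallel.

```cpp
#include <cstdint>
#include <vector>

// Multi-pass radix partitioning: each pass refines [begin, end) on a different
// group of key bits, keeping the fan-out per pass small.
void multiPassPartition(std::vector<Tuple>& data, size_t begin, size_t end,
                        unsigned shift, unsigned bitsPerPass,
                        unsigned passesLeft) {
    if (passesLeft == 0 || end - begin <= 1) return;
    size_t fanout = size_t(1) << bitsPerPass;
    auto part = [&](uint32_t key) { return (key >> shift) & (fanout - 1); };

    std::vector<size_t> hist(fanout, 0);
    for (size_t i = begin; i < end; ++i) ++hist[part(data[i].key)];

    std::vector<size_t> start(fanout + 1, begin);      // absolute offsets
    for (size_t p = 0; p < fanout; ++p) start[p + 1] = start[p] + hist[p];

    std::vector<Tuple> tmp(data.begin() + begin, data.begin() + end);
    std::vector<size_t> cursor(start.begin(), start.end() - 1);
    for (const Tuple& t : tmp) data[cursor[part(t.key)]++] = t;

    for (size_t p = 0; p < fanout; ++p)                // refine each sub-partition
        multiPassPartition(data, start[p], start[p + 1],
                           shift + bitsPerPass, bitsPerPass, passesLeft - 1);
}
```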
>>: In some respects it's all partitioning, the last
step included?
>> Jens Teubner: Yes, yes. It's true. It's true.
Yeah, it's basically the same operation, hash table
building and it's true. Absolutely.
>>: You can terminate it anytime you figure your
bucket is small enough.
>> Jens Teubner: It's true, yes, yes. Yes. It's
true.
Now, if you do that, it turns out the cost of an
additional pass, it's additional CPU work that you
do, but it saves you all those TLB misses, and as you
can see here for the area where we're interested in,
it gives a significant advantage in terms of the
partitioning cost, right?
You can do better even. The trick here is that -- so
this is how you implement partitioning easily. You
take the key, hash it, and then just write it where
it belongs to.
What you can do alternatively -- and then this is
always a write, right? And for that you need a TLB
entry for writing that element out to memory.
What you can do instead is like create like a little
buffer in between. I've written this as an algorithm
here. So you would hash your key, not write it out
directly to memory, but first fill up an in-cache
buffer, and only when that in-cache buffer is filled
up, then you flush out the entire buffer slot out to
your partition. So for each partition, you have a
little buffer, you fill up the buffer, and only when
the buffer is full, you flush it out to the
partition.
What this means is that if you can have something
like eight tuples in each buffer slot, you get a
memory operation or an operation that goes all the
way to memory that needs a TLB entry. You need that
only every eighth tuple, so you can push down the
number of TLB misses, and once you have sufficiently
few TLB misses, then they won't cause any harm
anymore because then the -- all the out of order
mechanisms in your CPU will be able to hide the
latency that you get from those TLB misses, right?
You can just keep processing while the hardware just
does the TLB miss for you.
In practice you choose your buffers to match the
cache line size because then you can do those
transfers particularly efficiently.
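A sketch of that software-managed buffer trick, under the same assumptions as the partitioning sketches above: tuples are collected in a small cache-resident buffer per partition and only flushed to the partition's output area when a roughly cache-line-sized buffer slot is full, so the write that needs a TLB entry for the output page happens only every few tuples. The write cursors are assumed to be initialized from a histogram prefix sum as before.

```cpp
#include <cstddef>
#include <vector>

// Software-managed write buffers for partitioning: collect tuples per partition
// in a small in-cache buffer and flush a whole slot at once.
constexpr size_t kBufTuples = 8;     // ~one 64-byte line of 8-byte tuples

struct PartitionBuffer {
    Tuple  slot[kBufTuples];
    size_t fill = 0;
};

void bufferedScatter(const std::vector<Tuple>& in, std::vector<Tuple>& out,
                     std::vector<size_t>& cursor,      // next write pos per partition
                     std::vector<PartitionBuffer>& buf,
                     unsigned radixBits) {
    auto part = [&](uint32_t key) { return key & ((1u << radixBits) - 1); };
    auto flush = [&](size_t p) {
        for (size_t i = 0; i < buf[p].fill; ++i)   // the only writes that go all
            out[cursor[p]++] = buf[p].slot[i];     // the way to the output pages
        buf[p].fill = 0;
    };
    for (const Tuple& t : in) {
        size_t p = part(t.key);
        buf[p].slot[buf[p].fill++] = t;
        if (buf[p].fill == kBufTuples) flush(p);
    }
    for (size_t p = 0; p < buf.size(); ++p) flush(p);  // drain partial buffers
}
```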
>>: The effectiveness of this buffer mechanism is
only as good as your hashing function, right?
Because if your hashing function is [indiscernible]
all over, you might wait too long and waste lots of
memory.
>> Jens Teubner: Yes.
>>: So was there any correlation, anything between
your actual data and the hashing function or --
>> Jens Teubner: Well, you know, we don't have -- as
an academic, we don't have real-world data, so we
were dependent on synthetic data anyway, right? Once
you operate on synthetic data, then, you know, you
can use anything for a hash function because, you
know -- yeah, you don't have that nasty type of
correlation in your data. So we kind of -- I cannot
really --
>>: It's true. I mean, it depends on whether you're
dealing with distinct values or not.
>> Jens Teubner: So what we did do --
>>: If you're not dealing with distinct values, then
you have all sorts of distributions.
>> Jens Teubner: Yeah. What we did do -- but I
claim that's kind of independent of the hash
function, right? If you have duplicate values in
your data, yes. We -- I didn't bring experiments
with me, but we did measurements with skewed data,
but still, then, you know, you don't have funny
correlations between the data that would affect the
hash tables, so how can you say that?
>>: [inaudible] you have sequential keys and your
hash is like CRC32 or something. Each load will go to a
totally different place. So this buffer won't be
usable in some sense. So --
>>: Oh, because the radix that you use for the hash
depends on how many you can fit.
>>: So I'm trying to understand was there any
knowledge of the incoming data to make the hashing
function friendly enough to leverage the buffer or --
>>: The buffer --
>> Jens Teubner: Not because we -- the moment you
generate synthetic data, I claim there always -- every hash function is friendly enough.
>>:
Okay.
>> Jens Teubner: Unless you build very funny -- I
mean, very artificial distributions where, you know,
if you -- so we used Zipf-distributed data, but we
permuted our alphabet beforehand. Otherwise, you do
get very strange effects, but they don't reflect
reality either.
>>:
Okay.
>> Jens Teubner: Yeah.
>>: Again, if you have nondistinct keys and they're
distributed in some way so that you get 50 percent of
the keys are all the same or 25 percent of the keys
are all the same, you're going to have very bad
skew --
>> Jens Teubner: Yes, yes, yes, yes. We did do
measurements. I didn't bring them with me, but I can
say a few words once we -- so actually, we'll soon
see a bunch of experiments, right.
So once you do this, it turns out that with a single
pass you can actually, over quite an interesting
range of partitioning fan-out, you can do with a
single pass even better than with radix partitioning,
which is I think -- well, up until now has been
always considered sort of the strategy of choice.
So putting this together, what I'm doing -- what I
did here is I show some measurements that we did
comparing this -- what the Wisconsin people call the
no-partitioning join, that is, a hardware-oblivious join,
not considering any caching effects or anything,
versus this radix join which does the partitioning
and all that stuff.
What you see is that partitioning, indeed, is
responsible for a large share of the cost, whereas
once you partitioned, doing the actual join has
negligible cost. So this is the number of CPU cycles
per output tuple of your join. And, yeah, in contrast
to that, because of the many cache misses, of course,
you lose a lot of time in the build and probe
phases of the no-partitioning algorithm.
This is a benchmark configuration that we took from
the Wisconsin paper, and to give you a feeling, so on
our Nehalem machine that we use it's like -- I think
it has four cores, you get in the order of
100 million tuples per second in join throughput.
>>: Do you have another chart which shows the
[inaudible] cache, I mean, through level cache size
for each? I was wondering if there was a correlation
between the cache size and the [indiscernible].
>> Jens Teubner: So -- I don't remember, no.
>>: [inaudible]
>> Jens Teubner: Yeah.
>>: Can't remember the numbers [indiscernible], so
I'm just wondering if there's a correlation between
cache size, this algorithm, and these --
>> Jens Teubner: Well, this difference here comes
from the Sandy Bridge having many more cores. So
this is a parallel implementation, and then you --
basically this is a measure of wall clock time, kind
of, so you have more cores, you need fewer cycles per
tuple.
>>: My assumption was the number of cores used was
fixed, was just measured --
>> Jens Teubner: No, no, no, we -- no. Yeah. So
this was -- so basically we used -- in all those
machines we used one socket but used that, I mean,
fully.
>>:
Okay.
>> Jens Teubner: And there was this question earlier
about the impact of -- first of all, so to show you
that, you know, we work carefully, so we really tried
hard to make both implementations as fast as we
could. We think we can show that by showing that our
implementation is roughly a factor of three faster
for both cases, approximately, as compared to that
existing paper.
Now, what you see here is that -- well, the Wisconsin
guys, based on a picture similar to that, actually,
they had just those two sides, they argued that the
hardware oblivious implementation is still preferable
over a hardware optimized implementation because
depending on the platform, chances are higher to
get -- to pay a penalty here as compared to the
little benefit you get here.
Now, once you go to a different workload, so this is
the workload that has earlier been used in the
literature, all of a sudden the game changes quite a
bit. So these are two equal sized relations and the
hardware-conscious implementation clearly has an edge
over the hardware-oblivious implementation.
The hardware-conscious implementation also scales
very nicely, so this is the Sandy Bridge machine with
eight cores on one socket, and if you start scaling
that up on a four-socket machine, you can reach cache
throughput -- join throughput of about 600 million
tuples per second.
There was the question previously about skewed input
data. It turns out that the hardware-optimized
version, this radix partitioning turns out to be much
more robust towards skewed distribution, so you get
about equal performance over quite a range, a wide
range of skew values.
For -- so the hardware-oblivious implementation is
quite skew sensitive. There you typically get an
advantage from skew because your caches get more
effective because you hit the same cache line more
often.
Okay. So our conclusion from that is that
hardware-oblivious algorithms, you know, if you
choose the right benchmark, they do look good, but --
so we have an ICDE paper on that. There's some more
experiments in there and those all indicate that with
hardware-conscious algorithm, you can typically do
much better.
And interestingly, we found that these
hardware-conscious implementations are extremely
robust also to the parameters that you choose, so
there's a few parameters in there, like cache sizes
and number of partitions and all that stuff, yeah?
And of course, getting those parameters right might
affect the performance that you see, but it turns out
it's actually the algorithms are quite robust toward
that, so a slight missetting of those parameters
won't kill your performance at all. It doesn't make
much of a difference.
Okay. Let's then have a look at sort-merge joins.
So sort-merge joins, essentially there all the cost
boils down to sorting. Once the data is sorted,
doing the join becomes almost trivial.
Sort-merge -- and then the sorting part, you
typically implement as merge sort. Sorry for the
name confusion, yeah? That is you first create some
presorted input runs and then you merge them like
recursively until you end up with a sorted -- with
sorted data.
In practice on modern hardware, I try to visualize
how this looks like in practice. In practice what
you do is you -- if you have a large system that's
going to have a NUMA memory architecture, so here on
the graph I assume you have four NUMA regions. What
you then -- or one way of doing the sorting then is
to perform first, so you assume the input data is
somehow distributed over all your NUMA regions. Then
you typically do first a local sorting within each
NUMA region to avoid memory traffic across the NUMA
regions.
Then if you want to obtain a globally sorted result,
which I'm assuming here, then what you need to do is
you need to merge across those NUMA regions as shown
here. So you do the two-way merge here and another
two-way merge -- well, two two-way merges here and
then another two-way merge here until you end up with
the sorted result.
>>: But you think that you have [inaudible], right?
>> Jens Teubner: Pardon?
>>: The penalty of local [indiscernible] NUMA, it's
being paid in a second step anyways?
>> Jens Teubner: Yes, yes, yes, yes. But, you know,
you want to avoid, you know, moving data back and
forth for something, right? So here at least
clearly -- you know, if that is the result that you
want to have, there's no way around getting this data
over here. So you need to cross that NUMA boundary
somehow, if that is what you want as a result.
>>: So you don't get everything completely sorted,
though, right? You can have --
>> Jens Teubner: Yeah, we'll get to that. We'll get
to that. That's true, yeah. Okay.
Now, I was already mentioning here the bandwidth that
you need to do the merging, right? And in fact, the
merging is a pretty critical part in that sort
operations.
If you -- so for both parts, for run generation and
for merging, what you can do is you can implement
them quite efficiently using SIMD acceleration based
on sorting networks, which is where this Intel work
said in the future this run generation and merging is
going to become even more efficient if you have wider
SIMD registers.
Just a comment here, you know, like in hash join
partitioning was similar to hash table build-up, here
you have a similar thing that merging for merge sort
is a very similar operation to the join afterward.
So there is a difference in the sense that it's
tricky for that part to use SIMD because your tuples
look differently and that makes it tricky to use
SIMD, but other than that, it's a very similar task.
So what we're now interested in is the merging part
mostly. Here I have some numbers for both parts, for
run generation and for merging in the merge sort
sense, yeah.
If you implement this well using SIMD operations,
assuming you create initial runs with four items each,
you can do this with 26 assembly instructions, so with
those 26 assembly instructions you can sort four sets of
data.
And then assuming four byte keys and four byte
values, you end up with reading in 64 bytes and
writing out 64 bytes. So in total you -- there is a
ratio of 26 assembly instructions to 128 bytes moved
between memory.
Similarly for the merge pass, so you can do merging
also with SIMD. There you -- for every four tuples,
you spend 16 SIMD operations and four load/store
instructions.
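To make the "merging with SIMD via sorting networks" idea concrete, here is a sketch of an in-register bitonic merge of two sorted groups of four 32-bit keys using SSE4.1 intrinsics. It handles keys only (the talk notes that carrying payloads is what makes SIMD tricky here), and its instruction counts differ from the numbers quoted above; it is an illustration, not the implementation measured in the talk.

```cpp
#include <smmintrin.h>   // SSE4.1: _mm_min_epi32 / _mm_max_epi32

// Merge two ascending-sorted vectors of four 32-bit keys (a, b) into an
// ascending 8-key result (outLo, outHi) with a bitonic merge network:
// three levels of min/max plus shuffles, no branches.
static inline void bitonicMerge4x4(__m128i a, __m128i b,
                                   __m128i* outLo, __m128i* outHi) {
    // Reverse b so that (a, reverse(b)) forms a bitonic sequence.
    b = _mm_shuffle_epi32(b, _MM_SHUFFLE(0, 1, 2, 3));

    // Level 1: compare elements that are 4 apart.
    __m128i L = _mm_min_epi32(a, b);
    __m128i H = _mm_max_epi32(a, b);

    // Level 2: compare elements that are 2 apart.
    __m128i v1 = _mm_unpacklo_epi64(L, H);   // [L0 L1 H0 H1]
    __m128i v2 = _mm_unpackhi_epi64(L, H);   // [L2 L3 H2 H3]
    __m128i m  = _mm_min_epi32(v1, v2);
    __m128i M  = _mm_max_epi32(v1, v2);

    // Level 3: compare adjacent elements.
    __m128i t1 = _mm_unpacklo_epi32(m, M);   // [m0 M0 m1 M1]
    __m128i t2 = _mm_unpackhi_epi32(m, M);   // [m2 M2 m3 M3]
    __m128i u1 = _mm_unpacklo_epi64(t1, t2); // [m0 M0 m2 M2]
    __m128i u2 = _mm_unpackhi_epi64(t1, t2); // [m1 M1 m3 M3]
    __m128i mm = _mm_min_epi32(u1, u2);
    __m128i MM = _mm_max_epi32(u1, u2);

    // Interleave back into sorted order.
    *outLo = _mm_unpacklo_epi32(mm, MM);
    *outHi = _mm_unpackhi_epi32(mm, MM);
}
```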
What that means in the end is that you do a lot of
memory I/O compared to the very few assembly
instructions that you're doing for the sorting in
between. So in practice, sort-merge join gets
severely bandwidth-bound, so you're going to be bound
by all this memory traffic.
And I'll get to -- there was this question about
global and local sorting. We'll see that in a bit
where this problem gets even worse when you do it
wrong. Yeah.
>>: Can you even do it using hash for sorting? Like
we're in the hash --
>>: Like [indiscernible] partition?
>>: And the partitioning and then you have the sort
[inaudible] on the hash [indiscernible] bucket.
>> Jens Teubner: Well, that's radix sort,
effectively, right? We didn't try radix sort.
There is -- I haven't, like, really verified that
yet. You can do quite well with radix sort as well.
I don't have a, like, apples-to-apples comparison
between the two. The group at Columbia is currently,
like, proposing that.
>>: It will be the same as creating the hash tables
as we did in the -- doing it in our previous hash
[indiscernible] and then look up and [indiscernible].
>> Jens Teubner: Ah-ha. You mean -- okay. Yeah.
No, we didn't try that. We didn't try that, yeah.
So that would be a combination of hash join and
sort-merge join. No, we didn't try that. That's
interesting.
It's a good idea.
Yeah.
Okay. So we end up with both phases actually being
severely bandwidth-bound. Now, what can we do about
that? Now, things are still easy for the run
generation part, so I said you need to write in and
out these 64-byte chunks. In practice, that's not
really a problem because what you can do is you can
like, you have this merge tree and you, like, take
entire subtrees from the bottom, do them in one go,
and then you have no real memory traffic.
So you can do chunks of up to the size of your cache.
You can sort them really in cache and then memory
bandwidth is not an issue as long as you're in cache.
But for continuous merging, when you want to have
your entire thing sorted, you don't get around -- you
don't get around doing something about the high
bandwidth need of merging.
Now, one thing you can do is you, rather than just
doing two-way merging, this is what makes your
bandwidth demand so high because you need to
repeatedly do the merge, so you repeatedly need to
basically reread the same data.
But at the same time, two-way merging is highly CPU
efficient because you can optimize it so well with
SIMD and so on. So it turns out the bandwidth cost
is so high that it makes sense to give up a little
bit of the CPU efficiency by doing what we call a
multi-way merging thing. So what we did was we do a
merging over -- I mean, an N way merging operation,
which internally is broken down into a number of
two-way merges, because the two-way merges, that's
exactly that thing that you can do efficiently with
SIMD, right?
And now a single thread basically walks over all
these pairs, all these merging pairs, merges a little
bit here, merges a little bit there, as long as there's
data there, right. It's like FIFOs in between,
little buffers, and then you try to -- this thread
walks over this merging tree and tries to produce
output, right?
And the idea is to keep all these intermediate
merging results, like streams, keep them within your
last level cache and have that thread again walk over
all these merges. Yeah.
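A much simplified, scalar sketch of the multi-way merging idea: each sorted run is read through a small, cache-resident buffer, so every input element is read from memory once and the merged output is written once, instead of being re-read in repeated two-way passes. The naive minimum selection here stands in for the cascade of SIMD two-way merges with FIFOs described above; names and buffer sizing are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// A sorted input run, read sequentially through a small in-cache buffer.
struct RunCursor {
    const uint32_t* data;               // sorted run (possibly in remote memory)
    size_t size;
    size_t pos = 0;                     // next element not yet buffered
    std::vector<uint32_t> buf;
    size_t bufPos = 0;

    RunCursor(const uint32_t* d, size_t n, size_t bufCap) : data(d), size(n) {
        buf.reserve(bufCap);
        refill(bufCap);
    }
    void refill(size_t bufCap) {
        buf.clear(); bufPos = 0;
        size_t n = std::min(bufCap, size - pos);
        buf.insert(buf.end(), data + pos, data + pos + n);   // one sequential read
        pos += n;
    }
    bool empty() const { return bufPos == buf.size() && pos == size; }
    uint32_t head() const { return buf[bufPos]; }
    void advance(size_t bufCap) {
        if (++bufPos == buf.size() && pos < size) refill(bufCap);
    }
};

// Merge k runs in one go; choose bufCap so that k * bufCap elements fit
// comfortably in the last-level cache.
void multiwayMerge(std::vector<RunCursor>& runs, std::vector<uint32_t>& out,
                   size_t bufCap) {
    for (;;) {
        int best = -1;
        for (size_t i = 0; i < runs.size(); ++i)   // naive min selection stands in
            if (!runs[i].empty() &&                // for the cascade of SIMD
                (best < 0 ||                       // two-way merges with FIFOs
                 runs[i].head() < runs[size_t(best)].head()))
                best = int(i);
        if (best < 0) break;
        out.push_back(runs[size_t(best)].head());
        runs[size_t(best)].advance(bufCap);
    }
}
```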
>>: Are we trying to merge all the sort
[indiscernible] that we have to create a bigger run?
>> Jens Teubner:
Yes, yes, yes, yes.
>>: So do you need to do that?
>> Jens Teubner: Well, yes. So I think two slides
on or so, so I think that kind of clarifies the
question. Yeah.
>>: Can you explain a little bit how this algorithm
works, because if I see that it's got [indiscernible]
you will not zero give me a buffer, even though it
will give me a buffer, which is bigger over this one.
213, same thing, right? So this transition from one
to the other NUMA to the same thread, you're paying
lots of cost for memory local, which is -- I mean,
effectively it wouldn't be as bad as the latency
issue.
>> Jens Teubner: Well, okay. First of all, again,
as I said previously, you cannot fully avoid crossing
the NUMA boundaries. So this thread sits on one NUMA
region, right? But at some point you have to merge
in data from the remote NUMA regions.
And by the way, this doesn't necessarily have to come
from separate NUMA regions. I just wrote it here.
The point is that you merge in from multiple sources.
Now, yeah. The thing here is that -- yeah, you save
overall bandwidth because originally you would have,
like, written out every -- here the merged results,
you would have written out that to memory, so you
save overall bandwidth. What you pay is that you
need a much more sophisticated algorithm that walks
over these binary merges. You have buffers in
between. You need to keep track of the fill level of
all of these buffers that causes quite a significant
CPU cost, but it saves bandwidth, and since we're
bandwidth-bound, there's a net saving in the end.
Okay.
>>: My only concern with this, this seems to be one
of a blocking. I mean, so when we're waiting on
buffers for other NUMA nodes, other NUMA nodes are
blocked basically, nobody's taking the buffers.
Right? I mean, if you think of this and I'm trying
to scale it into multiple threads, how can you ensure
that both threads are busy doing something useful
without blocking each other?
>> Jens Teubner: Well, you're not -- well, blocking
is probably not the right word. I mean, I agree,
while you're doing stuff here, there's no -- yeah,
right. But you have that in the -- when you just do
binary merging as well, right? You first do the
merging here and then you do the merging here.
And on top of that, I think the problem is not as bad
as it might sound, so -- because you start reading
here from that NUMA region, and reading from that,
and then the prefetcher is going to see your sequential
read and will start prefetching, right? So when you
move over here, the prefetcher is still going to prefetch
here and so --
>>: Once you move there, you're moving to a remote
node, which means --
>> Jens Teubner: No, no. The thread always stays on
the same --
>>: You're getting data from a remote node.
>> Jens Teubner: Yeah, but that doesn't make the
difference, right?
>>:
This has latency.
I mean, while maybe --
>> Jens Teubner: Yeah, but that latency is at least
partially hidden by prefetching. You merge in -- I
don't know -- 1,000 tuples from here and from here,
and then you go over here. But the prefetcher is
seeing, ah, he's just read 1,000 sequential tuples,
so let's just prefetch, huh? And then you do the
merging here, and by the time you come back, at least
parts of that data have already been prefetched, so
you don't suffer that full latency. You see what I'm
saying?
>>: My point is the local versus remote, right?
Accessing local memory versus accessing remote
memory, there's usually a three to four X ratio. I
mean, instead of doing [indiscernible] it accesses
memory from [indiscernible] three, it's got to be
three or four times slower than accessing memory from
the local NUMA. That's what I'm saying.
>> Jens Teubner: Yeah, I don't see -- I don't see
your point, really.
>>: So you're assuming that this -- these machines
support prefetch of things so that you can prefetch
something and then go off and do something else while
you wait for it to get loaded?
>> Jens Teubner: At least to some extent that's what
happened. Of course it's not, you know, it doesn't
magically do everything, right? But --
>>: I'm trying to decide what the difference was
between [indiscernible] the fan-out of a merge versus
simply increasing the depth of the buffers.
>> Jens Teubner: The thing is when you -- okay.
Basically -- how can I say that? If you have a low
fan-out, then you have to repeatedly, basically, swap
out the stuff to memory again and reread it from
memory. So that's why you need much more memory
bandwidth.
>>: So it doesn't affect the number of remote
accesses that you make, right? It's going to be --
that's going to be the same no matter what. It's the
number of local memory accesses that you make that
the fan-out changes.
>> Jens Teubner: Here you're doing basically --
you're reading all your input data once and writing
it out once. If you had done just a binary merge,
you would read, write, read this and write, and then
again read and write. So you would go twice over all
your data, and that costs bandwidth.
>>: Well, where are your buffers existing? I mean,
I'm a little bit confused. Where are the buffers?
>> Jens Teubner: Well, they end up being in the
cache of that thread, so that's why you choose your
buffers to have a size that matches your last level
cache.
>>: So all those buffers are in NUMA zero?
>> Jens Teubner: Yes, yes. Yes. Yes.
>>: So the data here has not been partitioned in any
way, so at the end we will need the one full socket run?
>> Jens Teubner: Yes.
>>: Okay. [inaudible]
>> Jens Teubner: Okay. One second. Okay. Just --
I have just one slide that shows the difference that
it makes, so in particular, once you scale up to more
threads, then what you see here is that at some point
you really get no benefit anymore from more threads
because you just need so much bandwidth, you
completely saturate the memory subsystem, whereas
with this multi-way merging, we keep scaling.
The reason why scaling stops here is not the memory
bottleneck. It's because here the machine only has
32 physical cores, so over here we start seeing hyper
threads, okay. So it's a different effect.
Okay. So now this question about total sorting
versus non-total sorting, so there's indeed different
strategies, like do you want a global sorting at all?
What do we do about NUMA?
And there's opposing views on that, so one of the
views is the one of TU Munich, which is what I'm
trying to illustrate here. So what the folks from
Munich do is they take their input relation and then
do range partitioning. So they -- basically it's
hash partitioning, but with, you know, an identity
hash function and you look at a certain number of
bits and then do range partitioning and then you
know, you know -- so all the blue -- all the low key
values end up here and so on. And then you sort
locally in each NUMA region, right? And then you end
up with a fully sorted result.
Now, the claim of the -- and here in the range
partitioning phase, you do have to cross the NUMA
boundaries, right? Whether you do this with merging
as I just sketched before or whether you do this with
range partitioning, that doesn't really make a
difference. You do have to cross the NUMA boundary.
>>: Using local sort and not using the caching will
actually increase our [indiscernible]. So what I was
talking about earlier that instead of this local sort
and partitioning [indiscernible] we could just do the
entire hashing and have several [inaudible] ready to
be merged.
>> Jens Teubner: I didn't quite get that.
>>: Okay.
>> Jens Teubner: So the strategy of this MPSM,
massively parallel sort-merge join, is to not sort
the other relation, not globally sort the other
relation, but just do a local sort. Then you end up
with blue, red, green, and yellow data on each of the
cores and then you do the join as follows. On each
of the cores, of course, you do the same thing. You
do a sequence of joins, merge joins where you merge
this guy with that guy, and you then merge that upper
red guy with that red guy and so on, okay.
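A conceptual sketch of that MPSM-style probe, seen from one core: the core owns one range-partitioned, sorted chunk of R and merge-joins it against the locally sorted S run of every NUMA region in turn, so S is never globally sorted. It reuses the hypothetical Tuple struct from the earlier sketches and only counts matches to keep the sketch short.

```cpp
#include <cstddef>
#include <vector>

// Count matches of two sorted inputs with a single merge pass.
static size_t mergeCountMatches(const std::vector<Tuple>& R,
                                const std::vector<Tuple>& S) {
    size_t i = 0, j = 0, matches = 0;
    while (i < R.size() && j < S.size()) {
        if      (R[i].key < S[j].key) ++i;
        else if (R[i].key > S[j].key) ++j;
        else {                                   // equal-key blocks
            size_t iEnd = i, jEnd = j;
            while (iEnd < R.size() && R[iEnd].key == R[i].key) ++iEnd;
            while (jEnd < S.size() && S[jEnd].key == S[j].key) ++jEnd;
            matches += (iEnd - i) * (jEnd - j);
            i = iEnd;
            j = jEnd;
        }
    }
    return matches;
}

// One core's work in the MPSM scheme: its own sorted, range-partitioned chunk
// of R is merge-joined against the locally sorted S run of every NUMA region.
size_t mpsmProbeOneCore(const std::vector<Tuple>& mySortedRRange,
                        const std::vector<std::vector<Tuple>>& localSRuns) {
    size_t matches = 0;
    for (const auto& sRun : localSRuns)          // one sorted S run per region
        matches += mergeCountMatches(mySortedRRange, sRun);
    return matches;
}
```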
So that's one possible way of doing that. We'll see
performance numbers in a minute.
The one downside of that is you don't end up with a
globally sorted result because you're merging in
values from the same range multiple times and then
those won't be globally sorted, okay.
Alternatively what you can do is you can do local
sorting first and then do the -- achieve a global
sorting with this merging operation, as I sketched
before, right? And then you end up with a globally
sorted result. This is kind of -- you can either
choose this or you can do the range partitioning and
then the local sorting first. It doesn't make so
much of a difference.
What we're proposing is to do the same thing for the
other relation, as well, so create globally sorted
data for both tables and then you can compute the
join completely local, so you don't have to do the
merging, the merge sort merging across the NUMA
boundaries, okay.
>>: So how much is data skew a problem here? It
could be hard to get the partitioning right?
>> Jens Teubner: Actually, that's a good point. We
haven't really measured the skew problem yet. So
the -- with this range partitioning what you're doing
is you're building -- anyway, for partitioning, you
need to first build up a histogram of your data so
you know how large your partitions are going to be
because you need to allocate them in memory properly.
And then based on those statistics, you also can
choose your partition boundaries properly.
If you're doing something like that, you have to do
it slightly different, but at least here you have
sorted data so you know something about the data
distribution and then can choose your partition
boundaries properly. But that's something indeed
where with skew, that's something that's very high on
our priority list. It's true.
>>: So I have a question. The assumption here is
that both tables fit in memory, right?
>> Jens Teubner: Yes. So putting things together,
so previously I was talking about hash join alone,
and there we saw that this radix join has an edge
over -- I call it here "naive". It's the
hardware-oblivious implementation.
Here for a larger data set that we looked at before,
this is the data set that was used by the TUM guys,
but you know, we saw similar numbers previously also
for smaller data sets.
Sort-merge, in comparison to that, what I'm showing
here is two sort-merge implementations. Basically
it's the two strategies that I had on the slide
before. It's the MPSM strategy and the one that uses
total global sorting. So this is MPSM and this is
global sorting. So our experiments show that global
sorting is actually advantageous in the end.
>>:
What's the axis?
>> Jens Teubner: Oh, sorry. Oh, yeah, I forgot to
label the Y axis. That's throughput. That's million
output tuples per second. Sorry. Oh, yeah, that's
good point. Yeah. Sorry. Yeah.
>>: So one question. For this type of join, how
many rows qualify or [indiscernible] qualify?
>> Jens Teubner: Yeah, you have -- it's like --
there's a one-to-one match, yeah, yeah.
Okay. So what you can see here is that these two
implementations don't yet use SIMD acceleration.
Introducing SIMD brings quite a bit of an advantage,
and what we found makes a real difference is this
multi-way merging, so saving bandwidth really in the
end helps you quite a bit. And well implemented,
what we're seeing in our experiments is that
sort-merge gets close to hash but it's not quite as
fast as hash.
We previously had the situation that you have to be a
little bit careful about the data sets where you
reason over it, so this is the experiment that I just
showed before. If you take a different data set, the
picture does change. So this is the data set that we
saw before, gigabyte joined with a gigabyte. Here
are the -- well, this is to the advantage of the hash
join.
One thing that I would like to note, when you -- I've
seen several papers and talks about all this topic, I
was often surprised how confusingly you can present
results, so what I'm showing here is how nicely you
can fool readers.
So a join, it's really unclear what is your -- what
is really your metric. Do you choose -- do you -- so
do you measure input tuples or output tuples per
second? And most of the papers chose to use output
tuples per second. The TUM paper chooses -- no,
actually, I did something wrong here now. No, this
is output and this is input. So this must be -- this
must be output and this must be input. Sorry.
So depending on what your metric is, you get really
different scaling properties. So what's being varied
here is that you keep one relation constant and you
increase the size of the other relation, so you
change the relative size.
And if you count output tuples per second, which was
the metric on all the slides before and in most of
the papers, then you're seeing these characteristics,
and if you count input tuples per second, you're
seeing this one here. Of course, the input tuples,
these are more than output tuples. So that's why you
get higher values here.
When you sit in a talk and see something saying --
someone saying I have this many hundred thousand
tuples, 100 million tuples per second, be careful
what he's talking about.
>>: Question. On the SIMD permutation, did you use
AVX or --
>> Jens Teubner: Yes, yes.
>>: [inaudible]
>> Jens Teubner: What?
>>: You get AVX1 actually, right?
>> Jens Teubner: Yeah, AVX2 is not yet available,
right?
>>: [inaudible]
>> Jens Teubner: Yeah, yeah. So we're actually
looking very much forward to try out on -- yeah.
>>:
Okay.
>> Jens Teubner: Yeah, we don't have hardware yet,
but we are very much looking forward.
We actually played around a little bit. There was
this question before about if sort-merge starts to
become attractive for joins, what about the other
operations such as aggregation, so here is -- so what
I'm comparing here is the state-of-the-art PLAT
algorithm -- which is hash-based -- for aggregation, and you
see the throughput depending on the number of
distinct groups for a group operation.
For few groups hashing gets very cache efficient, but
at some point it just gets expensive.
Here we see that sort-merge is still, you know,
depending on the cardinality, it can be significantly
slower, but it is very, very robust. I mean, this is
something that's, I guess, known also from classical
databases.
These are two implementations, one that's just a
plain sort and then aggregate, and the other one does
partial aggregation in between. When you see in
intermediate runs, you have sequences of the same
value, you can aggregate them right away.
>>:
[inaudible]
>> Jens Teubner: PLAT does that to some extent,
yeah.
Okay. With that, let me conclude. So basically I
argued about two things. One was about are
hardware-conscious optimizations worth the effort.
This is what I focused on a little bit on the hashing
side. And our current results indicate that yes, it
is really worth the effort. They're much more robust
than you might think and they're simply faster.
We found that hash join is still the faster
alternative, but certainly sort-merge join is getting
close. What is like hard to evaluate numerically is
that sort-merge has this nice property that it
produces sorted output, which might be useful for
upstream [indiscernible] processing, right? So that
might be another argument for sort-merge joins.
What we're really looking forward to is have a look
at future hardware. Haswell has been announced also
with I think at least gather support, so in the SIMD
registers you could read from multiple memory
locations with one instruction, which, of course,
favors hashing, so we have -- we're going to see
wider SIMD which favors sort-merge, but we might also
see functionality in that direction, and what does
that mean to the relative performance.
With that, I'm -- oops. Okay. With that, my app
crashed. Sorry. Yeah.
[applause]
I wanted to say thank you, in particular to
Charlie, the Ph.D. student who did that. Questions?
I used up almost all the time. Sorry.
>>: Thank you very much.
>>: Thank you.
[applause]