>> Ken Eguro: Good afternoon. Today it's my pleasure to introduce Jens Teubner. He is leading the databases and information systems group at TU Dortmund University in Germany. His research interests include -- well, are really data processing on modern hardware platforms. This includes FPGAs, multi-core platforms and hardware-accelerated networks. So with that I'll keep the introduction short. Jens.

>> Jens Teubner: Yeah, thanks a lot, and thanks for all coming. And, of course, feel free to interrupt me. I've given this talk a little bit of a catchy title: Sort versus Hash. That's one of the long-standing debates in the database community; it's like a pendulum swinging back and forth which of the two is to be preferred as an approach, and I'm going to look at this discussion in a main-memory context. I didn't exactly know how deep the database knowledge is in this audience, so basically, all of the core database operations, be it aggregation and grouping, duplicate elimination, or joins, they basically all boil down to very similar implementations, which can be based on sorting or hashing. In this talk I'm going to focus on joins, so for the most part of the talk I'll talk about joins: we want to implement joins in main memory as efficiently as possible. And again, there are basically two approaches. One of them is to use hashing, so a hash join. Just as a recap, I try to visualize it. You have two input tables, and what you do is you scan one of the two input tables. While scanning that input table, you populate a hash table with all the tuples from R, and then in the second phase of the algorithm, you scan the other relation, and for each tuple probe into that hash table. Hash join is popular because it has nice complexity properties. I wrote here, a little bit fuzzily, O of N, so it's linear in the sum of the sizes of the two input tables, and it's very nice to parallelize. We'll see that in a bit. Alternatively -- and that's what I think databases that run off disks prefer today -- is the so-called sort-merge join. You have your two input tables, and they are in some order. As a first step of the algorithm, you sort both input tables, and once they are sorted, joining all of a sudden becomes trivial: you just merge the two. That's the so-called sort-merge join. It's used a lot, I think, when you deal with external memory, because databases have techniques to sort arbitrary amounts of data quite efficiently. Strictly speaking, its complexity properties are not quite as nice, but we'll see in the end they all boil down to be very similar. Now, I said I'm going to look at both sides in a main-memory context, so the question is which of the two strategies do you want to prefer in a main-memory context. Classically, hash join has always been the preferred choice because -- well, if you take random access memory in its literal meaning, then the random access pattern of the hash doesn't hurt you when you run stuff from main memory, and then you prefer the simplicity and the nice complexity properties. But the question is how does this look on modern hardware, and that has, in the past years, led to quite a bit of discussion in the community about which of the two strategies should be preferred. I think one of the quite well known papers here is one that a team with people from Intel and from Oracle did: they did a study where they wanted to compare the two and they tried to make predictions of what's going to happen in the years to come.
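As a point of reference for the two phases just described, here is a minimal C sketch of such a canonical hash join -- build a chained hash table over R, then probe it with S. It is only an illustration of the structure; the tuple layout, hash function, and callback are placeholder choices, not the implementation evaluated in the talk.

```c
#include <stdint.h>
#include <stdlib.h>

/* Minimal chained hash join sketch: build a hash table over R, then
 * probe it with S.  Tuple layout, hash function, and output handling
 * are placeholders. */

typedef struct { uint64_t key; uint64_t payload; } tuple_t;
typedef struct bucket { tuple_t t; struct bucket *next; } bucket_t;

static inline uint64_t hash_key(uint64_t key, uint64_t mask) {
    return (key * 0x9E3779B97F4A7C15ull) & mask;   /* multiplicative hash */
}

void hash_join(const tuple_t *R, size_t nR, const tuple_t *S, size_t nS,
               void (*emit)(const tuple_t *r, const tuple_t *s)) {
    size_t nbuckets = 1;                 /* directory: next power of two >= nR */
    while (nbuckets < nR) nbuckets <<= 1;
    uint64_t mask = nbuckets - 1;

    bucket_t **dir  = calloc(nbuckets, sizeof(bucket_t *));
    bucket_t  *heap = malloc(nR * sizeof(bucket_t));

    for (size_t i = 0; i < nR; i++) {            /* build phase over R */
        uint64_t h  = hash_key(R[i].key, mask);
        heap[i].t    = R[i];
        heap[i].next = dir[h];
        dir[h]       = &heap[i];
    }
    for (size_t j = 0; j < nS; j++)              /* probe phase with S */
        for (bucket_t *b = dir[hash_key(S[j].key, mask)]; b; b = b->next)
            if (b->t.key == S[j].key)
                emit(&b->t, &S[j]);

    free(heap);
    free(dir);
}
```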
And what those guys found out was that hash join, in their setup, was the method of choice. It was about two times faster than sorting, but they predicted that as hardware emerges, or evolves, it's quite likely that very soon sort-merge is going to overtake hashing in performance. And that is because of the increasing SIMD width, the vector instruction capabilities of modern CPUs. Those tend to be a very good fit for sorting, but not so much for hashing, and that favors sort-merge join. And their projection from about four years ago was that if you have an architecture with 256-bit SIMD registers, then both should be about the same speed, and if after that SIMD vector lengths continue to increase, sort-merge is going to be faster. Now, more recently a group from Wisconsin looked at hash joins only, and they sort of drew conclusions only within the hash join world. We're going to see what exactly this statement here means. What they basically said is that a relatively naive implementation of hash join is superior to implementations that are highly specialized to modern hardware, highly optimized for modern hardware. I listed that also because even more recently there was another paper that did compare sort-merge and hashing. That's a group from TU Munich. They took these results here and did a transitive conclusion: they found that their sort-merge join algorithm is faster than this hash join implementation, and by transitivity, this is also faster than this one, which means sort-merge join is the method of choice. That's kind of what they say.

>>: So this is all assuming that we have similar-sized relations on both sides, or -- because you can make hash join look better if you have a tiny build side, right?

>> Jens Teubner: We're going to see that indeed a little bit, right? And so, in fact, that kind of brings us to what I want to do in this talk. Of course, the comparisons on the previous slide, they were really apples-to-oranges comparisons, right? And particularly these transitive conclusions, right? You choose different -- you prove something on one workload and then draw the transitive conclusion on another workload, so from that you can conclude anything. So --

>>: So the N log N, of course, will eventually, if the N is large enough, dominate any constant factor which you might get from the linear one, so it depends on how big the relations are overall what you're going to get.

>> Jens Teubner: Sorry. I didn't --

>>: Sort-merge is N log N.

>> Jens Teubner: Right.

>>: That will eventually dominate any algorithm, and in terms of its total cost it would be larger.

>> Jens Teubner: Yes, but it turns out that efficient hash join implementations typically also -- we'll see that you do multi-pass hash joins and then you end up with a log N factor again. So it's not really clear whether that complexity argument really makes sense. Okay. So what I'm going to do is I'll do apples-to-apples comparisons, I'll sketch a little bit the implementation details that we found out to matter, and in the end I'll try to draw some conclusions about where we are in terms of the sort-versus-hash discussion. So the agenda for today is I'll first look at hash joins in more detail, then the sort-merge joins, and then I'll compare the two, yeah. Okay. So again, hash join is nice because it's so nice to parallelize, so what we're going to look at is, of course -- we look at modern hardware, so that includes multi-core parallelism.
>> Jens Teubner: Of course, this is just a slide to show that hash join is basically trivial to parallelize. You just chunk up both input relations into partitions that you assign to individual cores, and then you can do the hash table build and the probe in parallel. Of course, still one after the other, but then you can use all your cores to do the join build and the join probe. This in principle works very well. You do need a locking mechanism here, because in the build phase the hash table is accessed concurrently, but you benefit here from the -- I mean, the cardinalities are easily in the millions, right? And then the chance of getting a collision on that hash table is basically negligible. So this parallelizes very well. You won't see any lock contention in practice, just because there are so many hash table buckets in practice. So this is like the basic hash join implementation that we're going to look at first, and this is what this group from Wisconsin claims to be the preferred one, because it's fast and because it's simple. Now, if we first look at -- so what we did was we took their implementation, analyzed it carefully, and tried to see whether the comparisons that they make are actually valid, whether they make sense or whether they're doing apples-and-oranges comparisons. And for that, what we first did was we tried to see whether there's any kind of situation where we can improve their code. And I have one example here just to show you that you have to be very careful when you implement such an algorithm; otherwise you're going to suffer quite significantly from effects in modern hardware. One of these examples is -- so what I'm showing here is the hash table implementation that the people from Wisconsin used. In order to implement that hash table, you need some locking mechanism, so you have to -- they implemented that as a separate latch array, and then the hash table consists of a hash directory that points into individual buckets that are allocated from a heap. Now, when you do any operation on that hash table, what it means in the end is you need to access three different locations in memory. You need to first get a latch, then you need to follow the pointer, and then read out your data, which amounts to three memory accesses per tuple; and, of course, you can also do that with just a single memory access per tuple by simply collapsing that into a single data structure. And why that is important you see on this slide here. When you start analyzing the performance of that implementation on modern hardware, what you see is that the amount of work that you have to do per input tuple is actually extremely small. Looking at the hash table build phase here, what you need to spend is, per tuple, about 34 assembly instructions, and now it makes a real difference whether during those 34 assembly instructions you see one or three cache misses. In practice we saw something around 1.5 cache misses. That was because -- well, it depends on how you cache-line-align your data. You can also bring that down to one if you want to, and we saw TLB misses quite a lot.

>>: So these numbers are [indiscernible] modified --

>> Jens Teubner: Exactly, yeah. These are for this -- so here you see three misses. Here you see 1.5. The 1.5 comes from -- so if an aligned address starts here, then this is the first row, and the next one will come right after here and it will span over two cache lines, and then you get two misses for just a single read.
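To make the data-layout point concrete, the following C sketch contrasts the two bucket organizations being discussed: one with a separate latch array, hash directory, and heap-allocated buckets (three distinct memory locations per hash table operation), and one where the latch, a counter, and a few tuple slots are collapsed into a single cache-line-sized bucket (one location per operation). The struct names and field sizes are illustrative, not the exact layouts from the paper.

```c
#include <stdint.h>

typedef struct { uint64_t key; uint64_t payload; } tuple_t;

/* Layout (a): separate latch array, hash directory, and heap-allocated
 * bucket chains -- one operation touches latch, directory slot, and
 * bucket, i.e. three different memory locations. */
typedef struct chain { tuple_t tuples[2]; struct chain *next; } chain_t;

struct hashtable_a {
    volatile char *latches;    /* latches[i] protects bucket i      */
    chain_t      **directory;  /* directory[i] -> bucket chain      */
};

/* Layout (b): latch, count, and a few in-place tuple slots collapsed
 * into one cache-line-sized bucket, so build and probe touch a single
 * location in the common case. */
struct bucket_b {
    volatile char latch;
    uint8_t       count;          /* tuples currently stored in-place  */
    tuple_t       tuples[3];
    struct bucket_b *overflow;    /* only used when the bucket fills   */
} __attribute__((aligned(64)));   /* keep each bucket on one cache line */
```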
>> Jens Teubner: And then you can do a tradeoff. We played around a little bit with that. You can -- well, we wanted to keep the number of tuples per bucket consistent; otherwise you introduce another parameter which, you know, you can vary arbitrarily. You can blow it up a little bit so you spend more memory, which costs you more cache occupation on the one hand, but with aligned accesses you can reduce the number of cache misses per tuple -- but performance-wise it didn't make a huge difference. Okay. So either way, what you end up with is in the order of 30 CPU cycles of work to do, and at the same time you need to do -- well, 1.5 memory accesses plus a bunch of TLB misses that you'll suffer. So here you're easily talking about 200 cycles or so that you need, right? So in the end, you're going to be severely latency-bound, right? All the cost you're measuring is those cache misses. And that's why we think that such an implementation on modern hardware won't make sense. People have realized that earlier. One way to avoid these many cache misses is to choose a partitioned hash join instead. So the idea there is that before you build up your hash table, you break up your problem into sub-problems: you use hash partitioning to break up your input relation into cache-sized chunks. You do that for both input tables, huh? And then you basically need to only join corresponding partitions, and if those are small enough, then you can keep all the processing for one such pair of partitions inside the cache and you won't see any cache misses anymore here, huh? So once you do that, each of those joins basically runs out of the caches and you will see basically no cache misses anymore. Again, you have to be careful when you implement that. What the Wisconsin guys did is they took this very literally. They partitioned, built all the hash tables, and then probed into all the hash tables, huh? What this means is that you build up this hash table, you build up all those other hash tables, and eventually you're done with all your hash tables. And then you go back up here, huh? And by the time you want to access that hash table, it's, of course, long since gone from your cache, right? If instead you interleave the build and probe phases, you can reach a situation where you build up that hash table, it sits in the cache, then you probe from the other side, huh? You use the data that's readily in the cache, and then you move on to the next partition. And again, you can save a lot of cache misses that way.

>>: So is this [inaudible] so you have both relations [indiscernible]?

>> Jens Teubner: Yes, yes, yes. I mean, throughout -- I assume everything's in main memory, yeah, yeah. Yeah, yeah. Yeah. I didn't say that explicitly enough. So what I'm assuming is I have everything in main memory. We also assume a pretty much column-oriented model, so we assume very narrow tuples -- you saw that here. In that picture, we assume 16-byte tuples only, eight --

>>: But if one of the sides is really another join, then do you get streaming there?

>> Jens Teubner: You get what?

>>: One of the sides -- you can have multiple joins going on, right?

>> Jens Teubner: Right.

>>: You can have R join S join T.

>> Jens Teubner: Right.

>>: The R join S feeding into the top join, you have to keep the result of that also in memory, right, is what you're saying?

>> Jens Teubner: Well, but a join is blocking anyway.
I mean, you anyway have to --

>>: But what I'm saying is it's good enough to have R and S in memory. Now you have R join S also in memory. To compute R join S joined --

>> Jens Teubner: Yeah, but you need to -- I mean, either way, when you build up the hash table over R join S, you need to keep that in memory anyway, yeah.

>>: But [inaudible] extent, you would only keep one of the sides in memory?

>> Jens Teubner: Okay. Okay. Yeah. Okay.

>>: [inaudible] space --

>> Jens Teubner: Okay. Yeah. I mean, this is really assuming, you know, memory is cheap, huh? Just have everything in memory, huh? Yeah. Which I think, you know, is increasingly reasonable, I mean, huh? I mean, you guys are working on in-memory processing as well, right? So. Okay. Now, once we do that, we indeed basically avoided almost all cache misses, yeah? You still have some -- I mean, at some point you still have to bring stuff into the cache, so you do see a bunch of cache misses, but not many. And at the same time, the nice thing about this partitioned hash join is that after partitioning, you end up with a lot of independent work, so there's no need for a shared hash table anymore. You can just do the R1 join S1, R2 join S2 -- you just do those on separate cores, and there's no need to synchronize through locks anymore.

>>: Can you explain a little bit more why it's [inaudible] -- do you sort them or something or, like, how are partitions created?

>> Jens Teubner: Well, you can only find matches between here and here, but not between here and here, okay?

>>: In what way?

>> Jens Teubner: Because you hash-partitioned here.

>>: [inaudible] like a regular --

>> Jens Teubner: So you use your key value as a partitioning function; all tuples with the same key will go into the same partition, and you use the same partitioning key on both sides, okay? So you will only find matches between, you know -- how can I say that? -- corresponding partitions here. And then -- yeah, you can run this join on one core and this join on another core and they won't interfere at all with each other. No need to protect with locks or anything. And that makes your code even simpler, so we end up with 15 to 21 instructions per tuple for the build and probe phases. Almost no cache misses anymore, so this join -- I mean, we've reduced quite a bit of the cost, yeah.

>>: So can you go back to the [inaudible] -- I'm still not really -- I don't know exactly what the difference is between the picture here and the picture you displayed. Is it that each of your, like, rows is now big enough so that I don't need to follow [inaudible] into a bucket? Is that the main difference, or am I lost?

>> Jens Teubner: Well, you -- because you partitioned this stuff here, these hash tables are now hash tables over just a fraction of the input. In practice you don't do just four partitions. You create thousands of partitions, okay?

>>: How is each partition subject to [indiscernible] if I put the partition as a [indiscernible] I would be in the same [indiscernible] instruction.

>> Jens Teubner: Oh, this is just a part of your table. It's just, you know --

>>: So R4 plus S4 is [indiscernible] --

>> Jens Teubner: Yes, it's actually enough -- it's enough to have the hash table in cache, so R4 has to fit in the cache, assuming that the hash table is approximately the same size as the table, right? The hash table has to fit into the cache. That's enough, right? So you break up this input relation into chunks that are cache-sized.
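A self-contained C sketch of this partition-then-join idea, under the assumptions of the talk (narrow 16-byte tuples, a radix-style hash of the key, and partitions small enough that the per-partition hash table stays cache-resident); the fan-out, hash constants, and helper structure are illustrative only. Note that the in-cache hash table uses different hash bits than the partitioning step -- otherwise every tuple of a partition would land in the same bucket. Because the per-partition joins share no state, the final loop can be handed out to the cores with no locking at all.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { uint64_t key; uint64_t payload; } tuple_t;

#define RDX_BITS 10                       /* 2^10 = 1024 partitions */
#define FANOUT   (1u << RDX_BITS)

static inline uint32_t part_of(uint64_t key) {  /* high bits of a hash */
    return (uint32_t)((key * 0x9E3779B97F4A7C15ull) >> (64 - RDX_BITS));
}

/* scatter `in` into FANOUT contiguous partitions inside `out`;
 * off[p] .. off[p+1] delimit partition p afterwards */
static void partition(const tuple_t *in, size_t n, tuple_t *out, size_t *off) {
    size_t hist[FANOUT] = {0};
    for (size_t i = 0; i < n; i++) hist[part_of(in[i].key)]++;
    off[0] = 0;
    for (size_t p = 0; p < FANOUT; p++) off[p + 1] = off[p] + hist[p];
    size_t pos[FANOUT];
    memcpy(pos, off, FANOUT * sizeof(size_t));
    for (size_t i = 0; i < n; i++) out[pos[part_of(in[i].key)]++] = in[i];
}

/* join one pair of partitions with a tiny chained hash table that is
 * expected to stay cache-resident (uses different hash bits than part_of) */
static size_t join_pair(const tuple_t *r, size_t nr,
                        const tuple_t *s, size_t ns) {
    size_t nb = 16; while (nb < nr) nb <<= 1;
    int *head = malloc(nb * sizeof(int));
    int *next = malloc(nr * sizeof(int));
    memset(head, -1, nb * sizeof(int));
    for (size_t i = 0; i < nr; i++) {                 /* build */
        size_t h = (r[i].key * 0x2545F4914F6CDD1Dull) & (nb - 1);
        next[i] = head[h]; head[h] = (int)i;
    }
    size_t matches = 0;
    for (size_t j = 0; j < ns; j++) {                 /* probe */
        size_t h = (s[j].key * 0x2545F4914F6CDD1Dull) & (nb - 1);
        for (int i = head[h]; i != -1; i = next[i])
            if (r[i].key == s[j].key) matches++;
    }
    free(head); free(next);
    return matches;
}

size_t partitioned_hash_join(const tuple_t *R, size_t nR,
                             const tuple_t *S, size_t nS) {
    tuple_t *Rp = malloc(nR * sizeof(tuple_t)), *Sp = malloc(nS * sizeof(tuple_t));
    size_t offR[FANOUT + 1], offS[FANOUT + 1];
    partition(R, nR, Rp, offR);          /* same hash on both inputs,   */
    partition(S, nS, Sp, offS);          /* so matches stay pair-local  */
    size_t matches = 0;
    for (size_t p = 0; p < FANOUT; p++)  /* pairs are independent:      */
        matches += join_pair(Rp + offR[p], offR[p + 1] - offR[p],  /* no */
                             Sp + offS[p], offS[p + 1] - offS[p]); /* locks */
    free(Rp); free(Sp);
    return matches;
}
```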
>> Jens Teubner: Suppose you have -- I don't know -- 400 megabytes here, you have a four-megabyte cache, you create 100 partitions, and then for each of the partitions the hash table will fit in the cache, and then you can run the hash join from the cache.

>>: You don't get any misses, TLB misses anywhere, so the next time you said there's 0.0 --

>> Jens Teubner: Well, it's basically -- of course, you always do get some, right? There's always some compulsory misses, right? But these are, you know --

>>: How do you avoid this in the probe?

>> Jens Teubner: During probe, everything's in the cache.

>>: But the assumption -- I mean, after your hash, basically you have pointers with random addresses, right? After you hash, you say that these pointers are sequential.

>> Jens Teubner: Yeah, yeah, but they're all -- you know, this is small. This fits into the cache, and once stuff is in the cache, then that's what the cache is for, right? Once you keep your -- all your -- I mean, on both sides, once you keep that contained in the cache, then you won't suffer any cache misses anymore, and the TLB also is enough to cover the entire hash table if the hash table is small enough, yeah.

>>: I think it's [indiscernible] but the partitioning [indiscernible] --

>> Jens Teubner: Exactly, yeah, yeah.

>>: [inaudible]

>> Jens Teubner: That's what's coming next. Yeah. That's what's coming next, yeah. So what I did not yet tell you is that partitioning itself introduces a new cost, and that's where the TLB indeed comes in. If you do partitioning, then it depends on the number of partitions that you want to create concurrently. Partitions are roughly cache-sized, so it's safe to assume that one partition is larger than a memory page. A memory page is just four kilobytes, so in the end, each partition will reside on a separate memory page, so if you want to do partitioning, you need one TLB entry for each of the partitions in order to write into those partitions. And we have some experiments for that. So when you do partitioning -- these are assuming just a single partitioning thread -- you vary the number of partitions that you want to create, so this is the logarithm of the number of partitions, so you create two to the power of four up to two to the power of 16 partitions. And what you can see here is that for a small number of partitions, partitioning can be done very, very efficiently, about 100 million tuples per second. That's because partitioning is -- conceptually it's very little work. You just compute a hash value over the key, that tells you where to put the tuple, you write it there and then you're done. It's not much work. But once you do --

>>: But it does have cache-miss properties too, of course.

>> Jens Teubner: Yeah, the cache misses are not really problematic, because you're talking here about, let's say, two to the power of 14 addresses where you're concurrently writing to, so you need about two to the power of 14 cache lines, right? And that's not much data. That's -- it's a few kilobytes, because you write into each partition sequentially, right? It's like having -- whatever -- if you would do that in a file system --

>>: Why are you assuming that you write into the partitions [indiscernible]? I see how you could write into the partition sequentially, but the actual process of partitioning is not sequential; that is to say --

>> Jens Teubner: Yeah.
>>: -- as you process a tuple --

>> Jens Teubner: Yeah, yeah, but for each partition, you only need to keep like the next position in memory, and those next positions, those aren't many. That's not much data in total. So you can keep that in a cache, right? Like for each partition, the slot for the next tuple to insert -- that's just two to the power of 14, say, slots. That's a few kilobytes of space, right? You know, like in a classical database, you can create partitions over terabytes of data with just a very small buffer pool, because all you need to keep in the buffer pool is just the next page to write out for that partition, but you don't need to keep the full partition or something in the cache. Does that make sense? You're skeptical still?

>>: Well, the total number of misses is the total size of however many [indiscernible] you have divided by the number of entries that fit on a single cache [inaudible], effectively.

>> Jens Teubner: Yeah. But -- yeah. But, you know, the -- how can I say that? You're writing into each partition sequentially, so very likely, for each of those partitions, the area around your current writing position, you have that in your cache, so you don't need to -- and then writing there gets quite efficient, because you just need to write into the cache, and then if that cache line is full, it's being written out to memory, but you don't even have to wait for that. You can just continue -- it's write-only.

>>: But the total number of entries divided by however many [inaudible] --

>> Jens Teubner: Right.

>>: Fill, fill, fill, fill, now it's full. Now I've got to chunk them out. Okay.

>> Jens Teubner: Right. But that's different from when you -- during a hash table build or probe, where you randomly put individual tuples in there, you typically have to bring in the cache line, do the update, and write out the cache line again, because you don't have this sequential pattern as you have it in partitioning. I don't know if I'm making sense. Okay. Now, the problem is, of course, we want to create many partitions, because we want those partitions to be small so they fit into the cache. The smaller the better, so we can bring them in all tour [phonetic] or something. In practice, in our experience -- for the data sets we used, you would like to be around here, so two to the power of 14 partitions; for the workload configurations that we had, this was a good choice. But, you know, in this picture it doesn't matter much whether, you know -- Now, what can we do about this? And the classic way of doing that is multi-pass partitioning, also called radix partitioning. It's basically an idea that's been used also in disk-based systems. You're limited by the number of partitions that you can write concurrently. So if you can write at most, say, 100 partitions concurrently, you can just do multiple passes and refine your partitions again and again, and this way you never exceed your maximum number of partitions per partitioning pass and still end up with as many partitions as you want in the end. And that's where the log N factor gets into hash join, right, because now you have to start partitioning in multiple passes, and the number of passes is logarithmic in the amount of data that you have.

>>: In some respects it's all partitioning, the last step included?

>> Jens Teubner: Yes, yes. It's true. It's true.
Yeah, it's basically the same operation, hash table building, and it's true. Absolutely.

>>: You can terminate it anytime you figure your bucket is small enough.

>> Jens Teubner: It's true, yes, yes. Yes. It's true. Now, if you do that, it turns out the cost of an additional pass is additional CPU work that you do, but it saves you all those TLB misses, and as you can see here, for the area we're interested in, it gives a significant advantage in terms of the partitioning cost, right? You can do even better. The trick here is that -- so this is how you implement partitioning easily. You take the key, hash it, and then just write the tuple where it belongs. And then this is always a write, right? And for that you need a TLB entry for writing that element out to memory. What you can do instead is create a little buffer in between. I've written this as an algorithm here. So you would hash your key, not write it out directly to memory, but first fill up an in-cache buffer, and only when that in-cache buffer is filled up, then you flush the entire buffer slot out to your partition. So for each partition, you have a little buffer, you fill up the buffer, and only when the buffer is full, you flush it out to the partition. What this means is that if you can have something like eight tuples in each buffer slot, you get a memory operation -- an operation that goes all the way to memory and needs a TLB entry -- only every eighth tuple, so you can push down the number of TLB misses, and once you have sufficiently few TLB misses, then they won't cause any harm anymore, because then all the out-of-order mechanisms in your CPU will be able to hide the latency that you get from those TLB misses, right? You can just keep processing while the hardware does the TLB miss for you. In practice you choose your buffers to match the cache line size, because then you can do those transfers particularly efficiently.

>>: The effectiveness of this buffer mechanism is only as good as your hashing function, right? Because if your hashing function is [indiscernible] all over, you might wait too long and waste lots of memory.

>> Jens Teubner: Yes.

>>: So was there any correlation, anything between your actual data and the hashing function, or --

>> Jens Teubner: Well, you know, we don't have -- as academics, we don't have real-world data, so we were dependent on synthetic data anyway, right? Once you operate on synthetic data, then, you know, you can use anything for a hash function, because, you know -- yeah, you don't have that nasty type of correlation in your data. So we kind of -- I cannot really --

>>: It's true. I mean, it depends on whether you're dealing with distinct values or not.

>> Jens Teubner: So what we did do --

>>: If you're not dealing with distinct values, then you have all sorts of distributions.

>> Jens Teubner: Yeah. What we did do -- but I claim that's kind of independent of the hash function, right? If you have duplicate values in your data, yes. We -- I didn't bring experiments with me, but we did measurements with skewed data, but still, then, you know, you don't have funny correlations between the data that would affect the hash tables, so -- how can I say that?

>>: [inaudible] you have sequential keys and your hash is like CRC32 or something. Each load will go to a totally different place. So this buffer won't be usable in some sense. So --

>>: Oh, because the radix that you use for the hash depends on how many you can fit.
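The software-managed buffer idea just described might look roughly like the following C sketch: tuples are staged in a small per-partition buffer that stays in cache, and only a full, cache-line-sized slot is flushed to the partition's memory area, so the write that needs a TLB entry for the page-sized partition happens only once per several tuples. The constants (here four 16-byte tuples per 64-byte line, rather than the eight tuples per slot mentioned in the talk) and helper names are assumptions for illustration; a real implementation would also use streaming (non-temporal) stores for the flush.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t key; uint64_t payload; } tuple_t;

#define RDX_BITS   10
#define FANOUT     (1u << RDX_BITS)
#define BUF_TUPLES 4                         /* 64 bytes / 16-byte tuples */

typedef struct {
    tuple_t  slot[BUF_TUPLES];               /* in-cache staging area     */
    uint32_t fill;                           /* tuples currently staged   */
    tuple_t *out;                            /* next free position inside */
} __attribute__((aligned(64))) swb_t;        /* the partition itself      */

static inline uint32_t part_of(uint64_t key) {
    return (uint32_t)((key * 0x9E3779B97F4A7C15ull) >> (64 - RDX_BITS));
}

/* dst[p] must point to the start of partition p's pre-sized output area
 * (sized from a prior histogram pass).  In a parallel partitioner each
 * thread would own its own buffer array instead of this static one.    */
void partition_swb(const tuple_t *in, size_t n, tuple_t *dst[]) {
    static swb_t buf[FANOUT];                /* per-partition staging buffers */
    for (uint32_t p = 0; p < FANOUT; p++) {
        buf[p].fill = 0;
        buf[p].out  = dst[p];
    }
    for (size_t i = 0; i < n; i++) {
        uint32_t p = part_of(in[i].key);
        buf[p].slot[buf[p].fill++] = in[i];  /* cache-resident write        */
        if (buf[p].fill == BUF_TUPLES) {     /* flush one full line: this   */
            memcpy(buf[p].out, buf[p].slot,  /* is the only access that     */
                   sizeof(buf[p].slot));     /* touches the partition page  */
            buf[p].out += BUF_TUPLES;
            buf[p].fill = 0;
        }
    }
    for (uint32_t p = 0; p < FANOUT; p++)    /* drain partially filled slots */
        for (uint32_t k = 0; k < buf[p].fill; k++)
            *buf[p].out++ = buf[p].slot[k];
}
```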
>>: So I'm trying to understand: was there any knowledge of the incoming data to make the hashing function friendly enough to leverage the buffer, or --

>>: The buffer --

>> Jens Teubner: No, because we -- the moment you generate synthetic data, I claim every hash function is friendly enough.

>>: Okay.

>> Jens Teubner: Unless you build very funny -- I mean, very artificial distributions where, you know, if you -- so we used Zipf-distributed data, but we permuted our alphabet beforehand. Otherwise, you do get very strange effects, but they don't reflect reality either.

>>: Okay.

>> Jens Teubner: Yeah.

>>: Again, if you have nondistinct keys and they're distributed in some way so that you get 50 percent of the keys are all the same or 25 percent of the keys are all the same, you're going to have very bad skew --

>> Jens Teubner: Yes, yes, yes, yes. We did do measurements. I didn't bring them with me, but I can say a few words once we -- so actually, we'll soon see a bunch of experiments, right. So once you do this, it turns out that with a single pass you can actually, over quite an interesting range of partitioning fan-out, do even better than with radix partitioning, which I think up until now has always been considered sort of the strategy of choice. So putting this together, what I did here is I show some measurements that we did comparing this -- what the Wisconsin people call the no-partitioning join, that is, a hardware-oblivious join, not considering any caching effects or anything -- versus this radix join, which does the partitioning and all that stuff. What you see is that partitioning, indeed, is responsible for a large share of the cost, whereas once you have partitioned, doing the actual join comes at negligible cost. So this is the number of CPU cycles per output tuple of your join. And, yeah, in contrast to that, because of the many cache misses, of course, you lose a lot of time in the build and probe phases of the no-partitioning algorithm. This is a benchmark configuration that we took from the Wisconsin paper, and to give you a feeling, on our Nehalem machine that we use -- I think it has four cores -- you get in the order of 100 million tuples per second in join throughput.

>>: Do you have another chart which shows the [inaudible] cache, I mean, the third-level cache size for each? I was wondering if there was a correlation between the cache size and the [indiscernible].

>> Jens Teubner: So -- I don't remember, no.

>>: [inaudible]

>> Jens Teubner: Yeah.

>>: Can't remember the numbers [indiscernible], so I'm just wondering if there's a correlation between cache size, this algorithm, and these --

>> Jens Teubner: Well, this difference here comes from the Sandy Bridge having many more cores. So this is a parallel implementation, and then -- basically this is a measure of wall-clock time, kind of, so if you have more cores, you need fewer cycles per tuple.

>>: My assumption was the number of cores used was fixed, and you just measured --

>> Jens Teubner: No, no, no, we -- no. Yeah. So basically, in all those machines we used one socket, but used that, I mean, fully.

>>: Okay.

>> Jens Teubner: And there was this question earlier about the impact of -- first of all, to show you that, you know, we work carefully: we really tried hard to make both implementations as fast as we could.
>> Jens Teubner: We think we can show that by showing that our implementation is roughly a factor of three faster in both cases, approximately, as compared to that existing paper. Now, what you see here is that -- well, the Wisconsin guys, based on a picture similar to this one -- actually, they had just those two sides -- they argued that the hardware-oblivious implementation is still preferable over a hardware-optimized implementation, because depending on the platform, chances are higher to pay a penalty here as compared to the little benefit you get here. Now, once you go to a different workload -- so this is the workload that has earlier been used in the literature -- all of a sudden the game changes quite a bit. So these are two equal-sized relations, and the hardware-conscious implementation clearly has an edge over the hardware-oblivious implementation. The hardware-conscious implementation also scales very nicely, so this is the Sandy Bridge machine with eight cores on one socket, and if you start scaling that up on a four-socket machine, you can reach a join throughput of about 600 million tuples per second. There was the question previously about skewed input data. It turns out that the hardware-optimized version, this radix partitioning, turns out to be much more robust towards skewed distributions, so you get about equal performance over quite a wide range of skew values. The hardware-oblivious implementation, in contrast, is quite skew-sensitive. There you typically get an advantage from skew, because your caches get more effective, because you hit the same cache line more often. Okay. So our conclusion from that is that hardware-oblivious algorithms, you know, if you choose the right benchmark, they do look good, but -- so we have an ICDE paper on that. There are some more experiments in there, and those all indicate that with a hardware-conscious algorithm, you can typically do much better. And interestingly, we found that these hardware-conscious implementations are extremely robust also to the parameters that you choose, so there are a few parameters in there, like cache sizes and number of partitions and all that stuff, yeah? And of course, getting those parameters right might affect the performance that you see, but it turns out the algorithms are actually quite robust toward that, so a slight missetting of those parameters won't kill your performance at all. It doesn't make much of a difference. Okay. Let's then have a look at sort-merge joins. For sort-merge joins, essentially all the cost boils down to sorting. Once the data is sorted, doing the join becomes almost trivial. And the sorting part you typically implement as merge sort -- sorry for the name confusion, yeah? That is, you first create some presorted input runs and then you merge them recursively until you end up with sorted data. I try to visualize how this looks in practice on modern hardware. If you have a large system, it's going to have a NUMA memory architecture, so here on the graph I assume you have four NUMA regions. One way of doing the sorting then is -- so you assume the input data is somehow distributed over all your NUMA regions. Then you typically first do a local sort within each NUMA region to avoid memory traffic across the NUMA regions.
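For orientation, a compact C sketch of the overall sort-merge join structure: sort both inputs, then join them in a single merge pass that also handles duplicate keys. qsort() merely stands in for the NUMA-aware, SIMD-accelerated merge sort discussed here; the merge-join step is the part that becomes trivial once the inputs are sorted.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint64_t key; uint64_t payload; } tuple_t;

static int cmp_key(const void *a, const void *b) {
    uint64_t ka = ((const tuple_t *)a)->key, kb = ((const tuple_t *)b)->key;
    return (ka > kb) - (ka < kb);
}

size_t sort_merge_join(tuple_t *R, size_t nR, tuple_t *S, size_t nS) {
    qsort(R, nR, sizeof(tuple_t), cmp_key);   /* "sort" phase (stand-in)  */
    qsort(S, nS, sizeof(tuple_t), cmp_key);

    size_t i = 0, j = 0, matches = 0;         /* "merge" phase            */
    while (i < nR && j < nS) {
        if (R[i].key < S[j].key)      i++;
        else if (R[i].key > S[j].key) j++;
        else {
            /* equal keys: pair up the full runs of duplicates */
            size_t i2 = i, j2 = j;
            while (i2 < nR && R[i2].key == R[i].key) i2++;
            while (j2 < nS && S[j2].key == S[j].key) j2++;
            matches += (i2 - i) * (j2 - j);
            i = i2; j = j2;
        }
    }
    return matches;
}
```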
>> Jens Teubner: Then if you want to obtain a globally sorted result, which I'm assuming here, what you need to do is merge across those NUMA regions, as shown here. So you do the two-way merge here and another two-way merge -- well, two two-way merges here and then another two-way merge here until you end up with the sorted result.

>>: But you think that you have [inaudible], right?

>> Jens Teubner: Pardon?

>>: The penalty of local [indiscernible] NUMA, it's being paid in a second step anyways?

>> Jens Teubner: Yes, yes, yes, yes. But, you know, you want to avoid, you know, moving data back and forth for nothing, right? So here at least, clearly -- you know, if that is the result that you want to have, there's no way around getting this data over here. So you need to cross that NUMA boundary somehow, if that is what you want as a result.

>>: So you don't get everything completely sorted, though, right? You can have --

>> Jens Teubner: Yeah, we'll get to that. That's true, yeah. Okay. We'll get to that. Now, I was already mentioning here the bandwidth that you need to do the merging, right? And in fact, the merging is a pretty critical part of that sort operation. For both parts, for run generation and for merging, what you can do is implement them quite efficiently using SIMD acceleration based on sorting networks, which is where this Intel work said that in the future this run generation and merging is going to become even more efficient if you have wider SIMD registers. Just a comment here: you know, like in the hash join, where partitioning was similar to the hash table build-up, here you have a similar thing, in that merging for merge sort is a very similar operation to the join afterward. There is a difference in the sense that it's tricky for that part to use SIMD, because your tuples look different and that makes it tricky to use SIMD, but other than that, it's a very similar task. So what we're now interested in is mostly the merging part. Here I have some numbers for both parts, for run generation and for merging in the merge sort sense, yeah. If you implement this well using SIMD operations, assuming you create initial runs with four items each, you can do this with 26 assembly instructions, and with these 26 assembly instructions, you can sort four sets of data. And then assuming four-byte keys and four-byte values, you end up with reading in 64 bytes and writing out 64 bytes. So in total there is a ratio of 26 assembly instructions to 128 bytes moved to and from memory. Similarly for the merge pass, so you can do merging also with SIMD. There, for every four tuples, you spend 16 SIMD operations and four load/store instructions. What that means in the end is that you do a lot of memory I/O compared to the very few assembly instructions that you're doing for the sorting in between. So in practice, sort-merge join gets severely bandwidth-bound, so you're going to be bound by all this memory traffic. And I'll get to -- there was this question about global and local sorting. We'll see in a bit how this problem gets even worse when you do it wrong. Yeah.

>>: Can you even do it using hash for sorting? Like, we're in the hash --

>>: Like [indiscernible] partition?

>>: And the partitioning, and then you have the sort [inaudible] on the hash [indiscernible] bucket.

>> Jens Teubner: Well, that's radix sort, effectively, right? We didn't try radix sort.
There is -- I haven't, like, really verified that yet. You can do quite well with radix sort as well. I don't have, like, an apples-to-apples comparison between the two. The group at Columbia is currently, like, proposing that.

>>: It will be the same as creating the hash tables as we did in the -- doing it in our previous hash [indiscernible] and then look up and [indiscernible].

>> Jens Teubner: Ah-ha. You mean -- okay. Yeah. No, we didn't try that. We didn't try that, yeah. So that would be a combination of hash join and sort-merge join. No, we didn't try that. That's interesting. It's a good idea. Yeah. Okay. So we end up with both phases actually being severely bandwidth-bound. Now, what can we do about that? Things are still easy for the run generation part: I said you need to read in and write out these 64-byte chunks. In practice, that's not really a problem, because what you can do is -- you have this merge tree and you take entire subtrees from the bottom, do them in one go, and then you have no real memory traffic. So you can do chunks of up to the size of your cache. You can sort them entirely in cache, and then memory bandwidth is not an issue as long as you're in cache. But for the continued merging, when you want to have your entire thing sorted, you don't get around doing something about the high bandwidth need of merging. Now, one thing you can do is -- rather than just doing two-way merging, which is what makes your bandwidth demand so high, because you need to do the merge repeatedly, so you repeatedly need to basically reread the same data. But at the same time, two-way merging is highly CPU-efficient, because you can optimize it so well with SIMD and so on. So it turns out the bandwidth cost is so high that it makes sense to give up a little bit of the CPU efficiency by doing what we call multi-way merging. So what we did was an N-way merging operation, which internally is broken down into a number of two-way merges, because the two-way merge is exactly the thing that you can do efficiently with SIMD, right? And now a single thread basically walks over all these merging pairs, merges a little bit here, merges a little bit there, as long as there's data there, right? There are FIFOs in between, little buffers, and then this thread walks over this merging tree and tries to produce output, right? And the idea is to keep all these intermediate merging results, these streams, within your last-level cache and have that thread again walk over all these merges. Yeah.

>>: Are we trying to merge all the sorted [indiscernible] that we have to create a bigger run?

>> Jens Teubner: Yes, yes, yes, yes.

>>: So why do you need to do that?

>> Jens Teubner: Well, yes. I think two slides on or so -- I think that kind of clarifies the question. Yeah.

>>: Can you explain a little bit how this algorithm works? Because if I see that it's got [indiscernible] -- node zero will give me a buffer, even though it will give me a buffer which is bigger than this one. Two, one, three, same thing, right? So this transition from one to the other NUMA with the same thread, you're paying lots of cost for memory locality, which is -- I mean, effectively it wouldn't be as bad as the latency issue.

>> Jens Teubner: Well, okay. First of all, again, as I said previously, you cannot fully avoid crossing the NUMA boundaries. So this thread sits on one NUMA region, right?
But at some point you have to merge in data from the remote NUMA regions. And by the way, this doesn't necessarily have to come from separate NUMA regions. I just wrote it here. The point is that you merge in from multiple sources. Now, yeah. The thing here is that -- yeah, you save overall bandwidth, because originally you would have written out every -- here, the merged results, you would have written those out to memory, so you save overall bandwidth. What you pay is that you need a much more sophisticated algorithm that walks over these binary merges. You have buffers in between. You need to keep track of the fill level of all of these buffers; that causes quite a significant CPU cost, but it saves bandwidth, and since we're bandwidth-bound, there's a net saving in the end. Okay.

>>: My only concern with this: this seems to be kind of blocking. I mean, while we're waiting on buffers from other NUMA nodes, the other NUMA nodes are blocked basically, nobody's taking the buffers. Right? I mean, if you think of this and I'm trying to scale it to multiple threads, how can you ensure that both threads are busy doing something useful without blocking each other?

>> Jens Teubner: Well, you're not -- well, blocking is probably not the right word. I mean, I agree, while you're doing stuff here, there's no -- yeah, right. But you have that when you just do binary merging as well, right? You first do the merging here and then you do the merging here. And on top of that, I think the problem is not as bad as it might sound, because you start reading here from that NUMA region, and reading from that, and then the hardware prefetcher is going to see your sequential read and will start prefetching, right? So when you move over here, the prefetcher is still going to prefetch here and so --

>>: Once you move there, you're moving to a remote node, which means --

>> Jens Teubner: No, no. The thread always stays on the same --

>>: You're getting data from a remote node.

>> Jens Teubner: Yeah, but that doesn't make the difference, right?

>>: This has latency. I mean, while maybe --

>> Jens Teubner: Yeah, but that latency is at least partially hidden by prefetching. You merge in -- I don't know -- 1,000 tuples from here and from here, and then you go over here. But the prefetcher is seeing, ah, he's just read 1,000 sequential tuples, so let's just prefetch, huh? And then you do the merging here, and by the time you come back, at least parts of that data have already been prefetched, so you don't suffer the full latency. You see what I'm saying?

>>: My point is local versus remote, right? Accessing local memory versus accessing remote memory, there's usually a three to four X ratio. I mean, instead of doing [indiscernible] it accesses memory from [indiscernible] three, it's got to be three or four times slower than accessing memory on the local NUMA. That's what I'm saying.

>> Jens Teubner: Yeah, I don't see -- I don't see your point, really.

>>: So you're assuming that these machines support prefetching, so that you can prefetch something and then go off and do something else while you wait for it to get loaded?

>> Jens Teubner: At least to some extent that's what happens. Of course, it doesn't magically do everything, right? But --

>>: I'm trying to decide what the difference is between [indiscernible] the fan-out of a merge versus simply increasing the depth of the buffers.

>> Jens Teubner: The thing is, when you -- okay. Basically -- how can I say that?
If you have a low fan-out, then you have to repeatedly, basically, swap the stuff out to memory again and reread it from memory. So that's why you need much more memory bandwidth.

>>: So it doesn't affect the number of remote accesses that you make, right? That's going to be the same no matter what. It's the number of local memory accesses that you make that the fan-out changes.

>> Jens Teubner: Here you're basically reading all your input data once and writing it out once. If you had done just a binary merge, you would read and write, read this and write, and then again read and write. So you would go twice over all your data, and that costs bandwidth.

>>: Well, where do your buffers exist? I mean, I'm a little bit confused. Where are the buffers?

>> Jens Teubner: Well, they end up being in the cache of that thread, so that's why you choose your buffers to have a size that matches your last-level cache.

>>: So all those buffers are on NUMA zero?

>> Jens Teubner: Yes, yes. Yes. Yes.

>>: So the data here has been in no way partitioned, so at the end we will need the one full sorted run?

>> Jens Teubner: Yes.

>>: Okay. [inaudible]

>> Jens Teubner: Okay. One second. Okay. I have just one slide that shows the difference that it makes. In particular, once you scale up to more threads, what you see here is that at some point you really get no benefit anymore from more threads, because you just need so much bandwidth, you completely saturate the memory subsystem, whereas with this multi-way merging, we keep scaling. The reason why scaling stops here is not the memory bottleneck. It's because this machine only has 32 physical cores, so over here we start seeing hyper-threads, okay. So it's a different effect. Okay. So now this question about total sorting versus non-total sorting: there are indeed different strategies, like do you want a global sort at all? What do we do about NUMA? And there are opposing views on that, so one of the views is the one of TU Munich, which is what I'm trying to illustrate here. What the folks from Munich do is they take their input relation and then do range partitioning. So basically it's hash partitioning, but with, you know, an identity hash function: you look at a certain number of bits and then do range partitioning, and then, you know -- so all the blue -- all the low key values end up here and so on. And then you sort locally in each NUMA region, right? And then you end up with a fully sorted result. Now, the claim of the -- and here in the range partitioning phase, you do have to cross the NUMA boundaries, right? Whether you do this with merging, as I just sketched before, or whether you do this with range partitioning, that doesn't really make a difference. You do have to cross the NUMA boundary.

>>: Using local sort and not using the caching will actually increase our [indiscernible]. So what I was talking about earlier: that instead of this local sort and partitioning [indiscernible] we could just do the entire hashing and have several [inaudible] ready to be merged.

>> Jens Teubner: I didn't quite get that.

>>: Okay.

>> Jens Teubner: So the strategy of this MPSM, the massively parallel sort-merge join, is to not sort the other relation -- not globally sort the other relation -- but just do a local sort. Then you end up with blue, red, green, and yellow data on each of the cores, and then you do the join as follows. On each of the cores, of course, you do the same thing.
You do a sequence of joins, merge joins, where you merge this guy with that guy, and then you merge that upper red guy with that red guy and so on, okay. So that's one possible way of doing that. We'll see performance numbers in a minute. The one downside of that is you don't end up with a globally sorted result, because you're merging in values from the same range multiple times and then those won't be globally sorted, okay. Alternatively, what you can do is do the local sorting first and then achieve a global sort with this merging operation, as I sketched before, right? And then you end up with a globally sorted result. You can either choose this, or you can do the range partitioning and then the local sorting first -- it doesn't make so much of a difference. What we're proposing is to do the same thing for the other relation as well, so create globally sorted data for both tables, and then you can compute the join completely locally, so you don't have to do the merging, the merge-sort merging, across the NUMA boundaries, okay.

>>: So how much is data skew a problem here? It could be hard to get the partitioning right?

>> Jens Teubner: Actually, that's a good point. We haven't really measured the skew problem yet. With this range partitioning, what you're doing is -- for any kind of partitioning, you need to first build up a histogram of your data so you know how large your partitions are going to be, because you need to allocate them in memory properly. And then, based on those statistics, you can also choose your partition boundaries properly. If you're doing something like this, you have to do it slightly differently, but at least here you have sorted data, so you know something about the data distribution and can then choose your partition boundaries properly. But that's something indeed where, with skew -- that's something that's very high on our priority list. It's true.

>>: So I have a question. The assumption here is that both tables fit in memory, right?

>> Jens Teubner: Yes. So putting things together: previously I was talking about hash join alone, and there we saw that this radix join has an edge over -- I call it here the naive one; it's the hardware-oblivious implementation. Here this is for a larger data set than we looked at before -- this is the data set that was used by the TUM guys -- but, you know, we saw similar numbers previously also for smaller data sets. Sort-merge, in comparison to that: what I'm showing here is two sort-merge implementations. Basically it's the two strategies that I had on the slide before. It's the MPSM strategy and the one that uses total global sorting. So this is MPSM and this is global sorting. Our experiments show that global sorting is actually advantageous in the end.

>>: What's the axis?

>> Jens Teubner: Oh, sorry. Oh, yeah, I forgot to label the Y axis. That's throughput. That's million output tuples per second. Sorry. Oh, yeah, that's a good point. Yeah. Sorry. Yeah.

>>: So one question. For this type of join, how many rows qualify, or [indiscernible] qualify?

>> Jens Teubner: Yeah, you have -- it's like -- there's a one-to-one match, yeah, yeah. Okay. So what you can see here is that these two implementations don't yet use SIMD acceleration. Introducing SIMD brings quite a bit of an advantage, and what we found makes a real difference is this multi-way merging, so saving bandwidth really helps you quite a bit in the end.
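To illustrate the bandwidth argument behind multi-way merging, here is a deliberately simple C sketch of a K-way merge: every input run is read once and the output is written once, instead of being re-read and re-written in log2(K) binary passes. The implementation described in the talk does not pick minima by linear scan; it decomposes the K-way merge into SIMD-friendly two-way merges connected by small FIFO buffers kept in the last-level cache, but the memory-traffic benefit is the same.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t key; uint64_t payload; } tuple_t;
typedef struct { const tuple_t *cur, *end; } run_t;   /* one sorted run */

/* merge k sorted runs into `out` in a single pass over the input */
void multiway_merge(run_t *runs, size_t k, tuple_t *out) {
    for (;;) {
        size_t min = k;                        /* index of smallest head  */
        for (size_t r = 0; r < k; r++)
            if (runs[r].cur < runs[r].end &&
                (min == k || runs[r].cur->key < runs[min].cur->key))
                min = r;
        if (min == k) break;                   /* all runs exhausted      */
        *out++ = *runs[min].cur++;             /* sequential output write */
    }
}
```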
And well implemented, what we're seeing in our experiments is that sort-merge gets close to hash, but it's not quite as fast as hash. We previously had the situation that you have to be a little bit careful about the data sets that you reason over, so this is the experiment that I just showed before. If you take a different data set, the picture does change. So this is the data set that we saw before, a gigabyte joined with a gigabyte. Here the -- well, this is to the advantage of the hash join. One thing that I would like to note: I've seen several papers and talks about this topic, and I was often surprised how confusingly you can present results, so what I'm showing here is how nicely you can fool readers. For a join, it's really unclear what your metric is. Do you measure input tuples or output tuples per second? Most of the papers chose to use output tuples per second. The TUM paper chooses -- no, actually, I did something wrong here now. No, this is output and this is input. So this must be output and this must be input. Sorry. So depending on what your metric is, you get really different scaling properties. What's being varied here is that you keep one relation constant and you increase the size of the other relation, so you change the relative sizes. And if you count output tuples per second, which was the metric on all the slides before and in most of the papers, then you're seeing these characteristics, and if you count input tuples per second, you're seeing this one here. Of course, there are more input tuples than output tuples. So that's why you get higher values here. When you sit in a talk and hear someone saying, I have this many hundred thousand tuples, 100 million tuples per second, be careful what he's talking about.

>>: Question. On the SIMD permutation, did you use AVX or --

>> Jens Teubner: What?

>>: [inaudible]

>> Jens Teubner: Yes, yes.

>>: You used AVX1, actually, right?

>> Jens Teubner: Yeah, AVX2 is not yet available, right?

>>: [inaudible]

>> Jens Teubner: Yeah, yeah. So we're actually looking very much forward to trying it out on -- yeah.

>>: Okay.

>> Jens Teubner: Yeah, we don't have the hardware yet, but we are very much looking forward to it. We actually played around a little bit. There was this question before about, if sort-merge starts to become attractive for joins, what about the other operations, such as aggregation. So what I'm comparing here is the state-of-the-art PLAT algorithm for aggregation -- which is hash-based -- and you see the throughput depending on the number of distinct groups for a grouping operation. For few groups, hashing gets very cache-efficient, but at some point it just gets expensive. Here we see that sort-merge is still -- you know, depending on the cardinality, it can be significantly slower, but it is very, very robust. I mean, this is something that's, I guess, known also from classical databases. These are two implementations, one that's just a plain sort and then aggregate, and the other one does partial aggregation in between: when you see, in intermediate runs, sequences of the same value, you can aggregate them right away.

>>: [inaudible]

>> Jens Teubner: PLAT does that to some extent, yeah. Okay. With that, let me conclude. So basically I argued about two things. One was: are hardware-conscious optimizations worth the effort? This is what I focused on a little bit on the hashing side.
And our current results indicate that yes, it is really worth the effort. They're much more robust than you might think and they're simply faster. We found that hash join is still the faster alternative, but certainly sort-merge join is getting close. What is hard to evaluate numerically is that sort-merge has this nice property that it produces sorted output, which might be useful for upstream [indiscernible] processing, right? So that might be another argument for sort-merge joins. What we're really looking forward to is having a look at future hardware. Haswell has been announced with, I think, at least gather support, so in the SIMD registers you could read from multiple memory locations with one instruction, which, of course, favors hashing. So we're going to see wider SIMD, which favors sort-merge, but we might also see functionality in that direction, and what does that mean for the relative performance? With that, I'm -- oops. Sorry. Yeah. Okay. With that, my app crashed.

[applause]

>> Jens Teubner: I wanted to say a thank-you in particular to Charlie, the Ph.D. student who did that. Questions? I used up almost all the time. Sorry.

>>: Thank you very much.

>>: Thank you.

[applause]