>> Ravi Ramamurthy: So good morning. It's a pleasure to have two visitors today, Themis and Stratos, and both of them are ex-interns of MSR. Themis is currently a faculty member at the University of Trento, Italy. He got his Ph.D. from the University of Toronto. He works broadly in the area of event processing, he got a best paper award at ICDE 2009, and he's going to be the co-general chair of VLDB 2013. VLDB 2013 is in Italy, so that's good news for us. Stratos Idreos is from CWI. He got his Ph.D. working with Martin Kersten, and his thesis won the SIGMOD doctoral dissertation award this year. And he's continuing that work as a faculty member, and he's going to be talking today about adaptive indexing, and Themis is going to be talking about indexing also, but indexing time series. >> Themis Palpanas: Thank you very much. Thanks for the invitation and for organizing this, Ravi. So this is work on indexing and mining a billion time series, and a billion in the particular case of time series is a big number, since previous works have only considered up to 10 million time series. So this is joint work with my student Alessandro Camerra from Trento and also Jin Shieh and Eamonn Keogh from the University of California at Riverside. I will try to end my talk by 10:15 so that will give time to Stratos as well. Please feel free to interrupt me if you have questions. So this is all about time series. And what is a time series? A time series is just sequential data points measured over time. And time series are ubiquitous. So there are time series in finance, in any kind of scientific experiment and domain, and we also have time series in some unexpected domains. So, for example, motion capture, gesture recognition in video analysis, as well as time series that describe the contours of shapes. In this particular case we can transform the contours of these shapes into time series and then use time series techniques to identify similar shapes regardless of the scaling, rotation, et cetera, of these shapes. So there's a pressing need to index time series data, and the reason that we want to do this is because it is only by indexing them that we can actually mine this data fast. So the basic operation that we want to support very fast is similarity queries on time series. And this is going to be the focus of the index that I'm going to talk about now. If you can actually support this operation very fast, then you can also support all different kinds of mining operations on time series, such as classification, clustering, deviation detection, trend analysis and so on and so forth. The issue here is that these time series collections that we're going to index and then mine can grow extremely big. So these are two examples, one from General Electric where they want to monitor the operation of gas turbines that they operate in different parts of the world. In this case they talk about a million samples per minute. Another example is from the Tennessee Valley Authority. This is a company in the eastern part of the United States that monitors the health of the electrical grid on the East Coast, and these guys talk about rates of 3.6 billion points per day. So these are measurements over time, they are naturally time series, and these guys want to actually at some point mine these time series, identify similar trends, et cetera. So this is the basis of what we're going to be talking about.
So we want to index and then mine these time series, and, of course, there is some very good news. So there are time series indices that have been proposed in the literature, and one of those, which is going to be the focus of this study, is iSAX. So iSAX is a recent index whose utility previous studies have demonstrated. So I'm going to just talk about this particular index. The bad news is that building this index takes too long. So, for example, there is no study that uses more than 10 million time series, and in the particular case of iSAX, if we're going to index 500 million time series, and in this case I'm talking about time series of size 256 real values, it takes 20 days, which makes it hardly practical for large collections. So the contributions of our work are that we propose novel mechanisms for the scalable indexing of these time series. So basically we propose a bulk loading algorithm for this iSAX index that reduces the number of random accesses by two orders of magnitude. So obviously here the main problem is how to make sure that all your disk accesses -- that most of your disk accesses are sequential. And the second contribution here is the node splitting policies. With the new method that we propose we can reduce the index size by one-third. So this second contribution came about because the iSAX index, as I will talk about in more detail later on, is not balanced. So we have the first experimental evaluation with this kind of size of a time series collection, 1 billion time series, which validates the scalability claims, and we also have some case studies in diverse domains: entomology, genome sequences, and web images. And I will just briefly talk about the second application, the genome sequences. So a little bit of background on iSAX. So the iSAX index is based on the iSAX representation, or summarization, of time series. So if you assume that we have this time series here with all the blue points, the first step in order to produce the iSAX representation of this time series is to do a PAA summarization. So basically we produce segments of equal length, and each one of these segments has as its value the average value of the points that it contains. Then in the next step what we do is that we divide the y axis into different regions, and for every segment that falls in one of these regions, we assign to that segment a bitwise representation that encodes that region. So in this particular example, we have divided the entire space into four regions, and we need two bits to encode those four regions. So any segment that falls in this region is encoded with the two bits 1, 1. Any segment that falls in this region is encoded with the two bits 0, 1, and so on and so forth. This division of the space is not equal because these time series are normalized, so they have a zero mean and a standard deviation of 1, and what we're trying to do is end up with an approximately equal number of segments falling into each one of these regions. So this is the basis for the representation. And the nice thing with this representation is that we go from the real value domain to the domain of bitwise representations. Thus, we save lots of space. And we can also have a representation of some time series with four segments where each segment has a different cardinality. So we can have a varying level of detail for the representation of each segment. And all this is naturally supported by iSAX.
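To make the PAA-plus-discretization step concrete, here is a minimal sketch in Python -- my own illustration, not the authors' code -- that turns a z-normalized series into an iSAX word of a fixed cardinality. The segment count and the breakpoint values (the standard-normal quartiles for cardinality 4) are assumptions chosen for the example.

```python
import numpy as np

# Breakpoints for cardinality 4: the quartiles of the standard normal, so that
# a z-normalized series falls into each of the four regions equally often.
BREAKPOINTS_CARD4 = [-0.67, 0.0, 0.67]

def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length segment."""
    chunks = np.array_split(np.asarray(series, dtype=float), n_segments)
    return np.array([chunk.mean() for chunk in chunks])

def isax_word(series, n_segments=4, breakpoints=BREAKPOINTS_CARD4):
    """Map each PAA segment to the integer symbol of the region it falls into."""
    series = np.asarray(series, dtype=float)
    series = (series - series.mean()) / series.std()     # zero mean, unit std dev
    return [int(np.searchsorted(breakpoints, v)) for v in paa(series, n_segments)]

# A 256-point series reduced to a 4-symbol word over cardinality 4.
ts = np.sin(np.linspace(0, 6, 256)) + 0.1 * np.random.randn(256)
print(isax_word(ts))    # prints the 4-symbol word, e.g. [3, 3, 0, 0]
```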
Now, what does the index look like? It is a tree which is not balanced, and that's what I'm trying to show here, and in this tree we start out with a very coarse representation of the time series. And, of course, as we go down the index, we refine this representation. And obviously, by using this structure, we can quickly search for similar time series, and we prune the search space. So, for example, if I search for a time series that has a representation that is somewhere here, so it has this kind of shape, I will go down this path and I will exclude the other paths -- so, for example, paths that represent this kind of time series, right? And this is what makes search fast. Of course, we end up with the exact solution in this case, right? One important thing here is that with the iSAX index what we do is that we store the entire index in main memory, and this is feasible because we use the bitwise representation for time series in all internal nodes, and in main memory we only have this bitwise representation for every internal node. The leaf nodes are stored on disk, and apart from the bitwise iSAX representation, they also have to contain the raw time series, all the individual real values, because at the very end we have to access the raw time series and discard any false positives from our similarity search queries. So this is a wonderful index, but then the question is why it takes so long to index large time series collections. And, of course, the answer is that iSAX implements a naive node splitting policy. So when it splits a node into two, it may be the case that all the time series of the node will end up in one of the children, and this is what we try to address. And the most important thing is that it implements no bulk loading strategy. >>: So this looks like a data reduction followed by two-dimensional indexing. Is that the right -- >> Themis Palpanas: This is not two-dimensional, because you may have several segments, right? And the number of segments, which is the dimensionality, may also be on the order of tens or dozens. And this is why traditional structures do not work very well here. Actually, some studies have shown that in several cases it pays off to just do a linear scan. >>: But do you do the data reduction before you load the data or do you do that as you go? >> Themis Palpanas: As we go. >>: Okay. So the question I was trying to think of is do you know how you're going to reduce the data? Do you know how you're going to map it? >> Themis Palpanas: Yes, we do. >>: So you know the bands that you're going to map to your bits ahead of time? That's not something you find -- >> Themis Palpanas: Yes, we do. We do, yeah. So this part is fixed. Okay. So let me first say a few things about the bulk loading algorithm. So the design principles here are that we're going to take advantage of all the available main memory and maximize the number of sequential disk accesses. The intuition is that we want to group together all the time series that are going to end up in the same leaf node and write them to disk at once with a number of sequential accesses. And the reason why we cannot do this easily is because when you have 1 billion time series, all the time series that are going to fall in the same leaf node may be dispersed throughout this collection. So if I had a way to precluster all my time series according to their leaf node, then I would solve this problem.
And actually several of the bulk loading strategies for traditional indices are based on exactly this idea: you do some kind of preclustering of your data and you then bulk load the index. This cannot work here because in this domain it's extremely hard to cluster unless you have an index. So the index helps you do the clustering fast. So we cannot follow that approach. And here is the high-level description of what we propose. So we propose an algorithm that works in two phases. During phase one we read in the time series and we group them according to the first-level nodes. The first-level nodes are the children of the root. And in this operation we use the entire available main memory. When we have exhausted the main memory, then we switch to phase two. During phase two we process together all the time series that are contained in the same first-level node. So we grow the subtree below that node down to the leaf nodes and we flush to disk. At the end of phase two we have processed the time series contained in all the first-level nodes, we have flushed them all to disk, we have freed up all the main memory apart from the memory used by the index itself, which is a small portion, and we go back to phase one. So let me show you what this looks like. So this is the root node and these are the first-level nodes, the children of the root, which have not been materialized in main memory yet, right? And this is why they are drawn dashed. So we introduce this first buffer layer, which is a layer of buffers. We have one buffer connected to each one of these first-level nodes. These buffers do not have a fixed size, so they grow as needed according to how many time series are routed to each one of these nodes. So what happens is that when time series come in, we just route them to the corresponding node. Instead of inserting them in the actual node, which in this case would be a leaf node, we just insert them in the buffer. When we have used up all the memory, then we go to phase two. In phase two we have this leaf buffer layer, which is another layer of buffers. And in this case each buffer corresponds to one of the leaf nodes in the tree. And the size of each leaf buffer is the same as the size of the leaf node, so it contains, for example, 1,000 time series. So this picture here shows what has happened when we are midway through phase two. So I have already processed these two buffers. I have grown this part of the tree and flushed everything to disk. So let's see what happens here. So in this case I have a whole bunch of time series in this buffer. I process all these time series and I grow the subtree rooted at this node. So I keep inserting time series here. Once I exceed the limit of 1,000 time series, I split this node into two, I divide these time series, and so on and so forth. At the end, when I'm done inserting all the time series from the first buffer layer into the leaf buffer layer, I'm going to flush those to disk. Now, the interesting thing here is that when I do this, I need no extra memory, because I just move time series from one position in memory to another, right? So I'm still making good use of the memory. And then, of course, I just flush everything to disk. At this point I have released all the main memory and I'm ready to start the first phase again. Okay. And these, of course, are mainly sequential writes.
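As a rough illustration of the two-phase idea, here is a hedged sketch: the class name, the callback functions, and the series-count memory accounting are my own simplifications, not the actual implementation.

```python
from collections import defaultdict

class TwoPhaseBulkLoader:
    """Sketch of the two-phase bulk loading idea: buffer incoming time series
    per first-level node (phase one) and, when memory is exhausted, build one
    subtree at a time and flush its leaves with sequential writes (phase two)."""

    def __init__(self, route_to_first_level, grow_subtree, flush_leaf, memory_budget):
        self.route = route_to_first_level   # series -> key of its first-level node
        self.grow = grow_subtree            # (node key, series list) -> {leaf id: series list}
        self.flush = flush_leaf             # (leaf id, series list) -> one sequential write
        self.budget = memory_budget         # how many series fit in main memory
        self.buffers = defaultdict(list)    # the first buffer layer
        self.buffered = 0

    def insert(self, series):
        # Phase one: append to the buffer of the corresponding first-level node;
        # these buffers grow as needed, there is no fixed preallocation.
        self.buffers[self.route(series)].append(series)
        self.buffered += 1
        if self.buffered >= self.budget:
            self.flush_phase()

    def flush_phase(self):
        # Phase two: one first-level node at a time, grow its subtree down to
        # the leaves (the leaf buffer layer) and flush each leaf sequentially.
        for node_key, series_list in self.buffers.items():
            for leaf_id, leaf_series in self.grow(node_key, series_list).items():
                self.flush(leaf_id, leaf_series)
        self.buffers.clear()
        self.buffered = 0

# Toy usage with a fake routing function and print-based "disk writes".
loader = TwoPhaseBulkLoader(
    route_to_first_level=lambda s: "hi" if s[0] > 0 else "lo",
    grow_subtree=lambda key, series: {key + "_leaf": series},
    flush_leaf=lambda leaf, series: print(f"flush {leaf}: {len(series)} series"),
    memory_budget=4,
)
for ts in ([1, 2], [-1, 0], [3, 1], [-2, 5]):
    loader.insert(ts)      # the fourth insert triggers phase two
```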
Let's also say a few things about the splitting policy that we use. So the design principles here are that we want to keep the index small and that, when we split some node, we want to divide the time series pretty much equally between the two children nodes. And the splitting happens according to the iSAX representation of the time series, right? And the intuition for the solution that we propose is that we want to split on that segment of the representation for which the iSAX symbols of the time series will fall almost equally on the two sides of the breakpoint. So let me show you a picture of how this works. Assume that we have time series and we represent them using four segments. If I use a cardinality of one bit for the iSAX representation, this means that I only have two regions, 0 and 1. Let's say that these are all the time series, so each combination here is one time series in my node, and my node currently uses one bit of cardinality to represent these time series. Now, I want to increase this cardinality by one. So if I use cardinality two, then this means that I now have four regions to divide the time series into. And we split on the segment for which the highest cardinality symbols lie on both sides of the breakpoint. So, for example, in this particular case, if I decide to split on this segment, then all the time series will end up in one of the children. If I split on this segment instead, then I will have an approximately equal number of time series going to one child and to the other. And it turns out that we can actually do this pretty efficiently just by recording the first two statistical moments, which we can do efficiently in an online fashion. So we record the mean and the standard deviation, and then what we have to do is pick the segment to split for which the breakpoint, the horizontal line, is within the range of the mean plus or minus three standard deviations and closest to the mean. So this is obviously a heuristic which tries to capture the intuition that I mentioned earlier, that we want to have time series that fall on both sides of the breakpoint. And as it turns out, as I will show later on, this works pretty well. So what happens if the breakpoint for no segment satisfies this condition? In this case what we will do is keep increasing the cardinality of the representation until we can satisfy this condition. So this means that in our new approach we may end up skipping some steps of the original iSAX index algorithm and move faster towards a good split.
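Here is a minimal sketch of that heuristic, assuming we already track per-segment means and standard deviations for the series in the node and know each segment's next-cardinality breakpoint; the function and the numbers are illustrative, not the paper's code.

```python
def choose_split_segment(segment_means, segment_stds, next_breakpoints):
    """Pick the segment whose next-cardinality breakpoint lies within
    mean +/- 3 standard deviations and is closest to the mean, so the
    series spread roughly evenly across the two children.  Returns the
    segment index, or None if no segment qualifies (in which case the
    real algorithm increases the cardinality further and retries)."""
    best, best_dist = None, float("inf")
    for i, (mu, sigma, bp) in enumerate(zip(segment_means, segment_stds, next_breakpoints)):
        dist = abs(bp - mu)
        if dist <= 3 * sigma and dist < best_dist:
            best, best_dist = i, dist
    return best

# Toy example: segment 1's breakpoint sits nearest its mean, so we split on it.
print(choose_split_segment([0.1, -0.05, 1.2], [0.4, 0.3, 0.2], [0.67, 0.0, 0.67]))  # -> 1
```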
So for the experimental evaluation, we implemented this in C# and tested it using one big machine that had 24 gigabytes of memory and two terabytes of disk, and also a small desktop with 3 gigabytes of main memory and half a terabyte of disk. We compared against the original iSAX algorithm, which was making use of all the available main memory for disk buffering, and also against iSAX with BufferTree. So this is the original iSAX with BufferTree. BufferTree is another technique that has been proposed in the literature for bulk loading; it was proposed for R-trees. And as you will see, this doesn't perform very well. The reason is that BufferTree was proposed for a balanced index. In our case, iSAX is not balanced, so BufferTree, which uses a fixed amount of memory for specific levels of the tree, does not make effective use of the main memory, because it preallocates this memory and does not know where we will need it most. And this is exactly what we're doing. So if we focus just on the benefits of the splitting policy, here's what we have. So in this case we did some experiments with collections of from 1 up to 100 million time series. In the left graph we see the index size. And we can actually see that the new splitting policy results in 34 percent fewer nodes for the index. This is because all of our splits are useful. And, actually, the node occupancy also increases by 20 percent. So this means that we end up with a more compact index, and this results in a smaller number of leaf nodes, which translates to less disk I/O. So just by using the new splitting policy, the build time for these collections of time series is reduced by one-third. Now, let's see what happens with bulk loading. So previous techniques obviously take too long; it takes 20 days to index 500 million time series. This is the original iSAX in red, using all the available main memory for disk buffering. This other line here is the original iSAX with the BufferTree. And as I was explaining earlier, we see that the performance is even worse, and this is because the BufferTree preallocates memory for specific nodes in the tree. Our tree is not balanced, so this does not work in our case. So this is the result with the new index. So with the new index we can scale up to 1 billion time series in our experiments. We could not do the same experiment with the original iSAX; in this case we estimate that it would have taken approximately two months to index all these time series. In our case we finished the job in 16 days, which is 72 percent less time, and this translates to an indexing time per time series of approximately 1 millisecond. Let me also say that the reason for these results is mainly the disk accesses: we managed not only to reduce the number of disk I/Os but also to make sure that the vast majority of these disk accesses were sequential. So in this graph here, for collections of 100 million to 1 billion time series, we see that the new index can reduce the number of disk page accesses by 35 percent. And almost all of the disk accesses that it does, 99.5 percent of the disk accesses, are sequential, which makes a tremendous difference. And so we also did some experiments with some real data. I'm just going to talk about the second case study, which involves genomic data. This is 22 million time series of size 640, total size 115 gigabytes, and these are experiments on the smaller machine, the desktop computer. In this experiment we wanted to identify mappings between chromosomes of two different species, the human and the Rhesus Macaque. So the Rhesus Macaque is a species of monkey that is closely related to humans. Biologists are very interested in these mappings exactly because these two species are so close to each other. The problem with these mappings is that the two species have a different number of chromosomes, so the mappings are not obvious. So what we did is that we translated the genome sequences into time series by chopping them up into small overlapping time series of size 640. And then what we wanted to do was try to see if the same or similar time series occur in the chromosomes of these two different species. And we ended up with this kind of picture that basically shows the mapping between the human chromosomes and the monkey chromosomes. So each line here shows one possible such mapping.
So a mapping is a subregion of a monkey chromosome which is similar to a subregion in some human chromosome. And it's interesting to see here that in some cases we have regions of the same chromosome of one species mapped to subregions in two different chromosomes of the other species. So here we're not claiming that we have solved some biological problem, but the claim is that we can help the biologists do their research in this particular domain. So, for example, this picture shows the mappings that have already been verified by biologists in their experiments. And if you compare the two, the mappings that the biologists have identified are a subset of the mappings produced automatically by our approach. Once again, we're not claiming that we have solved this problem, but we can claim that we can direct scientists to the interesting parts of their data so that they can very quickly focus on what is a highly probable mapping in this case. I'm going to skip this. So once again, one of the take-home messages here is that this is the first approach that can scale to such large time series collections, one billion time series, and what we're trying to do here is enable practitioners and scientists to do pain-free analysis of their existing data collections. Actually, we're working on some improvements of our technique, and our experiments show that we can get a 40 percent reduction in the total build time. So this would mean an indexing time per time series of approximately half a millisecond. I would like to conclude with this thought. So what next? We think that the next challenge will be to index 10 billion time series. And let me stress that this is not a matter of chasing numbers. This is all about enabling people to actually analyze the data that they already have. So we're working with some neurobiologists. These guys have these big machines and they do functional magnetic resonance imaging. So a person gets in there and they analyze their brain. Whenever there is a stimulus to the subject, this machine produces an image that shows how the subject responds to that stimulus. Well, a single experiment, which means one subject in a single test, produces 12 gigabytes of data: 60,000 time series, where 60,000 is the number of points in the brain that we monitor. So we can individually focus on each one of these 60,000 time series, each of length 3,000. And what's even more interesting is that right now there's a competition, a classification task, to detect based on this kind of data whether or not a subject suffers from ADHD. So in this case, for the competition, there are 776 subjects for a total size of 9 terabytes of data, and if you translate this into what we've been talking about in this presentation, we're talking about 4.5 billion non-overlapping time series, or 1,000 billion overlapping time series if you're going to do a more fine-grained analysis. What these people have been doing so far is reducing each one of these 60,000 time series to a single number and then trying to do the classification. And obviously there is much more information in their data than that. So this is the point that I'm making: we need to help them use all their data. Of course, I agree that parallelization helps, and there's a lot of room for doing that here. Actually, what we proposed is amenable to parallelization, but, once again, this is all about how to most efficiently use each individual machine. Right. So this is the end of my talk.
Let me just say that this is where we're located, at the center of Europe. We're open to collaborations, and if you're ever in the area, we'll be glad to have you over for a visit to talk about research and also for other reasons [laughter]. So thank you very much. Any questions? [applause]. >>: What about scaling this out? >> Themis Palpanas: So you mean to multiple machines? >>: Of course. >> Themis Palpanas: Yeah, we haven't looked at that yet. But what we've been talking about is amenable to parallelization. >>: Is there any technical problem? You just partition up the stream and you build indices on multiple machines and then merge them at the end, or do you fan out queries at the end? >> Themis Palpanas: Well, this concerns the index build part. So what I was thinking about was to parallelize the first level of the index. So you have a different machine handle each one of the first-level nodes and its subtree. We haven't really worked on that yet, but that is a very interesting direction for sure. >>: To what extent does the efficiency of the approach depend upon the partitioning being reasonably even, reasonably uniform? Do you depend on it in some way -- say you have sort of a preliminary index built to help you with the subsequent partitioning. Are you depending upon the fact that that's a good sample of what the data is? And what happens if the data is skewed in a different way than that sample? >> Themis Palpanas: Right. So this is exactly why using iSAX with BufferTree suffered. Because basically that approach assumes a uniform partitioning of the data, which is not true in real life, and that's why it doesn't perform well. In our approach we make no such assumption, and it doesn't matter what the partitioning is or what the skew of the data is. So this has no effect whatsoever. As I said when describing the algorithm, we don't have a fixed amount of memory assigned to each buffer in the first level, right? So these buffers grow as needed according to what data come in. So there's no problem -- >>: But it could mean that your partitioning is not very effective, right? It could mean that you have a whole bunch of partitions where modest amounts of data go and an enormous amount of data goes to one partition -- >> Themis Palpanas: Sure. That's no problem. >>: -- in which case your partitioning wouldn't have done you much good, right? >> Themis Palpanas: Well, this will not affect the index build time. This may affect searching, right? And then the question is whether this particular index -- that's not the right question -- whether the particular parameters used for this index are appropriate or not for a particular data set. But this doesn't affect what I have presented here. >>: So the overlapping time series used to build an index [inaudible] -- have you thought about taking advantage of this in some way to improve performance, because there's so much commonality between these? >> Themis Palpanas: Right. This is a good question. So this goes back to whether you can actually cluster your time series to start with. But this would assume that you have actually done or you know how to do such clustering. In our case we don't assume that we know that. So we don't assume that all the overlapping ones are together. Also, another thing is that even if you do know, this overlap only has a limited advantage, because after a very short time you end up having time series that are significantly different from each other. Okay. Thank you very much.
[applause] >> Stratos Idreos: Hi. I'm Stratos. I will talk today about database cracking and adaptive indexing. Database cracking was the topic of my Ph.D. thesis back in Amsterdam, and this work is in general joint work with Martin Kersten and Stefan Manegold, who were my advisors, but also Goetz Graefe and Harumi Kuno, who joined us at some point. And actually part of the thesis is also co-authored by them. So what I'm going to try to do is give a five or ten-minute general overview of the whole thing and then zoom in on one of the subtopics, one of the technical topics, which is tuple reconstruction, and it's one of the most important topics in column stores. This work is on column stores. But for those of you who want to see more of the other details, I have the slides to talk about this. So in general, our problem is physical design: how to tune databases, how to make this as automatic as possible. So nowadays, I guess you know, this doesn't work out of the box. We really have to go and decide what kind of indexes we want -- why we want to create these indexes and when, and on which of the data parts. So this is typically the job of the DBAs and of offline auto-tuning tools, but this is quite a hard topic. And typically how it works is as follows: you have this kind of a timeline where you first collect workload information, or get this somehow. Then you analyze this workload information depending on your system, and then you go ahead and you actually create a physical design. So this takes quite some time, and only after this whole thing is finished, only then can you process queries. So our goal here is to basically try to make this much faster. So the question is what happens with dynamic workloads, which can really break this timeline, right? You might not have all the knowledge. You might not have all the time to do this. And what happens with very large databases, like scientific databases, for example? So in places where you have terabytes of data coming in on a daily basis -- and especially since we expect this in the future -- making possibly a wrong choice about creating an index can be a big cost, because it takes so much time to create indexes. So we identified idle time and workload knowledge as two of the critical parameters that we are trying to work on. Idle time you need in order to do the analysis and build the indexes, and workload knowledge you need because you want to know on which data parts you want to build indexes. So if you don't have this information -- and nowadays, if you think about it, with social networks and different kinds of databases, you don't really have this kind of information up front. So what can go wrong? For example, you might not have all the time that you need to finish the proper tuning. You might have some of the time, but not all of it. Second, by the time you finish the tuning, the workload might have changed. So let's say scientists look at a database, at the new data that they receive from, let's say, telescopes, and they realize that they're looking for something else, and then they have to change the workload. Still, even if none of this is true, there's no indexing support during the tuning phase. So if you really want to query your data while you are waiting for the indexing, in the meantime you have to rely on very poor performance, rely on plain scans. Another part is that currently our indexing technology does not allow us to focus on specific data parts.
You still have to index whole columns. You can select columns, but you cannot select parts of columns. So database cracking has the vision of basically removing all physical design steps and still getting similar performance without a fully tuned system, and then you don't need DBAs. And the way that we found we can approach this goal is that we go deep inside the kernel and try to do it there. So we started designing new database kernels, which we call auto-tuning database kernels, and that involves really starting from scratch. You have to think about new operators, new algorithms, new structures, new plans, new optimizer policies, everything from scratch. So, let's say -- well, there are no monitoring steps and there are no preparation steps, because we assume that there's no idle time. There are no external tools, because we say that takes too much time; everything is inside the kernel. There are no full indexes, because we don't want to index parts that we're not interested in, so we selectively index small parts of the data. And there's no human involvement because, again, everything is automatic inside the kernel. So what we do is continuous, on-the-fly physical reorganization. So continuously, as the queries arrive, we reorganize the data. And you can think of this as partial, incremental, adaptive indexing. We partially index, we adaptively index. Everything is driven by the queries. And some of the techniques that I'll show you are fully designed for column stores. Some of this should be possible in row stores as well, but maybe with some bigger changes. So the main motto of database cracking is these two lines here: every query that we get, we treat as an advice on how we should store the data. So the system sees the queries and says, okay, this is how to store the data, and there are small changes continuously. So let's see an example of that. And this is not how cracking works nowadays, but this is the very first implementation, and I think it's a simple example to start with. So let's say we have this single column here in the column store. It's a fixed-width dense array. And this is a very simple query, just a range predicate requesting values between 10 and 14 of column A. So to answer this query we will reorganize the column physically. So we'll take all the values smaller than 10 and move them to the first part of the column, all the values between 10 and 14 to the middle part of the column, and then the rest of the values, which are bigger than 14. So now we have created some sort of range partitioning in there, and as far as this query is concerned, we have collected the result values in a contiguous area. So whatever else this is, we just answered the query. But more crucially, we gained some knowledge that we can exploit in the future, because there's some range partitioning in there. Now, the important thing for database cracking is that this is not something that happened after we processed the query; this is how we process the query. So this is the very first select operator that we actually created for cracking. So the select operator is a simple procedure that takes a single column and a range predicate and then returns the column back reorganized, with all of the values inside the range collected in a contiguous area.
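A minimal sketch of that first cracking select operator, written in Python with a made-up column rather than inside a real kernel; the real operator does the partitioning with in-place swaps over the array, but the effect is the same.

```python
def crack_select(column, low, high):
    """Physically reorganize the column so that values < low come first,
    values in [low, high] sit contiguously in the middle, and values > high
    come last.  Returns the reorganized column plus the (start, end)
    positions of the qualifying middle piece."""
    left   = [v for v in column if v < low]
    middle = [v for v in column if low <= v <= high]
    right  = [v for v in column if v > high]
    column[:] = left + middle + right          # in-place reorganization
    return column, (len(left), len(left) + len(middle))

col = [13, 16, 4, 9, 2, 12, 7, 1, 19, 3, 14, 11, 8, 6]
_, (lo, hi) = crack_select(col, 10, 14)
print(col[lo:hi])   # the qualifying values, now stored contiguously: [13, 12, 14, 11]
```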
And then, since this is a bulk-processing column store, you give this to the next operator and say that in this position of the array you have the qualifying values, and you can go ahead with the rest of the query plan. So now if you get a second query -- again, this is a very similar query, so basically it's just a one-operator query, you only have a select operator -- now we have a different predicate, a different range request, but on the same column. So now we can exploit the range partitioning in there. And remember we're doing continuous physical reorganization, so we take now the lower bound, which is 7, and we see that it falls within the first piece. So we go ahead and, again, we crack the first piece using the lower bound now. So we take values smaller than 7 and values bigger than 7, and we crack them into new pieces. Then we go to the middle piece and we realize that, well, this qualifies completely, so we don't do anything with that. And now we're gaining some speed, because we don't touch this piece. And then we go to the next piece and see that, well, this is where the upper bound falls, and, again, it's not an exact match -- we don't have an existing piece boundary on this value -- so now we crack this piece on value 16 and create two new pieces. And the result is the same as before. We just collected the values in a contiguous area, so everything between 7 and 16 is now in the three middle pieces, and more crucially, we gained even more knowledge about the future. So now we have even more partitioning information, even more structure. Now, if you generalize this and think about how this evolves over thousands of queries, then you have more and more of these pieces, which become smaller and smaller. This means that we can skip more and more pieces. And in general, with each range request in the select operator, you only have to crack at most two pieces, at the edges of the range that you request. So this becomes continuously faster and faster. And let's see an example here. So here we have single selections, with random selectivity and random value ranges requested, and we compare against the plain scan, which is the red curve here, so basically every query just scans the column; the purple here is cracking; and the green curve is full indexing. And by full indexing here I mean that the very first query will go ahead and completely sort the column with quicksort, and then you can just do binary search. So the first query does quicksort and binary search, and then from the second query on you just do binary search. So observation number one is that the first query is of course much more expensive for full indexing, while cracking is relatively comparable to the scan; typically it's about 20 or 30 percent slower. So this means that the user doesn't really notice any big difference, because it's really comparable to the default performance that he would expect from the system. Then the more queries you answer, the better performance becomes: the x axis is the number of queries and the y axis is the response time, so you see that the more we go to the right, the more queries we answer, performance automatically improves. Of course, full indexing has super fast performance, because we have already spent all the time investing in creating the index. And then eventually, after many queries, cracking nicely converges to the optimal performance.
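To make the "at most two pieces per query" behavior concrete, here is a toy cracker index that remembers the piece boundaries created so far; the data structure names and the list-rebuilding partition step are my own simplifications, not the system's code.

```python
import bisect

class CrackerColumn:
    """Toy cracker index over one column: 'pivots' is the sorted list of crack
    values seen so far and 'starts[i]' is the first position holding a value
    >= pivots[i], so a later range query only cracks the (at most two) pieces
    that contain its endpoints and reuses everything in between."""

    def __init__(self, values):
        self.values = list(values)
        self.pivots = []
        self.starts = []

    def _piece(self, v):
        """(begin, end) of the piece that value v currently falls into."""
        i = bisect.bisect_right(self.pivots, v)
        begin = self.starts[i - 1] if i > 0 else 0
        end = self.starts[i] if i < len(self.starts) else len(self.values)
        return begin, end

    def _crack(self, pivot):
        """Split the piece containing `pivot` into values < pivot and >= pivot."""
        if pivot in self.pivots:
            return
        begin, end = self._piece(pivot)
        piece = self.values[begin:end]
        small = [v for v in piece if v < pivot]
        self.values[begin:end] = small + [v for v in piece if v >= pivot]
        i = bisect.bisect_left(self.pivots, pivot)
        self.pivots.insert(i, pivot)
        self.starts.insert(i, begin + len(small))

    def select(self, low, high):
        """Answer [low, high]: crack at most the two edge pieces, then read
        off the contiguous qualifying area."""
        self._crack(low)
        self._crack(high + 1)          # exclusive upper pivot
        start = self.starts[self.pivots.index(low)]
        end = self.starts[self.pivots.index(high + 1)]
        return self.values[start:end]

c = CrackerColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3, 14, 11, 8, 6])
print(sorted(c.select(10, 14)))   # [11, 12, 13, 14]
print(sorted(c.select(7, 16)))    # the second query only refines the edge pieces
```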
So the main message here is that, without having to spend the full indexing time up front -- assuming that we don't have the knowledge and the time to do this, of course -- we can still improve performance without a noticeable overhead for the user. Now, if we change the metric so the y axis now is cumulative average time -- and, again, the curves are the same techniques -- then we see that after 10,000 queries, full indexing has still not amortized the initialization cost. So this means that if anywhere within these 10,000 queries the workload changes, so this column is not useful anymore for the workload, then the whole indexing effort up front was a waste of time. And, of course, let me also note that these examples are completely random. I mean, we really assume here that the users are interested in the whole space of the column, the whole value range, which might not be the case. If you have a skewed workload, and I will show you such a case later on with TPC-H, then performance for cracking will improve basically instantly, and then this difference will be much, much higher. Okay. So I have the slides now for basically what we did over the years with database cracking, and then I will zoom in on one of the subproblems. So the first thing was selection cracking, which was basically the slides that I showed you before: how you take the select operator and how you adapt it for continuous physical reorganization. Then we worked on updates. And updates are quite crucial here, because we keep changing the columns continuously; basically we make every read query a write query. So what happens if you have updates as well? And then we came up with some, again, adaptive algorithms to plug updates into, again, the select operator. So updates happen in an adaptive way, in the sense that when you get updates, you put them on the side, and when queries come that actually need the actual column, the actual value range, then while you are reorganizing for this value range, you selectively merge in only the updates that this specific query requires. Then we worked on what actually made cracking possible in the context of a complete database kernel, which allowed us to run complete multi-attribute queries; we call this sideways cracking, and this is what I will talk about in the rest of the talk. There's some work on joins; I'm still trying to publish that. And our latest work on adaptive indexing is basically about doing partitioned cracking, trying to do several optimizations to fit this also on disk, and then playing around with how much initialization cost and how fast a convergence you can have, and what the balance is between these kinds of things. So here's a slide, an overview slide, visually showing the techniques. So let's say you have these two tables here, and think of the empty rectangles over there as the indexing space for these tables. So the first thing is that we do partial materialization. So we don't build complete indexes, but selectively, when queries come, we just pick out only the values that the queries request. So we create these small pieces over the tables, and we also index those spaces, and we continuously reorganize the data inside these spaces, keeping up this information. When we run out of storage -- we have a storage threshold -- then we can just selectively drop these pieces using LRU and create new ones.
And when we want to use two columns in the same query plan, we have to make sure that these columns are aligned. So we don't do tuple reconstruction, which is practically the most expensive operation in a column store. So when we want to use two columns together in the query plan, we don't try to do a positional join. Instead, we just make sure that these columns have been reorganized in exactly the same way, such that if in some position of column A you find, let's say, tuple x, then in exactly the same position in column B you find exactly the same tuple. And if you reorganize the columns in exactly the same way, then you can make sure this is true. Once pieces become small enough to fit in the CPU cache, we sort them, which helps for administration, for concurrency control, for various issues, and when we have queries across tables, then we can use this kind of information for efficient joins. Okay. So I will go on now to talk about tuple reconstruction and why this is important both for cracking and for column stores. So this is typically what a column store looks like. You have fixed-width dense arrays. You have positional information, which is implied. So you know that in the first position of, let's say, column A you find the first tuple, the value of A for the first tuple, and in the first position of column B you will find the value of B for the same tuple, and so on. And this simple function here gives you information about how to browse there. So the only thing that you need is the starting point of the array and the width of the values, and then you can quickly jump to the proper position if you're looking for a specific value. Now, this is how typical query processing happens in a column store. So we have this query here with two selections and one aggregation. So we bring the first column into memory, we do a quick selection, and then the result is a set of qualifying IDs. These are the positions of the values that qualified. So we materialize -- this is bulk processing -- so we materialize this set of IDs. And now we're going to do a tuple reconstruction action. So using these IDs we go and fetch the qualifying B values. This is what we call late tuple reconstruction. And then we have the B values, we can run the second selection, and then we do another tuple reconstruction action, because we need the C column. And then we can do our aggregation. So these little tuple reconstruction actions happen very often in query plans, and you have to be very efficient about them. And column stores are efficient about it by making sure that you can basically always do sequential or skip-sequential access, by having ordered IDs for intermediate results. So it looks a little bit like this. So at the top of the slide you see a set of row IDs. You can think of this as an intermediate result. And it is ordered. So once this is ordered, you can do a skip-sequential access, which can be efficient. If it's not ordered, and this can happen after operators like joins, then you have to do random access. Of course, what you can do is just sort this intermediate result and then again do skip-sequential access. Now, for cracking the thing is that even after selections, because we have reorganized the data, this row ID list is again unordered.
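For reference, here is a toy sketch of that standard late-reconstruction pattern with ordered ids; the arrays and the predicates are made up, and a real column store of course operates on much larger, disk-backed arrays.

```python
import numpy as np

# Toy column store: each attribute is a fixed-width dense array and the tuple
# id is simply the position, so reconstruction is just array indexing.
A = np.array([5, 12, 7, 30, 12, 3, 25, 12])
B = np.array([1.5, 0.2, 9.9, 4.4, 0.7, 8.1, 6.6, 6.0])
C = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# Hypothetical query: SELECT sum(C) FROM R WHERE A = 12 AND B < 5
ids = np.flatnonzero(A == 12)   # selection on A -> sorted qualifying positions
b_vals = B[ids]                 # late tuple reconstruction: fetch B at those positions
ids = ids[b_vals < 5]           # second selection; the id list stays ordered
print(C[ids].sum())             # final reconstruction for the aggregation -> 70
```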
So we gain a lot of time, we gain a lot of benefit in the selection, but then when we go to tuple reconstruction, especially if you have more and more columns in the query plan, this row ID list being unordered becomes a problem. And this is an example of that. So here we run a more complicated query than the ones I was showing you before. So here we have a few selections on two different tables and we have a join and multiple aggregations, so we're using quite a few columns. Here the left-most graph is total times, and then the two other graphs basically split the tuple reconstruction time before and after the join. But the main point that I want to show is that -- okay, you see the red curve, which is plain MonetDB, performs quite stably, of course, because it always relies on scans and has nice sequential access for tuple reconstruction. Then you see this green curve here, selection cracking: it nicely improves at first, so you run the first queries and it improves, but then, as you make the order more and more random -- because you continuously reorganize the initial columns -- you suddenly have more and more random access patterns, and then performance becomes even worse than the initial MonetDB performance. Then this blue curve is sideways cracking, which, as you can see, solves this problem, and I will talk about this technique now. And this curve here is, let's say, the perfect performance that you're going to get in a column store. So you may know from column stores the idea of projections, which is that you take a copy of a table, you completely sort it based on one column, and then you propagate this sort order to the rest of the columns, and then basically you have a perfect index, but whether it helps purely depends on the queries, on the workload. So if you have prepared that up front, then performance can be very fast, because you rely on binary searches. In this case, for example, it takes 12 seconds to prepare this index, but then it's super fast. So let's see how this sideways cracking idea works. So, again, I'm going to run an example similar to the initial one. So we have these two columns, and this query here refers to two columns of this table: a selection on one column and then a projection on a second column. So the difference here is that now we don't work on one column at a time but we work on two columns at a time. So we call these cracking maps -- something that maps from column A to column B. And we do exactly the same reorganization as before. So we create this sort of range partitioning depending on how queries arrive, but now we work on two columns. So basically we organize based on the head column -- the head column in this case is attribute A, the column that the selection refers to -- and the tail column, which is the projection column, follows this order. So we don't have to do any tuple reconstruction, we don't have to do any expensive join, especially with random access; we can just find the B values -- they're right there after we do the reorganization. And it's exactly the same as before. So another query comes, a different request, a different range, and we keep reorganizing this thing and we keep carrying the tail values as well, so we get them for free on the tail of these cracking maps. And, again, as before, this is practically the select operator, only that, of course, it's not really a select operator anymore. We now call it select-plus-project, because what you get in the end is the B values.
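A minimal sketch of one such cracking map, with made-up data: the head attribute A is cracked on the range predicate and the tail attribute B simply travels along, so the projected values come out contiguously with no positional join.

```python
def crack_map(pairs, low, high):
    """Toy sideways cracking map M_AB: 'pairs' holds (A, B) tuples, the map is
    physically reorganized on the head attribute A, and the tail B follows,
    so the qualifying B values end up contiguous and are returned directly."""
    left   = [(a, b) for a, b in pairs if a < low]
    middle = [(a, b) for a, b in pairs if low <= a <= high]
    right  = [(a, b) for a, b in pairs if a > high]
    pairs[:] = left + middle + right
    start, end = len(left), len(left) + len(middle)
    return [b for _, b in pairs[start:end]]      # projected B values, for free

m_ab = [(13, 'b1'), (4, 'b2'), (9, 'b3'), (16, 'b4'), (12, 'b5'), (2, 'b6')]
print(crack_map(m_ab, 10, 14))    # ['b1', 'b5'] -- B values where 10 <= A <= 14
```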
Now, internally, if you have multiple of these cracking maps -- because queries do not just refer to two attributes but to multiple attributes -- what you have to do is make sure that these maps are aligned. If they're aligned, then you can have correct query results. So not having aligned maps means that -- let's say we reorganize this map MAB; then if you want to use MAB and MAC in the same query plan, we can never have correct results, because our alignment is wrong. I mean, we might be fast, but internally we would have wrong results. So what we do, again, is we don't do joins -- we don't do joins in the sense of traditional tuple reconstruction in column stores. Instead, what we do is we keep a history for each map where we mark what kind of reorganization actions we have to play on this map to make sure that it is aligned with the rest of the maps. So practically there's a central history of cracking actions per map set, and a map set is all the maps that have the same head column, so they are reorganized based on the same column. And then each map has a pointer into this history, and all we have to do is replay this history at the proper time, on demand, to make sure that the two maps, or more maps, are aligned. So basically in this case we would take exactly the same reorganization as we did in MAB and do it in MAC, and maybe in the rest of the maps as well if we need them for the same query plan, and then everything is aligned. Now, I'm going to show you a more detailed example of that. So let's say we have these three columns here and we run three queries. The first query is on attributes A and B. So we go ahead and create this map MAB and we crack it based on this first range predicate. Then another query comes. Now it's again on A and B, but a slightly different range request -- so we do again the -- oh, no. Okay. I was confused. The second one, this here, is on A and C. So we create a different map, MAC, and we crack it, again independently. And then another query comes that requests both B and C in the same query plan. Again, a selection on A, but now it requests both B and C. So we're going to use maps MAB and MAC to do that, and independently they produce correct results, but once we try to merge the results together, the alignment is wrong. So this is what it would produce, and you see that basically the result contains slightly out-of-order tuples. So the first tuple says B4, C6, which is not correct. The information there is correct, but it's not in the proper order. So you can think of doing things like plugging in some extra columns with some IDs and positions and then doing some reordering, but that turns out to be too expensive. Now, another approach that one could think of is that, well, the first time that you have to create, let's say, MAB, you also create MAC, and then everything will be fine. But that goes completely against the motivation for using a column store in the first place, because the first query, for example, requests and wants to load only A and B. Now, if you go ahead and also load C and D and whatever other columns this table has, then you're defeating the purpose of using a column store and having better I/O and things like that. So our solution is that we have this history, these steps I was talking about, and we try to adaptively replay cracking actions across maps.
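Here is a hedged sketch of that history-replay bookkeeping, reusing the crack_map function from the sketch above; the class and its fields are my own illustration of the idea, not the actual implementation.

```python
class MapSet:
    """Sketch of adaptive alignment for one map set (all maps with head column A).
    'history' records every range predicate ever applied to the set; each map
    remembers how far into the history it has replayed.  Before a query uses a
    map, the missed cracks are replayed, so every map in the plan has been
    reorganized by the same predicates in the same order, i.e. is tuple-aligned."""

    def __init__(self):
        self.history = []        # list of (low, high) range predicates on A
        self.maps = {}           # tail attribute -> list of (A, tail) pairs
        self.replayed = {}       # tail attribute -> position reached in history

    def add_map(self, tail, pairs):
        self.maps[tail], self.replayed[tail] = list(pairs), 0

    def select(self, tails, low, high):
        """One range selection on A, projecting the given tail columns."""
        self.history.append((low, high))
        results = {}
        for tail in tails:
            # Replay every crack this map has missed, including the new one;
            # the last replayed crack yields this query's qualifying tail values.
            for lo, hi in self.history[self.replayed[tail]:]:
                results[tail] = crack_map(self.maps[tail], lo, hi)
            self.replayed[tail] = len(self.history)
        return results

s = MapSet()
s.add_map('B', [(13, 'b1'), (4, 'b2'), (9, 'b3'), (16, 'b4')])
s.add_map('C', [(13, 'c1'), (4, 'c2'), (9, 'c3'), (16, 'c4')])
print(s.select(['B'], 3, 9))         # only M_AB is cracked now
print(s.select(['B', 'C'], 8, 14))   # M_AC first catches up on the crack it missed
```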
And let's see how this works. So now this is exactly the same example as before, but it has some extra steps. So the first extra step is that when we run the second query, before we do the steps that we did normally, we first do this operation. And this operation is basically cracking MAC using the predicate of the first query, and then we can do the predicate of the second query. So actually this action is not needed for this query now, but it will be useful later, because you want to have things aligned. So when we go to the next query, this is the extra step now. So before doing this step that we normally do, we first -- we will use MAB, but we crack it based on the predicate of the second query first, and then both maps have been cracked with the same predicates in the same order. So when we reach this state here, MAB has been cracked first with A smaller than 3 and then with A smaller than 5, and MAC has been cracked first with A smaller than 3 and then with A smaller than 5. So they have been cracked with exactly the same range predicates in exactly the same order, so they are aligned. They have tuple alignment. And then we can go ahead and apply the third query's predicate, and then we get the correct alignment for the result. >>: So aligning [inaudible] >> Stratos Idreos: Well, in this case you have to do some extra work for the alignment. I will show you how we minimize that. Basically you always have to replay the tape: if you are a bit, let's say, in the past in the history, by 10 queries -- we ran another 10 queries on another column -- you have to replay these 10 queries on the column you need now. That's the only way they are aligned. Other ways that we haven't really played with are that you can create copies, multiple copies, and you have multiple synchronization points in there. But that needs some extra storage. >>: So is that the only -- let's say you don't want alignment. Can you sort it? Is there another way of -- >> Stratos Idreos: You can -- >>: Is there a cost-based version of this? Let's say you have to catch up on too many [inaudible]. Would you ever have the system say not to do alignment? Is there even a feasible query plan that does not align? >> Stratos Idreos: I see what you're saying. No, we don't have this option. So we always do this kind of thing. We just make sure that -- and I'll show you later how to do this more efficiently. Because if indeed the history is too big, then you can have a big cost trying to align correctly. But there are ways to minimize that; but, no, we don't have that choice. >>: Do you ever do the normal thing of -- or say the historical thing, which is just to sort the record IDs with the data? >> Stratos Idreos: Yeah, yeah. But this is more expensive. So we've tried that, sorting and also [inaudible] clustering, because that clustering is cheaper than sorting and you still get nice access patterns, but still that's more expensive. So there are comparisons about this in the SIGMOD paper. Okay. So let me now show you how you approach these queries when you have multiple selections. So actually for multiple selections you can think of creating wider maps, but then there are too many combinations of that. So what we end up doing is using these single map sets, which means that we have only these binary maps with the same leading column, and using bit vectors to organize things. So let's run an example again. We'll look at it by example.
We have these four columns here, and a query that does three selections and one projection. So the first operator will create map MAB, and it will use the range request on A, on column A, to crack this map. So you reorganize this map based on A, and this is the reorganization. So this is the area that the A predicate requested, in the middle. And then you have the B values on the tail. And you can scan and filter out which B values qualify, and then you create a bit vector which maps exactly to this area and points to the qualifying B values. So the extra step here is that we have to scan the tail for this particular area and create the bit vector. Then the second operator will create map MAC if it doesn't exist, and the first thing that it will do is, again, run the A predicate to crack this map. So again you create this area here based on the predicate on A, and then this bit vector here shows exactly which C values qualify so far from A and B. And then you can go and take exactly these C values, and depending on whether they qualify for the new C predicate, you update the bit vector. And the last operator, to answer this query -- this is the operator that wants to take out the D values -- uses map MAD. Again, it will crack MAD based on the predicate on A, and then the bit vector of what has qualified so far reflects exactly the qualifying D values. And then you have the result. And in addition, you can use the index to infer some optimization information, like which map set you want to use. Here I'm using the map set of A, so the maps that all have A as the leading column, but it might be more beneficial to use the one of B or C. So if you actually have one of them around, you can do a quick search on the index and see which one has more partitioning information, say, or infer selectivity and things like that. >>: [inaudible] >> Stratos Idreos: Yeah, that's what I said just now. If you have that, if you have the BC, BA maps lying around, then you can do a quick search on this index and see which ones you want to use, but if you don't have them, we just create those. Okay. Let me quickly talk about partial cracking as well. So normally we create these full-column maps, and then if we run out of storage we have to drop them and create new ones. So what we do with partial cracking is that we basically selectively, partially materialize these cracking maps. And it looks a bit like this. So you selectively go ahead and create and drop little pieces of these maps depending purely on what the queries want -- you take out exactly the values that the queries want. And here's a representative performance graph where we have a limited storage threshold and we continuously change the workload, which means that you have to drop some of your indexes and create new ones as the workload changes and you don't have more space. And the x axis indicates how often the workload changes, so on the right is when it changes basically for every single query, and the y axis is the cumulative response time to run a thousand of these queries. The blue curve is the full maps, which suffer quite a lot, because you have to drop the complete index, create the complete index again and start cracking from scratch, while for partial maps, of course, you have a much finer granularity and you can be much more effective there, because you don't have to drop the complete indexes; you hit the threshold much later, and so on and so forth. Okay.
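A toy sketch of the partial materialization bookkeeping, under heavy simplifications of my own: fetched value ranges are tracked in a simple list, nothing is ever dropped, and the in-memory piece is sorted for readability instead of being cracked.

```python
class PartialColumn:
    """Toy partial materialization: the base column stays on 'disk' (here just
    a Python list) and only the value ranges that queries actually touch are
    copied into memory.  The real design also cracks within each piece and
    drops pieces with LRU when the storage threshold is hit."""

    def __init__(self, base):
        self.base = base          # the full column, conceptually on disk
        self.fetched = []         # value ranges already materialized
        self.in_memory = []       # the partially materialized data

    def _covered(self, low, high):
        return any(lo <= low and high <= hi for lo, hi in self.fetched)

    def select(self, low, high):
        if not self._covered(low, high):
            # Fetch from the base column only the values not yet materialized.
            self.in_memory.extend(v for v in self.base
                                  if low <= v <= high and not self._covered(v, v))
            self.fetched.append((low, high))
        # Sorting here only makes the toy output readable; the real system
        # cracks the in-memory piece instead of sorting it.
        return sorted(v for v in self.in_memory if low <= v <= high)

col = PartialColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3, 14, 11, 8, 6])
print(col.select(10, 14))   # materializes only the values in [10, 14]
print(col.select(12, 16))   # fetches only the missing part of the new range
```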
Okay. Let me quickly talk about partial cracking as well. So normally we create these full column maps, and then if we [inaudible] we have to drop them and create new ones. So what we do in partial cracking is that we basically partially and selectively materialize these cracking maps. And it looks a bit like this. So you selectively go ahead and create and drop little pieces of these maps depending purely on what the queries request -- you materialize exactly the values that the queries want. And here's a representative performance graph where we have a limited storage threshold and we continuously change the workload, which means that you have to drop some of your indexes and create new ones as the workload changes and you don't have more space. The x axis indicates how often the workload changes, so on the right is when it changes basically for every single query, and the y axis is the cumulative response time to run a thousand of these queries. The blue curve is the full maps, which suffers quite a lot because you have to drop the complete index, create the complete index again and start cracking from scratch, while for partial maps, of course, you have a much finer granularity and you can be much more effective there, because you don't have to drop the complete indexes; you hit the threshold much later and so on and so forth. Okay. And, again, a more fine-grained example of what this looks like. So, again, we have our columns, and this is what a partial sideways cracking index looks like. So here we go ahead and materialize only the small piece that the query requested, and then we crack and index this small piece, and we have enough information to know what we've indexed and what we haven't indexed so far. Then when queries come that want to refine this index information, we'll go ahead and reorganize the small pieces. When queries come and request even more data, we'll have to fetch this data, but we'll have to make sure that everything stays aligned as well, so we have to go through this [inaudible], which is a [inaudible] model that coordinates whatever happens with all the maps with the same leading head. And, again, when different attributes are requested, again you have to make sure that everything is synchronized, and also different maps can have different portions of data materialized. So this will be the last slide of technical stuff. I will show you how partial alignment works. So in general you have all these pieces, and the red part here indicates what the actual query requests, a range request. You have to crack at most the two pieces at the edges of this range request, because everything in between is already okay for this range request. So think of this as a column: the pieces are ordered relative to each other, but inside each piece there's no order. A new request comes, let's say it falls here, so we have to crack these two pieces. Now, if you think of how this happens with multiple map sets, where you also have to worry about alignment, well, the first observation is that, thinking in an adaptive way, there is no reason to align pieces of the data that are not used in this query. So the first thing that we do is go and align the pieces that are actually used in this query. The second thing is that you only need to do full alignment for the actual pieces that you're reorganizing, because only these pieces are affected by the reorganization. So full alignment means you have to replay the whole history. For everything in between you only need to do partial alignment, and partial alignment means that you only have to make sure that the particular columns you're going to use for this particular query, for this particular value range, are aligned. So basically you take the set of columns that this query refers to, you find their common place in the history, and you align up to that point. So then alignment looks a bit like this. Depending on how often the workload changes, if it changes too often, then it doesn't make any difference, but if it changes in big steps -- so this is every hundred queries, every 200 queries -- then using full alignment you can see the spikes, but if you use adaptive alignment like this, then things become smoother.
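The edge-piece argument can be sketched like this (again an illustrative Python toy, with invented names such as crack_piece and range_request, not the kernel code): given the sorted crack bounds of a column, a new range request only has to split the piece containing its low bound and the piece containing its high bound, because every piece strictly inside the range already qualifies in full.

    from bisect import bisect_right

    # A cracked column as a list of pieces: values inside a piece are unordered,
    # but the pieces are range-partitioned by the sorted `bounds` list.

    def piece_of(bounds, value):
        return bisect_right(bounds, value)      # index of the piece holding `value`

    def crack_piece(pieces, bounds, idx, bound):
        # Split piece `idx` at `bound` (values < bound go left) and record the bound.
        lo = [v for v in pieces[idx] if v < bound]
        hi = [v for v in pieces[idx] if v >= bound]
        pieces[idx:idx + 1] = [lo, hi]
        bounds.insert(idx, bound)

    def range_request(pieces, bounds, low, high):
        crack_piece(pieces, bounds, piece_of(bounds, high), high)  # crack the high edge first
        crack_piece(pieces, bounds, piece_of(bounds, low), low)    # then the low edge
        # Only those two pieces were touched; the slice below is the full answer.
        return pieces[piece_of(bounds, low):piece_of(bounds, high)]

    pieces = [[2, 1, 0], [4, 3], [8, 6, 5, 7, 9]]   # earlier cracks at 3 and 5
    bounds = [3, 5]
    print(range_request(pieces, bounds, 2, 7))       # [[2], [4, 3], [6, 5]]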
And here is a more representative graph of the whole performance, and this is a TPC-H query. The x axis is running multiple instances of this query, random instances, and the y axis is response time. So the red curve is plain MonetDB, the green curve is selection cracking, and the purple curve is presorted MonetDB, so creating the proper -- the perfect projection, the column-sorted projection, which takes anything between three and 14 minutes depending on the TPC-H query. But then, of course, performance is extremely nice. And the blue curve is sideways cracking. So you see that the first query with sideways cracking takes less than a second, basically, while in this case it takes whole minutes. And already after five queries, performance reaches optimal levels, okay? It nicely approaches the performance of the full index. And then already after a few -- 10 queries, 15 queries -- performance nicely stabilizes at these levels, which is because the workload is quite stable in this case. You always request similar ranges, so the index adapts very nicely after just a few queries. So without having to spend that time up front, like here, and without having to know which queries and which columns are useful, you can have excellent performance. And now I'll close with a couple of slides on what is next, basically, and this is also related work, and most of the work on indexing has come from this building or this area. Now, what cracking brings to this area of physical design is that it pushes all this functionality inside the kernel, and it proposes a way to do this in a partial and incremental way inside the kernel without [inaudible] tools. So if we see this from a high-level point of view, this is how offline indexing works. You do things one step at a time. So first you do the analysis, then you do the index building, then you do the query processing. In online indexing you can mix things. So while doing query processing you can do analysis and index building, but then query processing, of course, suffers. And in adaptive indexing there are no separate steps. Everything happens in one step inside the kernel. And we can also see this chart regarding idle time and workload knowledge. So offline indexing needs a lot of idle time and a lot of workload knowledge; online indexing is somewhere in between, because you can do things while processing queries; while adaptive indexing needs basically zero idle time and zero workload knowledge. And now I would like to end with what I think is a nice bit of future work. Basically I'm not claiming that cracking is the ultimate solution, but using all these techniques and all the capabilities that you get from each one of them, you might be able to build a system that exploits every bit of idle time you have, exploits every single bit of workload knowledge you have, whether you create or get this knowledge offline or online, and also does things partially and adaptively inside the kernel such that it doesn't impose any overhead at query time. And these are some of the topics that are currently open. The bold ones are the ones that I'm quite actively working on right now: disk-based cracking, concurrency control, and auto-tuning tools. And for the last slide I would like to point you to what I think are two very nice papers. The first is about initialization [inaudible]; it's not really about indexing. The first thing you have to do is load the data into the database. Again, think about examples where it really makes a huge difference to be able to query your data very fast. And we have this very nice, I think, paper with people from EPFL where we take similar ideas to what we do with database cracking, but we move them to the loading part. So you can query a database without the data even being loaded, and adaptively bring the data inside the database, index it, and so on and so forth, based on the queries.
And then even after you've done the loading, while you're dealing with [inaudible] sets you don't necessarily need only exact answers. So we have this nice paper in PVLDB, which also won the best paper award in the vision sessions, where we show how you can adapt the actual kernel in the same way that we do with database cracking, bringing the functionality of approximate query processing all the way down to the database kernel. So you can change, let's say, your algorithms for approximate query processing without doing any [inaudible], let's say. Anyway, these are our indexing papers, if you want to take a look, and with that I'm going to close. Thanks. [applause]