>> Ravi Ramamurthy: So good morning. It's a pleasure to have two visitors
today, Themis and Stratos, and both of them are ex-interns of MSR. Themis is
currently a faculty member at the University of Trento, Italy. He got his Ph.D. from the University of Toronto. He works broadly in the area of
event processing, he got a best paper award at ICDE 2009, and he's going
to be the co-general chair of VLDB 2013. VLDB 2013 is in Italy, so that's
good news for us.
Stratos Idreos is from CWI. He got his Ph.D. working with Martin Kersten,
and his thesis won the SIGMOD doctoral dissertation award this year.
And he's continuing that as a faculty member, and he's going to be talking today
about adaptive indexing, and Themis is going to be talking about indexing also,
but indexing time series.
>> Themis Palpanas: Thank you very much. Thanks for the invitation and for
organizing this, Ravi.
So this is work on indexing and mining a billion time series, and a billion in the
particular case of time series is a big number since previous works have only
considered up to 10 million time series. So this is joint work with my student
Alessandro Camerra from Trento and also Jin Shieh and Eamonn Keogh from
the University of California, Riverside.
I will still try to end my talk by 10:15 so that will give time to Stratos as well.
Please feel free to interrupt me if you have questions.
So this is all about time series. And what is a time series? A time series is just
sequential data points measured over time. And time series are ubiquitous.
So there are time series in finance, in any kind of scientific experiment and
domain, and we also have time series in some unexpected domains. So, for
example, motion capture, gesture recognition in video analysis, as well as time
series that describe the contours of shapes. In this particular case we can
transform the contours of these shapes into time series and then use time series
techniques to identify similar shapes regardless of the scaling, rotation, et cetera,
of these shapes.
So there's a pressing need to index time series data, and the reason that we
want to do this is because it is only by indexing them that we can actually mine
this data fast.
So the basic operation that we want to support very fast is similarity queries
on time series. And this is going to be the focus of the index that I'm going to
talk about now.
If you actually can support this operation very fast, then you can also support all
different kinds of mining operations on a time series such as classification,
clustering, deviation detection, trend analysis and so on and so forth.
The issue here is that the time series collections that we're going to index and
then mine can grow extremely big. So these are two examples, one from
General Electric where they want to monitor the operation of gas turbines that
they operate in different parts of the world. In this case they talk about a million
samples per minute.
Another example is from the Tennessee Valley Authority. This is a company in
the eastern part of the United States that monitors the health of the electrical grid
in the East Coast, and these guys talk about rates of 3.6 billion points per day.
So these are measurements over time, they are naturally time series, and these
guys want to actually at some point mine these time series, identify similar
trends, et cetera. So this is the basis of what we're going to be talking about.
So we want to index and then mine these time series, and, of course, there is
some very good news. So there are time series indices that have been proposed
in the literature, and one of those that's going to be the focus of this study is
iSAX. So iSAX is a recent index whose utility previous studies have demonstrated. So
I'm going to just talk about this particular index.
The bad news is that building this index takes too long. So, for example, there is
no study that uses more than 1 or 10 million time series, and in the particular
case of iSAX, if we're going to index 500 million time series, and in this case I'm
talking about time series of size 256 real values, it takes 20 days, which makes it
hardly practical for large collections.
So the contributions of our work are that we propose novel mechanisms for the
scalable indexing of these time series. So basically we propose a bulk loading
algorithm for this iSAX index that reduces the number of random accesses by
two orders of magnitude.
So obviously here the main problem is how to make sure that all your disk
accesses -- that most of your disk accesses are sequential.
And the second contribution here is the node splitting policies. With the new
method that we propose we can reduce the index size by one-third.
So this second contribution came about because the iSAX index, as I will discuss in
more detail later on, is not balanced.
So we have the first experimental evaluation with this kind of size of a time series
collection, 1 billion time series, which validates the scalability claims, and we also
have some case studies in diverse domains: Entomology, genome sequences,
and web images.
And I will just briefly talk about the second application, the genome sequences.
So a little bit of background on the iSAX. So the iSAX index is based on the
iSAX representation or summarization of time series. So if you assume that we
have this time series there with all the blue points, the first step in order to
produce the iSAX representation of this time series is to do a PAA
summarization.
So basically we produce segments of equal length, and each one of these
segments has as its value the average value of the points that it contains.
Then in the next step what we do is that we divide the y axis into different
regions, and for every segment that falls in one of these regions, we assign to
that segment a bitwise representation that encodes that region. So this particular
example, we have divided the entire space in four regions, and we need two bits
to encode those four regions.
So any segment that falls in this region is encoded with two bits, 1, 1. Any
segment that falls in this region is encoded with two bits, 0, 1, and so on and so
forth.
This division of the space is not equal because these time series are normalized,
so they have a mean of zero and a standard deviation of 1, and what we're trying to do
is end up with an approximately equal number of segments falling into each one of
these regions. So this is the basis for the representation.
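To make the two steps just described concrete, here is a minimal sketch in Python of PAA summarization followed by the bitwise discretization. The breakpoint values and function names are illustrative assumptions for four regions (two bits per symbol), not the actual iSAX implementation.

    import numpy as np

    def paa(series, num_segments):
        # Piecewise Aggregate Approximation: each equal-length segment is
        # replaced by the average value of the points it contains.
        segments = np.array_split(np.asarray(series, dtype=float), num_segments)
        return np.array([seg.mean() for seg in segments])

    def sax_symbols(paa_values, breakpoints):
        # Map each segment average to the region of the y axis it falls in.
        # With 3 breakpoints there are 4 regions, so each symbol needs 2 bits.
        return np.searchsorted(breakpoints, paa_values)

    # Breakpoints chosen so that a z-normalized (mean 0, std 1) value is equally
    # likely to fall in each of the four regions.
    BREAKPOINTS = [-0.6745, 0.0, 0.6745]

    ts = np.random.randn(256)                     # one time series of length 256
    print(sax_symbols(paa(ts, 4), BREAKPOINTS))   # e.g. [2 1 3 0]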
And the next thing with this representation is that we go from the real value
domain to the domain of bitwise representation. Thus, we save lots of space.
And we can also have a representation of some time series with four segments
where each segment has a different cardinality.
So we can have a varying level of detail for the representation of each
segment. And all this is naturally supported by iSAX.
Now, what does the index look like? It is a tree which is not balanced, and that's
what I'm trying to show here, and in this tree we start out by a very coarse
representation of the time series. And, of course, as we go down the index, then
we -- then we refine this representation.
And obviously, using this structure, we can quickly search for similar time
series, and we prune the search space. So, for example, if I search for a time
series that has a representation that is somewhere here, so it has this kind of
shape, I will go down this path and I will exclude the other paths. So, for
example, paths that represent this kind of time series, right? And this is what
makes that fast.
Of course, we end up with the exact solution in this case, right? One important
thing here is that with the iSAX index what we do is that we store the entire index
in main memory, and this is feasible because we use the bitwise representation
for time series in all internal nodes, and in the main memory we only have this
bitwise representation for every internal node.
For the leaf nodes, these nodes are stored on disk, and apart from the bitwise
iSAX representation, they also have to contain the raw time series, all the
individual real values, because at the very end we have to access the raw time
series and discard any false positives from our similarity search queries.
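As a rough illustration of this search procedure, here is a hedged sketch in Python. The node fields and methods used here (children, matches, load_raw_series_from_disk) are assumed placeholders standing in for the real iSAX structures, not its actual API.

    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def approximate_search(root, query_sax, query_raw):
        # Follow the single path whose iSAX representation matches the query,
        # pruning every sibling subtree along the way.
        node = root
        while not node.is_leaf:
            node = next(child for child in node.children
                        if child.matches(query_sax))
        # At the leaf, read the raw time series from disk and discard false
        # positives by computing true distances on the raw values.
        candidates = node.load_raw_series_from_disk()
        return min(candidates, key=lambda ts: euclidean(ts, query_raw))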
So this is a wonderful index, but then the question is why it takes so long to index
large time series collections. And, of course, the answer is that iSAX implements
a naive node splitting policy. So when it splits a node into two, it may be the
case that all the time series of the node will end up in one of the children, and this is
what we try to address. And the most important thing is that it implements no
bulk loading strategy.
>>: So this looks like a data reduction followed by two-dimensional indexing.
Is that the right --
>> Themis Palpanas: This is not two-dimensional because you may have
several segments, right? And the number of segments, which is the
dimensionality, may also be on the order of tens or dozens. And this is why
traditional structures do not work very well here.
Actually, some studies have shown that in several cases it pays off to just do a
linear scan.
>>: But do you do the data reduction before you load the data or do you do that
as you go?
>> Themis Palpanas: As we go.
>>: Okay. So the question I was trying to think of is do you know how you're going
to reduce the data? Do you know how you're going to map it?
>> Themis Palpanas: Yes, we do.
>>: So you know the bands that you're going to map to your bits ahead of time?
That's not something you find --
>> Themis Palpanas: Yes, we do. We do, yeah. So this part is fixed. Okay.
So let me first say a few things about the bulk loading algorithm. So the design
principles here are that we're going to take advantage of all the available main
memory and maximize the number of sequential disk accesses.
The intuition is that we want to group all the time series that are going to end up
in the same leaf node together and write them to disk at once with a number of
sequential accesses.
And the reason why we cannot do this easily is that when you have 1
billion time series, all the time series that are going to fall in the same leaf
node may be dispersed throughout the collection.
So if I had a way to precluster all my time series according to the leaf node, then
I would solve this problem.
And actually several of the bulk loading strategies for traditional indices are based
on exactly this idea, that you do some kind of preclustering of your data and you
then bulk load the index. This cannot work here because in this domain it's
extremely hard to cluster unless you have an index. So the index helps you do
the clustering fast. So we cannot follow that approach.
And here is the high-level description of what we propose. So we propose an
algorithm that works in two phases. During phase one we read time series -- we
read in the time series and we group them according to the first-level nodes. The
first-level nodes are just below the root -- they are the children of the root.
And in this operation we use the entire available main memory. When we have
exhausted the main memory, then we switch to phase two. During phase two we
process together all the time series that are contained in the same first-level node.
So we grow the subtree below that node up to the leaf nodes and we flush to
disk.
At the end of phase two we have processed the time series contained in all the
first-level nodes, we have flushed it all to disk, we have freed up all the main
memory apart from the main memory used by the index which is a small portion,
and we go back to phase one.
So let me show you what this looks like.
So this is the root node and these are the first-level nodes, children of the root,
which have not been materialized in main memory yet. Right? And this is why
they are shown dashed.
So we introduce this first buffer layer which is a layer of buffers. We have one
buffer connected to each one of these first-level nodes. These buffers do not
have a fixed size, so they grow as needed according to how many time series
are routed to each one of these nodes.
So what happens is that when time series come in, we just route them to the
corresponding node. Instead of inserting them in the actual node, which in this
case would be a leaf node, we just insert them in the buffer.
When we have used up all the memory, then we go to phase two. In phase two
we have this leaf buffer layer which is another layer of buffers. And in this case
each buffer corresponds to each one of the leaf nodes in the tree.
And the size of each leaf buffer is the same as the size that the leaf node has.
So it contains, for example, 1,000 time series.
So this picture here shows what has happened when we are midway through
phase two. So I have already processed these two buffers. I have grown this
part of the tree and flushed everything to disk. So let's see what happens here.
So in this case I have a whole bunch of time series in this buffer. I process all
these time series and I grow the subtree rooted at this node.
So I keep inserting time series here. Once I exceed the limit of 1,000 time series,
I split this node into two, I divide this time series, and so on and so forth.
At the end when I'm done inserting all the time series from the first buffer layer to
the leaf buffer layer, I'm going to flush those to disk.
Now, the interesting thing here is that when I do this, I need no extra memory
because I just moved time series from one position in the memory to another.
Right? So I'm still making good use of the memory. And
then, of course, I just flush everything to disk.
At this point I have released all the main memory and I'm ready to start the first
phase again.
Okay. And these, of course, are mainly sequential writes.
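Here is a compact sketch, under stated assumptions, of the two-phase bulk loading just walked through. The index helpers used here (first_level_key, subtree, insert, flush_leaves_sequentially) are hypothetical names standing in for the actual implementation, and the memory accounting is simplified to a count of buffered series.

    from collections import defaultdict

    def bulk_load(series_stream, index, memory_budget):
        # Phase 1: route incoming series into one growable buffer per
        # first-level node, using all available main memory.
        buffers = defaultdict(list)
        buffered = 0
        for ts in series_stream:
            buffers[index.first_level_key(ts)].append(ts)
            buffered += 1
            if buffered >= memory_budget:       # main memory exhausted
                flush_phase_two(index, buffers)
                buffers.clear()
                buffered = 0
        flush_phase_two(index, buffers)         # flush whatever is left

    def flush_phase_two(index, buffers):
        # Phase 2: one first-level node at a time, grow its subtree down to the
        # leaves in memory, then write those leaves to disk sequentially.
        for key, series_list in buffers.items():
            subtree = index.subtree(key)
            for ts in series_list:
                subtree.insert(ts)              # may split leaves as they fill up
            subtree.flush_leaves_sequentially()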
Let's also say a few things about the splitting policy that we use.
So the design principles here are that we want to keep the index small, and
when we split some node we want to divide the time series pretty much equally
between the two children nodes.
And so the splitting happens according to the iSAX representation of the time
series, right? And the intuition for the solution that we propose is that we want to
split that segment of the representation for which the iSAX symbols of the time
series will fall almost equally to the two sides of the breakpoint.
So let me show you a picture of how this works. Assume that we have time
series and we represent them using four segments. If I use a cardinality of two --
a cardinality of one bit -- for the iSAX representation, this means that I only have
two regions, 0 and 1.
Let's say that these are all the time series, so each combination here is one time
series in my node. And my node [inaudible] uses one bit of cardinality to
represent this time series.
Now, I want to increase this cardinality by one. So if I use cardinality two, then
this means that I have now four regions to divide the time series. And we split
the segment for which the highest cardinality symbols lie on both sides of the
breakpoint.
So, for example, in this particular case, if I decide to split this segment, then all
the time series will end up to one of the children.
If I split this segment, on this segment, then I will have an approximately equal
number of time series going to one child and to the other.
And it turns out that we can actually do this pretty efficiently just by recording the
first two statistical moments, which we can do efficiently in an online
fashion. So we record the mean and the standard deviation, and then what we just have
to do is pick the segment to split for which the breakpoint, the [inaudible] line, is
within this range of mean plus or minus 3 standard deviations and closest to the
mean.
So this is obviously a heuristic which tries to capture the [inaudible] that I
mentioned earlier that we want to have time series that fall to both sides of the
breakpoint. And as it turns out, as I will show later on, this works
pretty well.
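A minimal sketch of this splitting heuristic, assuming the per-segment means and standard deviations are maintained incrementally as described; names and calling convention are illustrative, not the actual code.

    def choose_split_segment(means, stds, breakpoints):
        # For each segment we keep the mean and standard deviation of its values
        # (the first two statistical moments, maintained online) and the breakpoint
        # that splitting it would introduce. Pick the segment whose breakpoint lies
        # within mean +/- 3 standard deviations and is closest to the mean.
        best, best_dist = None, float("inf")
        for i, (mu, sigma, bp) in enumerate(zip(means, stds, breakpoints)):
            dist = abs(bp - mu)
            if dist <= 3 * sigma and dist < best_dist:
                best, best_dist = i, dist
        return best   # None means no segment qualifies: raise the cardinality and retry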
So what happens if the breakpoint for no segment satisfies this condition?
In this case what we will do is keep increasing the cardinality of
the representation until we can satisfy this condition.
So this means that in our new approach we may end up skipping some steps of
the original iSAX index algorithm and move faster towards a good split.
So for the experimental evaluation, we implemented this in C# and tested it
using one big machine that had 24 gigabytes of memory and two terabytes of disk,
and also a small desktop with 3 gigabytes of main memory and half a terabyte of disk. We
compared against the original iSAX algorithm that was making use of all the
available main memory for disk buffering and also iSAX with BufferTree.
So BufferTree -- so this is the original iSAX with BufferTree. BufferTree is
another technique that has been proposed in the literature for bulk loading. This
was proposed for R-trees. And as you will see, this doesn't perform very well.
The reason is because BufferTree was proposed for a balanced index.
In our case, iSAX is not balanced, so BufferTree, which uses a fixed amount of
memory for specific levels of the tree, does not make good use of the main
memory, because it preallocates this memory and does not know where we will
need it most. And this -- using the memory where it is needed -- is exactly what we're doing.
So if we focus just on the benefits of the splitting policy, here's what we have. So
in this case we did some experiments with collections of time
series from 1 million up to 100 million time series.
In the left graph we show the index size. And we can actually see that the
new splitting policy results in 34 percent fewer nodes for the index. This is
because all of our splits are useful.
And, actually, the node occupancy also increases by 20 percent. So this means
that we end up with a more compact index, and this results in a smaller number
of leaf nodes, which translates to less disk IO.
So just by using the new splitting policy, the build time for these collections of
time series is reduced by one-third.
Now, let's see what happens with bulk loading. So previous techniques
obviously take too long, so it takes 20 days to index 500 million time series. This
is the original iSAX in red using all the available main memory for disk IO. This
other line here is the original iSAX with the BufferTree.
And as I was explaining earlier, we see that the performance is even worse, and
this is because the BufferTree preallocates memory in specific nodes in the tree.
Our tree is not balanced, so this does not work in our case.
So this is the result with the new index. So with the new index we can scale up to
1 billion time series in our experiments. We could not do the same experiment
with the original iSAX. In this case we estimate that it would have taken
approximately two months to index all these time series. In our case we
finished the job in 16 days, which is 72 percent less time, and this translates to a
time of indexing per time series of approximately 1 millisecond.
Let me also say that the reason for these results is mainly the disk accesses, that
we managed to not only reduce the number of disk IO but also make sure that
the vast majority of these disk accesses were sequential.
So in this graph here, for collections of 100 million to 1 billion time
series, we see that our index can reduce the number of disk page accesses by 35
percent. And almost all of the disk accesses that it does, so this is 99.5
percent of the disk accesses, are sequential, which makes a tremendous
difference.
And so we also did some experiments with some real data. I'm just going to talk
about this second case study that talks about genomic data. This is 22 million
time series of size 640, total size 115 gigabytes, and these are experiments on
the smaller -- the desktop computer.
In this experiment we wanted to identify mappings between chromosomes of two
different species, the human and the Rhesus Macaque. The Rhesus Macaque is a
species of monkey that is closely related to humans. Biologists are very interested in
these mappings exactly because these two species are so close to each other.
The problem with these mappings is that these two species have a different
number of chromosomes. So these mappings are not obvious.
So what we did is that we translated these DNA sequences -- the genome
sequences -- into time series, by chopping them up into small overlapping time
series of size 640.
And then what we wanted to do was try to see if the same or similar time series
occur in the chromosomes of these two different species.
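As a hedged illustration only -- the talk does not give the exact conversion -- one common way to obtain such time series is to map the bases to numeric steps, take the running sum, and slide overlapping windows of length 640 over the result. The mapping values below are assumptions for illustration.

    BASE_STEP = {'A': 2, 'G': 1, 'C': -1, 'T': -2}   # illustrative mapping only

    def dna_to_series(sequence):
        # Turn a DNA string into a numeric "walk": each base moves the running
        # total up or down, and the totals form the time series.
        value, series = 0.0, []
        for base in sequence.upper():
            value += BASE_STEP.get(base, 0)
            series.append(value)
        return series

    def overlapping_windows(series, length=640, stride=1):
        # Overlapping subsequences of fixed length; each one would then be
        # z-normalized and inserted into the index.
        for start in range(0, len(series) - length + 1, stride):
            yield series[start:start + length]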
And we ended up with this kind of picture that basically shows what is the
mapping between the human chromosomes and the monkey chromosomes. So
each line here shows one possible such mapping.
So a mapping is a subregion of a monkey chromosome which is
similar to a subregion in some human chromosome. And it's interesting to see
here that in some cases we have regions of the same chromosome of one
species mapped to two different -- to subregions in two different chromosomes in
the other species.
So here we're not basically claiming that we have solved some biological
problem, but the claim is that we can help the biologists do their research in this
particular domain.
So, for example, this picture shows the mappings that have already been verified
by biologists in their experiments. And if you observe these two, the mappings
that the biologists have identified are a subset of the mappings produced
automatically by our approach.
Once again, we're not claiming that we have solved this problem, but we can
claim that we can direct scientists to the interesting parts of their data so that they
can very quickly focus on what is a highly probable mapping in this case.
I'm going to skip this.
So once again, one of the take-home messages here is that this is the first
approach that can scale to such large time series collections, one billion time
series, and what we're trying to do here is enable practitioners and scientists to do
pain-free analysis of their existing data set collections.
Actually, we have some -- we're working on some improvements of our
technique, and our experiments show that we can get a 40 percent reduction in
the total build time. So this would mean an indexing time per time series of
approximately half a millisecond.
I would like to conclude with this thought. So what next? We think that the next
challenge will be to index 10 billion time series. And even [inaudible] that this is
not a matter of chasing numbers. This is all about enabling people to actually
analyze the data that they already have.
So we're working with some neurobiologists. So these guys have these big
machines and they do functional magnetic resonance imaging. So a person gets
in there and they analyze their brain. Whenever there is a stimulus to the subject,
then this machine produces an image that shows how the subject responds to
that stimulus.
Well, a single experiment, which means one subject in a single test, produces 12
gigs of data: 60,000 time series. 60,000 is the number of points in the brain
that we're going to talk about.
So we can individually focus on each one of these 60,000 points, each with a time series of length 3,000.
And what's even more interesting is that right now there's one competition.
There's a classification task to detect based on this kind of data whether a
subject suffers or not from ADHD. So in this case, for the competition, there are
776 subjects for a total size of 9 terabytes of data, and if you translate this to
what we've been talking about in this presentation, we talk about 4.5 billion
non-overlapping series or 1,000 billion overlapping series if you're going to do a
more fine-grained analysis.
What these people have been doing so far is that they are reducing each one of
these 60,000 time series to a single number and then trying to do the
classification.
And obviously there is much more in their data than that. So this is the
point that I'm making: we need to help them use all their data.
Of course, I agree that the parallelization helps, and there's a lot of room in doing
that here. Actually what we proposed is amenable to parallelization, but, once
again, this is all about how to most efficiently use each individual machine.
Right. So this is the end of my talk. Let me just say that this is where we're
located, at the center of Europe. We're open to collaborations, and if you're ever
in the area, we'll be glad to have you over for a visit to talk about research and
also for other reasons [laughter].
So thank you very much. Any questions?
[applause].
>>: What about scaling this out?
>> Themis Palpanas: So you mean to multiple machines?
>>: Of course.
>> Themis Palpanas: Yeah, we haven't looked at that yet. But what we've been
talking about is amenable to parallelization.
>>: Is there any technical problem? You just partition up the stream and you
build indices on multiple machines and then merge them at the end or do you fan
out queries at the end?
>> Themis Palpanas: Well, this talks about the index build part. So what I was
thinking about was to parallelize the first level of the index. So you have a
different machine cope with each one of the first-level nodes and their subtree.
I haven't -- we haven't really worked on that yet, but that is a very interesting
direction for sure.
>>: To what extent does the efficiency of the approach depend upon the partitioning
being reasonably even, reasonably uniform? Do you depend
on it in some way -- say you have sort of a preliminary index built to use to help
you with the subsequent partitioning. Are you depending upon the fact that that's a
good sample of what the data is? And what happens if the data is skewed in a
different way than that sample?
>> Themis Palpanas: Right. So this is exactly why using iSAX with BufferTree
suffered. Because basically this approach assumes a uniform partitioning of the
data which is not true in real life, and that's why it doesn't perform well.
In our approach we make no such assumption, and it doesn't matter what the
partitioning is or the skew of the data. So this has no effect whatsoever. As I
said, when describing the algorithm, we don't have a fixed amount of memory
assigned to each buffer in the first level, right? So these buffers grow as needed
according to what data come in. So there's no problem --
>>: But it could mean that your partitioning is not very effective, right? It could
mean that you have a whole bunch of partitions where modest amounts of data
go and an enormous amount of data goes to one partition --
>> Themis Palpanas: Sure. That's no problem.
>>: -- in which case your partitioning wouldn't have done you much good, right?
>> Themis Palpanas: Well, this will not affect the index build time. This may
affect searching, right? And then the question is whether this particular index --
that's not the right question -- whether the particular parameters used for this
index are appropriate or not for a particular data set. But this doesn't affect what
I have presented here.
>>: So the overlapping time series that you use to build an index [inaudible] -- have you
thought about taking advantage of this in some way to improve performance,
because there's so much commonality between these?
>> Themis Palpanas: Right. This is a good question.
So this goes back to whether you can actually cluster your time series or not to
start with. But this would assume that you have actually done or you know how to
do such clustering. In our case we don't assume that we know that. So we don't
assume that all the overlapping ones are together.
Also, another thing is that even if you do know, this overlap only has -- it only has
a limited advantage, because after a very short time you end up having time
series that are significantly different from each other.
Okay. Thank you very much.
[applause]
>> Stratos Idreos: Hi. I'm Stratos. I will talk today about database cracking and
[inaudible]. Database cracking was the topic of my Ph.D. thesis back in
Amsterdam, and this work is in general joint work with Martin Kersten and Stefan
Manegold, who were my advisors, but also Goetz Graefe and Harumi Kuno,
who joined us at some point. And actually part of the thesis is also
co-authored by them.
So what I'm going to try to do is give a five or ten-minute general overview of the
whole thing and then zoom in on one of the subtopics, one of the technical topics,
which is [inaudible], and it's one of the most important topics in column stores.
This work is in column stores. But for those of you who want to see more of the
other details, I have the slides to talk about this.
So in general, our problem is physical design, how to tune databases, how to
make it as automatic as possible. So nowadays, I guess you know, this doesn't
work out of the box. We really have to go and decide what kind of [inaudible] we
want -- which indexes do we want to create and when, on which of the data parts.
So this is typically the job of the DBAs and offline auto-tuning tools, but this is
quite a hard topic. And typically how it works is as follows: So you have this kind
of a timeline where you first collect workload information or get this somehow.
Then you analyze this workload information depending on your system, and then
you go ahead and you actually create a physical design. So this takes quite
some time, and only after this whole thing is finished, only then you can process
queries.
So our goal here is to basically try to make this much faster. So the question is
what happens with dynamic workloads where you can really break this timeline,
right? You might not have all the knowledge. You might not have all the time to
do this. And what happens with very large databases, like scientific databases,
for example?
So in difficult settings where you have terabytes of data coming in on a daily basis,
and we especially expect this in the future, making possibly a wrong choice
about creating an index can be a big cost, because it takes so much time to
create indexes.
So we identified idle time and workload knowledge as two of the critical
parameters that we are trying to work on. So idle time you need in order
to do the analysis and build the indexes, and workload knowledge you need because
you want to know on which data parts you want to build indexes.
So if you don't have this information -- and nowadays, if you think about it, with
social networks [inaudible] and different databases, you
don't really have this kind of information up front.
So what can go wrong? For example, you might not have all the time that you
need to finish the proper tuning. You might have some of the time, but not all the
time.
Second, by the time you fix the tuning, the workload might have changed. So
let's say scientists look at a database, at the new data that they receive from,
let's say, telescopes, and they realize that they're looking for something else, and
then they have to change the workload [inaudible].
Still, even if none of this is true, there's no indexing support during the tuning phase.
So if you really want to query your data while you wait for the
indexing, in the meantime you have to rely on very [inaudible] performance,
rely on [inaudible].
Another part is that currently our indexing technology does not allow to focus on
specific data parts. You still have to index whole columns. You can select
columns, but you cannot select parts of columns.
So database cracking has a vision of basically removing all physical design steps
and still getting performance similar to that of a fully tuned system, and then
you don't need DBAs.
And the way that we found out that we can approach this goal is that we go deep
inside the kernel and try to do that. So we started designing new database
kernels, which we call auto-tuning database kernels, and that involves really
starting from scratch. You have to think about operators, new algorithms, new
structures, new plans, new optimizer policies, everything from scratch.
So let's say -- well, there's no monitoring steps and there's no preparation steps
because we say we assume that there's no idle time. There's no external tools
because we say that takes too much time. Everything is inside the kernel.
There's no full indexes because we don't want to index parts that we're not
interested in, so we selectively index small parts of the data. And there's no
human involvement because, again, everything is automatic inside the kernel.
So what we do is continuous on-the-fly physical reorganization. So continuously
as the data arrives -- as the queries arrive, we continuously reorganize the data.
And you can think of this as partial incremental adaptive indexing. We partial
index, we adaptively index. Everything is driven by the queries.
And some of the techniques that we'll show you is fully designed for column
stores. Some of it should be possible in [inaudible] as well, but maybe with some
bigger changes.
So the main motto of database cracking is these two lines here: every query
that we get, we treat as advice on how we should store the data. So the
system sees the queries and says, okay, this is how to store the data, and
there are small changes continuously.
So let's see an example for that. And this is not how cracking works nowadays,
but this is the very first implementation, and I think it's a simple example to start
with.
So let's say we have this single column here in the column store. It's a fixed
width dense array. And this is a very simple query. So just [inaudible] predicate
requesting values between 10 and 14 of column A.
So to answer this query we will reorganize the column physically. So we'll take
all the values smaller than 10 and move them to the first part of the column, all
the values between 10 and 14 in the middle part of the column and then the rest
of the values which are bigger than 14.
So now we have created some sort of [inaudible] partitioning in there,
and as far as this query is concerned, we have collected the result values in the
[inaudible] area. So we can say that, whatever else it is, we just answered the query.
But more crucially, we gained some knowledge that we can exploit in the future
because there's some range partitioning in there.
Now, the important thing for database cracking is that this is not something that
happened after we processed the query, but this is how we process the query.
So this is the very first select-operator that we actually created for the cracking.
So the select-operator is a simple procedure that takes a single column
and a range predicate and then returns the column back reorganized, with all of
the values inside the range collected in a contiguous area.
And then, since this is bulk processing in a column store [inaudible], you give this to
the next operator and say, from this to that position of the array
you have the qualifying values, and you can go ahead with the rest of the query plan.
So now if you get a second query, again, this is a very similar query, so basically
it's just a one-operator query. You only have a select-operator; now we have a
different predicate, a different range request, but on the same column. So now
we can exploit the range partitioning in there.
And remember we're doing continuous physical reorganization, so we take now
the low bound which is No. 7, and we see that it falls within the first piece. So we
go ahead and, again, we crack the first piece using the low bound now. So we
take values smaller than 7, values bigger than 7, and we'll crack them to new
pieces.
Then we go to the middle piece and we realize that, well, this qualifies
completely, so we don't do anything with that. And now we're gaining some
speed because we don't touch this piece.
And then we go to the next piece and say that, well, this is where the high bound
falls in, and, again, it's not an exact match. We don't have an exact piece on this
value before. Now we crack this on value 16 and create two new pieces.
And the result is the same as before. We just collected the values in a
continuous area, so everything between 7 and 16 is now in the three middle
pieces, and more crucially, we gained even more knowledge about the future.
So now we have even more partitioning information, even more structure.
Now, if you generalize this and think how this evolves over thousands of queries,
then you have more and more of these pieces that become smaller and smaller.
This means that we can skip more and more pieces. And in general, with each
range request in the select operator, you only have to crack at most two pieces,
at the edges of the range that you request.
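To illustrate how the accumulated piece boundaries are reused, here is a small sketch that remembers boundaries in a sorted list and cracks only the piece that a new bound falls into. The data structure is an assumption for illustration; the real cracker index is more elaborate.

    import bisect

    class CrackerIndex:
        # Remembers the piece boundaries created by earlier queries, so that a
        # new range request only cracks the (at most two) pieces at its edges.
        def __init__(self, column):
            self.column = column
            self.bounds = []                     # sorted (value, position) pairs

        def _piece(self, value):
            # Locate the piece [start, end) whose value range contains `value`.
            i = bisect.bisect_left(self.bounds, (value, -1))
            start = self.bounds[i - 1][1] if i > 0 else 0
            end = self.bounds[i][1] if i < len(self.bounds) else len(self.column)
            return start, end

        def crack(self, value):
            # Partition only that piece: values below `value` move to its front.
            start, end = self._piece(value)
            pos = start
            for i in range(start, end):
                if self.column[i] < value:
                    self.column[pos], self.column[i] = self.column[i], self.column[pos]
                    pos += 1
            bisect.insort(self.bounds, (value, pos))
            return pos

        def range_query(self, low, high):
            lo, hi = self.crack(low), self.crack(high)
            return self.column[lo:hi]            # values in [low, high)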
So this becomes continuously faster and faster. And let's see an example here.
So here we have single selections. It's random selectivity and random value
ranges requested, and we can compare the plain scan which is the red here, so
basically just every query scans the column, and then the purple here is cracking
and the green here is full indexing. And by full indexing here I mean that
the very first query will go ahead and completely sort the column with quicksort,
and then you can just do binary search.
So the first query does quick sort and binary search, and then from the second
query on you just do binary search.
So observation number one is that the first query is of course much more expensive
for full indexing, while cracking is relatively comparable to the scan; typically
it's about 20, 30 percent slower.
So this means that the user doesn't really notice any difference, big difference,
for [inaudible] really comparable to the default performance that he would expect
from the system.
Then the more queries that you answer, performance becomes better, and the x
axis is the number of queries and the y axis is the response time, so you see that
the more we go to the right, the more queries that we answer, automatically
performance improves.
Of course, full indexing has super fast performance because we have already spent all
the time investing in creating the index.
And then eventually, after many queries, cracking nicely converges to the optimal
performance.
So the main message here is that without having to spend the full indexing time
up front, assuming that we don't have any knowledge and time to do this, of
course, we can still improve performance without a noticeable overhead for the
user.
Now, if we change the metric so the y axis now is cumulative average time --
and, again, the curves are the same techniques -- then we see that after 10,000
queries, still full indexing has not amortized the initialization costs.
So this means that if anywhere in between these 10,000 queries the [inaudible] --
so this column is not useful anymore for the workload -- then the whole indexing
effort up front was a waste of time.
And, of course, also let me note that these examples are completely random. I
mean, we really assume here that the users are interested in the whole space of
the column, the whole value range, which might not be the case. If you have a
skewed workload, and I will show you such a case later on with [inaudible], then
performance for cracking will improve basically instantly, and then this difference
will be much, much higher.
Okay. So I have some slides now for basically what we did over the years
with database cracking, and then I will zoom in on one of the subproblems.
So the first thing was selection cracking, which was basically the slides that I showed
you before: how you take the select operator and how you adapt the select
operator for continuous physical reorganization.
Then we started on updates. And updates are quite crucial here because we keep
changing the columns continuously. Basically, every read query, we
make it a write query. So what happens if you have updates as well?
And then we came up with some, again, adaptive algorithms to plug in updates
inside, again, the select operator. So updates happen in adaptive way in the
sense that when you get updates, you put them on the side, and when queries
come that actually need the actual column, the actual value range in this value,
then while you're going and reorganizing for this value range, you selectively plug
in only the updates that this specific query requires.
Then we worked on what actually made cracking possible in the complete view of
a database [inaudible] which allowed us to run completely [inaudible] queries,
which we call sideways cracking, and this is what I will talk about in the rest of
the talk.
There's some work on joins. I'm still trying to publish that. And our latest work
is on adaptive indexing, which is basically about doing partitioned cracking,
trying several optimizations to also fit this on disk, and then playing around
with how much initialization cost and how fast a convergence you can have,
and what the balance is between these kinds of things.
So here's a slide, an overview slide, visually about the techniques. So let's say
you have these two tables here, and think of the empty rectangles over there as
the indexing space for these tables.
So first thing is that we do partial materialization. So we don't build complete
indexes, but selectively, when queries come, we just pick out only the values that
the queries request.
So we create these small pieces over the tables and we also index those
spaces, and we continuously reorganize data inside these spaces, keeping up
this information.
When we run out of storage -- we have a storage threshold -- then we can just
selectively drop these pieces using LRU and create new ones. And when we
want to use two columns in the same query plan, then we have to make
sure that these columns are aligned, so we don't do tuple reconstruction, which
is practically the most expensive operation in a column store.
So when we want to use two columns together in the query plan, we don't try to
do a [inaudible] join. Instead we just make sure that these columns have been
reorganized in exactly the same way, so that -- let's say in position x of
column A you find, let's say, tuple x. Then exactly in the same position
in column B you find exactly the same tuple.
And if you reorganize the columns in exactly the same way, then you can make
sure this is true.
Once pieces become small enough to fit in the [inaudible], we sort them, which helps
for administration, for concurrency control, for various issues, and when we have queries
across tables then we can use this kind of information for efficient zones.
Okay. So I will go on now to talk about tuple reconstruction and why this is
important both for cracking and for column stores.
So this is typically what a column store looks like. You have fixed-width
dense arrays. You have positional information which is implied. So you know
that in the first position, let's say, of column A you find the first tuple, the value of
A for the first tuple, and in the first position of column B you will find the value of
B for the same tuple, and so on.
And this simple function here gives you information about how to browse there.
So the only thing that you need is the starting point of the array and the data size
of the array, and then you can quickly jump into the proper position if you're
looking for a specific value.
Now, this is how typical query processing happens in a column store. So we have
this query here with two selections and one aggregation. So we bring the first
column into memory, we do a quick selection, and then the result is a set of
qualifying IDs. And these are the positions of the values that qualified.
So we materialize -- this is bulk processing -- so we materialize this set of IDs.
And now we're going to do a tuple reconstruction action. So using these IDs we
go and fetch the qualifying B values. This is what we call late tuple
reconstruction.
And then we have the B values, we can run the second selection, and then we'll
do another tuple reconstruction action because we need the C column.
And then we can do our aggregation.
So these little tuple reconstruction actions happen very often in query plans,
and you have to be very efficient about them. And column stores are efficient about
it by making sure that you can basically always do sequential or skip-sequential
access, by having ordered IDs for [inaudible] results.
So it looks a little bit like this. So on the top of the slide you see a set of row IDs.
You can think of this as intermediate result. And this is ordered. So once this is
ordered you can do a skip sequential access which can be efficient.
If it's not ordered, and this can happen after operators like join and [inaudible],
then you have to do random access. Of course, [inaudible] what you do is you
just sort this intermediate result and then you do skip-sequential access again.
Now, for cracking the thing is that even after selections, because we have
reorganized the data, this row ID list is again unordered. So we gain a lot of
time, we have a lot of benefit in the selection, but then when we go to tuple
reconstruction, especially if you have more and more columns in the query plan,
this row ID list being unordered becomes a problem.
And this is an example of that. So here we run a more complicated query than
the ones I was showing you before. So here we have a few selections on two
different tables and we have a join and multiple aggregations, so we're using
quite a few columns.
Here the left-most graph is total times and then the two other graphs, basically
they split the tuple reconstruction time before and after the join. But the main
point that I want to show is that -- okay, you see the red [inaudible], which is plain
MonetDB, performs quite stably, of course, because it always relies on scans and
has nice sequential access for tuple reconstruction.
Then you see this green [inaudible] here, so you see that it nicely improves at
first, so you give the first queries and it improves, but then as you make the order
more and more random because you continuously reorganize the initial columns,
then you suddenly have more and more random access patterns, and then
performance becomes even worse than the initial MonetDB performance.
Then this blue [inaudible] is the sideways cracking, which [inaudible], it solves
this problem, and this -- and I will talk about this technique now. And this is
[inaudible] here, this is the, let's say, the perfect performance that you're going to
get in the column store.
So if you know from [inaudible] the column store projections, which is that you
take a copy of a table, you completely sort it based on one column, and then you
propagate that sort order to the rest of the columns, then basically you have a
perfect index, but that purely depends on the queries, on the workload.
So if you have prepared that up front, then performance can be very fast
because you rely on binary searches. In this case, for example, it takes 12
seconds to prepare this index, but then it's super fast.
So let's see how this sideways cracking deal works.
So, again, I'm going to run a similar example to the initial one. So we have
these two columns, and this query here refers to two columns of this table. So a
selection on one column, and then you need to do a projection on a second one.
So the difference here is that now we don't work on one column at a
time but we work on two columns at a time. So we call these cracking maps,
something that maps from column A to column B.
And we do exactly the same organization as before. So we create this sort of
range partition depending on how queries arrive, but now we work on two
columns.
So basically we organize based on the head column, which the head column in
this case is attribute A, which is the selection column where the selection refers
to, and the tail column, which is the projection column, follows this order.
So we don't have to do any tuple reconstruction, we don't have to do any
expensive join, especially with random access, but we can find the B values.
They are just there after we do the reorganization.
And it's exactly the same as before. So another query comes on a different
request and different range, and we keep reorganizing this thing and we keep
carrying the tail values as well, so we get them for free on the tail of these
cracking maps.
And, again, as before, this is practically the select operator, only that, of course,
it's not really a select operator. We call it now select-plus-project, because what
you get in the end is the B values.
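A minimal sketch of a cracking map and of this select-plus-project operator, assuming a freshly created map and ignoring previously cracked pieces for brevity; this is an illustration, not the actual implementation.

    def crack_map_in_two(amap, value, start=0, end=None):
        # amap is a list of (head, tail) pairs, e.g. (A value, B value).
        # Reorganize amap[start:end] in place so that pairs with head < value come
        # first; the tail values travel with their heads, so the map stays aligned.
        end = len(amap) if end is None else end
        pos = start
        for i in range(start, end):
            if amap[i][0] < value:
                amap[pos], amap[i] = amap[i], amap[pos]
                pos += 1
        return pos                       # heads < value sit before this position

    def select_project(amap, low, high):
        # Select head in [low, high), project the tail: crack on both bounds and
        # read the tails of the contiguous middle piece -- no tuple reconstruction.
        p_low = crack_map_in_two(amap, low)
        p_high = crack_map_in_two(amap, high, start=p_low)
        return [tail for _, tail in amap[p_low:p_high]]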
Now, internally if you have multiple of these cracking maps, because queries did
not just refer to two attributes but they referred to multiple attributes, what you're
going to do is that you're going to have these maps aligned.
And if they're aligned, then you can have correct query results. So not having
aligned maps means that -- let's say we reorganize this map MAB; then if you want
to use MAB and MAC in the same query plan, then we can never have correct
results because our alignment is wrong. I mean, we might be [inaudible], but
internally we would have wrong results.
So what we do, again, is we don't do joins -- we don't do joins in the sense
of traditional tuple reconstruction in column stores. Instead what we do
is we keep a history for each map where we mark what kind of reorganization
actions we have to replay on this map to make sure that it is aligned with the rest of
the maps.
So practically there's a central history of cracking actions per map set, and a map
set is all the maps that have the same head column, so they have been reorganized
based on the same column.
And then each map has a position in this history, and all we have to do is replay this
history at the proper time, on demand, to make sure that the two or more
maps are aligned.
So basically in this case we would do exactly the same reorganization as we did
in MAB, we'll do it in MAC and maybe in the rest of the maps as well if we need
them for the same query plan, and then everything is aligned.
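Here is a hedged sketch of such a per-map-set history with replay on demand; it reuses the crack_map_in_two routine from the sketch above and cracks whole maps rather than individual pieces, purely to keep the illustration short.

    class MapSet:
        # All cracking maps with the same head column share one history of crack
        # actions; each map remembers how much of that history it has applied.
        def __init__(self, maps):
            self.maps = maps                       # name -> list of (head, tail)
            self.history = []                      # crack values, in arrival order
            self.applied = {name: 0 for name in maps}

        def align(self, name):
            # Replay the actions this map has missed, so that it becomes
            # tuple-aligned with the other maps of the set.
            for value in self.history[self.applied[name]:]:
                crack_map_in_two(self.maps[name], value)
            self.applied[name] = len(self.history)

        def crack(self, name, value):
            self.align(name)                       # catch up on the history first
            crack_map_in_two(self.maps[name], value)
            self.history.append(value)
            self.applied[name] = len(self.history)

Because the partition routine is deterministic, two maps that start from the same row order and replay the same crack values in the same order end up with identical tuple orderings, which is the alignment property needed here.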
Now, I'm going to show you a more detailed example of that. So let's say we
have these three columns here and we run a simple query on two of them,
attributes A and B.
So this is the first query. So we go ahead and create this map MAB and we
crack it based on this first range predicate.
Then another query comes. Now it's again on AB, but a slightly different range
request.
So we do again the -- oh, no. Okay. I was confused.
The second one, this here, is MAC. So we create a different map, MAC, and we
crack it again independently.
And then another query comes with requests both B and C on the same query
plan. Again, a selection on A, but now requests both B and C.
So we're going to use maps MAB and MAC to do that, and independently they
create correct results, but once we try to merge the result together, then the
alignment is wrong.
So this is what it would produce, and you see that basically the result contains
slightly out-of-order tuples. So the first tuple says B4, C6, which is not correct. So
the information there is correct, but it's not in the proper order.
So you can think of doing things like plugging in some extra columns with some
IDs and positions and then doing some reordering, but that turns out to be too
expensive.
So what we do is this thing that I was talking before about, which is that we try to
align adaptively these maps.
Now, another approach that one could think of is that, well, the first time that you
had to create, let's say, MAB, you also create MAC, and then everything will be fine.
But then that's completely against the motivation for using a column store
in the first place, because the first query, for example, requests and wants
to load only A and B. Now, if you go ahead and load also C and D and whatever
other columns this table has, then you're defeating the purpose of using a column
store and having better IO and things like that.
So our solution is that we have this history, these steps I was talking about, and
we try to adaptively replay cracking actions across maps. And let's see how this
works.
So now this is exactly the same example as before, but it has some extra steps.
So the first extra step is that when we run the second query, before we do the steps
that we did normally, we first do this operation.
And this operation is basically cracking MAC but using the predicate of the first
query, and then we can do the predicate of the second query.
So actually this extra action is not used for this query itself, but it will be useful for the
next query, when you want to have things aligned.
So when we go to the next query, this is the extra step now. So before doing the
step that we normally do, we first take MAB and we crack it
based on the predicate of the second query first, and then both maps have been
cracked with the same predicates in the same order.
So when we reach this state here, MAB has been cracked first with A
smaller than 3 and then with A smaller than 5, and MAC has been cracked first
with A smaller than 3 and then with A smaller than 5. So they have been cracked with
exactly the same range predicates in exactly the same order, so they are
aligned. They have tuple alignment.
And then we can go ahead and apply the [inaudible] query predicate, the third
query predicate, and then we get the correct alignment for the result.
>>: So aligning [inaudible]
>> Stratos Idreos: Well, in this case you have to do some extra work through
alignment. I will show you how we minimize that.
Basically you always have to replay the tape. If you are a bit, let's say, in the past
in the history by 10 queries -- if another 10 queries were run on another
column -- you have to replay these 10 queries on the column you use now.
That's the only way they are aligned.
Other ways that we haven't really played with is that you can create copies,
multiple copies, and you have multiple synchronization points in there. But that
needs some extra storage.
>>: So is that the only -- let's say you don't want alignment. Can you sort it? Is
there another way of --
>> Stratos Idreos: You can --
>>: Is there a cost-based version of this? Let's say you have to catch up on too
many [inaudible]. Would you ever use the system to say not to do the alignment? Is
there even a feasible query plan that does not align?
>> Stratos Idreos: I see what you're saying. No, we don't have this option. So
we always do this kind of thing. We just make sure that -- and I'll show you later
how to do this more efficiently. Because if indeed the history is too big, then you
can have a big cost trying to align it correctly. But there are ways to minimize that,
but, no, we don't have that choice.
>>: Do you ever do the normal thing of -- or say the historical thing, which is just
to sort the record IDs with the data?
>> Stratos Idreos: Yeah, yeah. But this is more expensive. So we've tried that,
sorting and also [inaudible] clustering, because [inaudible] clustering is cheaper
than sorting and still you get nice access patterns, but still that's more expensive.
So there are comparisons about this in this SIGMOD online paper.
Okay. So let me now show you how you approach these queries when you have
multiple selections.
So actually for multiple selections you can think of creating wider maps, but then
there are too many combinations of that. So what we end up doing is using
these single map sets, which means that we have only these binary maps with
the same leading column, and using bit vectors to organize things.
So let's run an example again. We have these four
columns here, and this query that does three selections and one projection. So
the first part, the first operator, will create map MAB and it will use the range
request on A, on column A, to crack this map.
So you reorganize this map based on A, and this is the organization: what
A requested is in the middle.
And then you have, on the tail, the B values. And you can scan and
filter out which B values qualify, and then you create a bit vector which maps
exactly to this area and points to the qualifying B values.
So the extra step here is that we have to scan the tail for this particular area and
create the bit vector.
Then the second operator will create map MAC if it doesn't exist, and the first
thing that it will do is again run the A predicate to crack this map. So again
you create this area here based on the predicate of A, and then this bit vector
here exactly shows which C values qualify from A and B.
And then you can go and take exactly these C values, and if they qualify for the
new C predicate, then you have to update the bit vector.
And the last operator, to answer this query, this is the operator that wants to take
out the D values, it uses map MAD. Again, it will crack MAD based on the
predicate on A, and then the bit vector that qualified from here reflects exactly the
qualifying D values. And then you have the result.
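A rough sketch of this plan (the map names follow the example; the data structures are simplified stand-ins, not the MonetDB operators): every map in the A-leading set is cracked with the same A range, so the qualifying area sits at the same positions in each map, and one bit vector over that area is refined by the B and C selections and finally used to project D.

```python
# Sketch of SELECT D WHERE a1 <= A < a2 AND b1 <= B < b2 AND c1 <= C < c2
# over the A-leading map set {MAB, MAC, MAD}; maps are lists of (A, tail) pairs
# that all start from the same tuple order.

def crack_range(cmap, lo, hi):
    """Partition the map into < lo | [lo, hi) | >= hi; return the middle area's bounds."""
    left  = [p for p in cmap if p[0] < lo]
    mid   = [p for p in cmap if lo <= p[0] < hi]
    right = [p for p in cmap if p[0] >= hi]
    cmap[:] = left + mid + right
    return len(left), len(left) + len(mid)

def select_d(MAB, MAC, MAD, a_rng, b_rng, c_rng):
    s, e = crack_range(MAB, *a_rng)                          # crack MAB on A
    bits = [b_rng[0] <= t < b_rng[1] for _, t in MAB[s:e]]   # filter the B tail

    crack_range(MAC, *a_rng)                                 # same crack => aligned area
    for i, (_, t) in enumerate(MAC[s:e]):
        if bits[i]:
            bits[i] = c_rng[0] <= t < c_rng[1]               # refine with the C predicate

    crack_range(MAD, *a_rng)                                 # aligned again
    return [t for i, (_, t) in enumerate(MAD[s:e]) if bits[i]]   # project D
```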
And [inaudible] you can use the index to infer some optimization information, like
which map set you want to use. Here I'm using the
map set of A, that is, the maps that always have A as the leading column, but it
might be more beneficial to use the one of B or C. So if you actually have one of
them around, you can do a quick search on the index and see which one has
more partitioning information, say, or infer selectivity and things like that.
>>: [inaudible]
>> Stratos Idreos: Yeah, that's what I said just now. If you have that, if you have the
BC, BA maps lying around, then you can do a quick search on this index and
see which ones you want to use, but if you don't have them, we just randomly create
those.
Okay. Let me quickly talk about partial cracking as well. So normally we create
these full column maps, and then if we [inaudible] we have to drop them and
create new ones.
So what we do in partial cracking is that we basically selectively
materialize these cracking maps. And it looks a bit like this. So you selectively
go ahead and create and drop little pieces of these maps depending purely on
what the queries want -- you materialize exactly the values that the queries touch.
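Here is a minimal sketch of that idea (my own illustrative structure, assuming non-overlapping query ranges and a budget counted in tuples, not the MonetDB implementation): only the slices of a map that queries actually touch get materialized, and individual pieces can be dropped when the storage budget is exceeded.

```python
from collections import OrderedDict

class PartialMap:
    """Materialize only the value ranges that queries request; evict old pieces."""

    def __init__(self, base_head, base_tail, budget):
        self.base = list(zip(base_head, base_tail))   # the source columns
        self.pieces = OrderedDict()                   # (lo, hi) -> materialized piece
        self.budget = budget                          # max tuples kept materialized

    def fetch(self, lo, hi):
        key = (lo, hi)
        if key not in self.pieces:                    # materialize on first use
            piece = [p for p in self.base if lo <= p[0] < hi]
            self._evict(len(piece))
            self.pieces[key] = piece
        self.pieces.move_to_end(key)                  # mark as recently used
        return self.pieces[key]

    def _evict(self, incoming):
        used = sum(len(p) for p in self.pieces.values())
        while self.pieces and used + incoming > self.budget:
            _, dropped = self.pieces.popitem(last=False)   # drop the oldest piece
            used -= len(dropped)
```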
And here's a representative performance graph where we have a limited storage
threshold and we continuously change the workload, which means that you have
to drop some of your indexes and create new ones as the workload changes and
you don't have more space. The x axis indicates how often the workload
changes, so on the right it changes basically for every single query, and
the y axis is the cumulative response time to run a thousand of these queries. The
blue [inaudible] is the full maps, which suffers quite a lot because
you have to drop the complete index, create the complete index again, and
start cracking from scratch, while with partial maps, of course, you have a much
finer granularity, so you can be much more effective there because you don't
have to drop the complete indexes. You hit the threshold much later and so on
and so forth.
Okay. And, again, here is a more fine-grained example of how this looks. So,
again, we have our columns, and this is how a partial sideways cracking index
looks. Here we go ahead and materialize only the small piece that the
query requested, then we crack and index this small piece, and we have
enough information to know what we've indexed and what we haven't indexed so
far.
Then when queries come that want to refine this index information, we'll go
ahead and reorganize the small pieces. When queries come and request even
more data, we'll have to fetch this data, but we'll have to make sure that
everything stays aligned as well, so we have to go through this [inaudible],
which is a [inaudible] that coordinates whatever happens with all the
maps with the same leading column.
And, again, when we have different attributes requested, again, you have to
make sure that everything is synchronized, and also different maps can have
different portions of data materialized.
So that will be the last slide for technical stuff. I will show you how partial
alignment works.
So in general you have all these pieces, and the red part here indicates what the
actual query requests, a range request. You have to crack at most the two
pieces at the edges of this range request, because everything in between will
already be okay for this range request. So think of this as a column where the
pieces are ordered relative to each other; inside each piece there's no order. When
a new request comes, let's say it falls here, we have to crack these two pieces.
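As a self-contained sketch of that rule (simplified structures of my own, not the actual cracker index): the index keeps the column as value-ordered pieces, so a range request only has to partition the piece containing the lower bound and the piece containing the upper bound; everything in between already qualifies.

```python
import bisect

def crack_on(col, index, v):
    """index: sorted list of (value, pos) meaning col[pos:] >= value.
    Partition only the piece that contains v and record the new boundary."""
    values = [b[0] for b in index]
    i = bisect.bisect_right(values, v)           # piece whose bounds contain v
    start = index[i - 1][1] if i > 0 else 0
    end = index[i][1] if i < len(index) else len(col)
    piece = col[start:end]
    smaller = [x for x in piece if x < v]        # reorganize just this one piece
    col[start:end] = smaller + [x for x in piece if x >= v]
    pos = start + len(smaller)
    index.insert(i, (v, pos))
    return pos

def range_select(col, index, lo, hi):
    p_lo = crack_on(col, index, lo)              # at most two pieces are touched
    p_hi = crack_on(col, index, hi)
    return col[p_lo:p_hi]                        # everything in between qualifies

# e.g. range_select([7, 2, 9, 4, 1, 8, 3, 6], [], 3, 7) returns 4, 3 and 6
# (in some order) and leaves the column cracked at values 3 and 7.
```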
Now, if you think of how this happens with multiple map sets, where you also have
to worry about alignment, well, the first observation is that, thinking in an
adaptive way, there is no reason to align pieces of the data that are not used
in this query.
So the first thing we do is go and align only the pieces that are actually used in
this query.
The second thing is that you only need to do full alignment for the actual pieces that
you're reorganizing, because only these pieces are affected by the [inaudible]
organization. Full alignment means you have to replay the whole history.
For everything in between you only need to do partial alignment, and partial
alignment means that you only have to make sure that the particular
columns you're going to use for this particular query, for this particular value
range, are aligned. So basically you take the set of columns that
this query refers to, you find their common place in the history, and you align
up to this point.
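One reading of that rule as a small sketch (my interpretation of "their common place in the history"; the names and bookkeeping are illustrative, not MonetDB's): a piece that is physically reorganized by the query must replay the whole missed history, while a piece that is only read just catches up to the furthest history point already reached by any of the maps this query uses.

```python
def cracks_to_replay(history, applied_per_map, piece_is_reorganized):
    """history: the map set's pivots, oldest first.
    applied_per_map: for each map used by the query, how many of those pivots
    this piece has already seen.  Returns, per map, the slice of the history
    (start, end) that must be replayed on this piece before the query runs."""
    if piece_is_reorganized:
        target = len(history)                      # full alignment: replay everything
    else:
        target = max(applied_per_map.values())     # partial: catch up to the
                                                   # furthest map used by this query
    return {name: (done, target) for name, done in applied_per_map.items()}

# e.g. the set's history has 10 cracks; for this piece MAB has applied 7 of
# them and MAC only 5.  If the piece is merely read, MAC replays cracks 5..6
# (partial alignment); if it is being reorganized, both catch up to all 10.
```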
So then alignment looks a bit like this. Depending on how often the workload
changes, if it changes too often, then it doesn't make any difference, but if it
changes in big steps -- so this is every hundred queries, every 200 queries, then
using full alignment you can see the spikes, but if you use adaptive alignment like
this, then things become much smoother.
And here is a more representative graph of the whole performance; this is a
TPC-H query. The x axis is running multiple random instances of this query,
and the y axis is response time.
So the red curve is plain MonetDB, the green curve is selection cracking, and the
purple curve is presorted MonetDB, that is, creating the proper -- the perfect
sorted projection, which takes anything between three and 14
minutes depending on the TPC-H query. But then, of course, performance is
extremely nice.
And the blue curve is sideways cracking. So you see that the first query with
sideways cracking takes less than a second, basically, while in the presorted case
it takes whole minutes. And already after five queries, performance nicely
approaches the performance of the full index, okay?
And then already after a few -- 10 queries, 15 queries, performance nicely
stabilizes at these levels, which is because the [inaudible] is quite skewed in this
case. You always request similar ranges, so the index adapts very nicely
after only a few queries.
So without having to know in advance, you know, which queries will come and
which columns are useful, and without spending the actual tuning time up front,
you can have excellent performance.
And now I'll close with a couple of slides on what is next, basically, and this is also
related work; most of the work on indexing has come from this building or
this area.
Now, what cracking brings to this area of physical design is that it pushes all this
functionality inside the kernel, and it proposes a way to do this in a
partial and incremental way inside the kernel without [inaudible] tools.
So if we see this from a high-level point of view, this is how offline indexing
works. You do things one step at a time. So first you do the analysis, then you
do the index building, then you do the query processing.
In online indexing you can mix things. So while doing query processing you can do
analysis and index building, but then query processing, of course, suffers.
And in adaptive indexing there are no separate steps. Everything happens in one
step inside the kernel.
And we can also see this chart regarding idle time and workload knowledge.
Offline indexing needs a lot of idle time and a lot of workload knowledge;
online indexing is somewhere in between because you can do things while
processing queries; while adaptive indexing needs basically zero idle time and
zero workload knowledge.
And now I would like to end with what I think is a bit of nice future work.
Basically I'm not claiming that cracking is the ultimate solution, but using all these
techniques and all the capabilities that you get from each one of them, you might
be able to build a system that, you know, exploits every bit of idle time that you
have, exploits every single bit of workload knowledge that you have, whether you
create or get this knowledge offline or online, and that also does things partially
and adaptively inside the kernel, such that it doesn't impose any overhead at
query time.
And these are some of the topics that are currently open. The bold
ones are the ones that I'm actually quite actively working on right now: disk-based
cracking, concurrency control, and auto-tuning tools.
And for the last slide I would like to point you to what I think are two very nice
papers. So, regarding initialization [inaudible], it's not really about indexing. The
first thing you have to do is load the data inside the database. Again, think
about examples where it really makes a huge difference to be able to query your
data very fast.
And we have this very nice, I think, paper with people from EPFL where we take
similar ideas to what we do with database cracking, but we move them to the
loading part. So how you can query a database without the data even being
loaded, and adaptively bring the data inside the database, index it, and so on
and so forth, based on the queries.
And then, even after you've done the loading, when you're dealing with [inaudible]
sets you don't need only exact answers. So we have this nice paper in PVLDB, which
also won the best paper award in the visions session, where we say how you can adapt
the actual kernel in the same way that we do with database cracking, bringing
the functionality of [inaudible] processing all the way down to the
database kernel. So how you can change, let's say, your algorithms for
approximate query processing without doing any [inaudible], let's say.
Anyway, I think these are our indexing papers, if you want to take a look, and with
that I'm going to close. Thanks.
[applause]