Jonathan Goldstein: Hello. My name is Jonathan Goldstein. And I would like to introduce Milos Nikokic, who is visiting
us from EPFL in Switzerland. He is going to be talking to us about high performance incremental query processing. He
was an intern here last summer. So without further ado, I'd like to introduce Milos.
Milos Nikokic: Thank you, Jonathan.
Okay. Before we start, let me just say a few words about me. I'm a sixth-year Ph.D. student in the [indiscernible] EPFL
working on --
>>: [Indiscernible].
Milos Nikokic: It's on. It is on.
Jonathan Goldstein: Oh, it is on.
Milos Nikokic: Okay. Yeah, so I'm a sixth-year Ph.D. student at EPFL. I work in the data lab under the supervision of
Christoph Koch. And as Jonathan said, I spent my last summer here working on some exciting stuff on adding support for
DSP operations in Trill. And, yeah, I would like to thank Jonathan and Chris for the great summer that I had. And I would
also like to thank you for this opportunity to present my work to you.
So my research focus is mostly on data management systems; more specifically, on stream processing and incremental
computation.
And today I'm going to talk about how to build efficient incremental query processing engines for complex analytical
queries.
So the world is increasingly going digital, with data being generated from many different sources. For example, sensor
networks in the context of the Internet of Things, social networks, or platforms for mobile devices. And with such
massive amounts of data being generated online, it is increasingly important for companies to react quickly to
events that can affect their business decisions.
So many of today's realtime applications require realtime analytics over large and dynamic datasets. For instance,
[indiscernible] analytics, sensor network monitoring, or cloud monitoring are all instances of applications that require
some sort of realtime analysis.
What is important for these applications is that they monitor some sort of state of the underlying data. And this state is
usually defined as views or aggregate representations of the underlying data.
So at a very high level, the general workflow of these applications can be described as follows. As input, these systems
receive a stream of continuously arriving data. This could be software updates, or this could be sensor readings or user
clicks. These events are processed by a runtime engine that continuously evaluates certain views over this data and produces
results for the decision support system, which takes corresponding actions.
For these applications, data changes at very high rates. So it is of crucial importance for users to have fresh views of
the data in order to be able to promptly react to certain conditions. So it is of key importance for these runtime
engines to provide answers in a timely manner, with low latency.
An additional requirement is that users should be able to express complex analytical queries, in order to discover
insights from the data using complex analysis.
And finally, big data applications require scalable processing, in order to speed up query evaluation or to handle growing
memory requirements.
So which of the existing systems can we use to build such runtime engines? When we talk about streams of data, the first
thing that comes to mind is stream processing systems. Typical classical stream processing systems process a stream
of events using continuous queries over windows of data, where a window is just a finite subset of the event stream. As the
window advances over the input stream, the engine recomputes the result for each window.
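As a rough illustration of this window-based model (a minimal sketch, not any particular engine's API; the event values, window size, and slide are invented for the example):

```python
from collections import deque

def sliding_window_sums(events, window_size, slide):
    """Classical stream processing: as the window advances over the
    input stream, recompute the aggregate from scratch per window."""
    window = deque()
    results = []
    for i, e in enumerate(events, start=1):
        window.append(e)
        if len(window) > window_size:
            window.popleft()             # oldest event falls out of the window
        if i % slide == 0:
            results.append(sum(window))  # full recomputation, no reuse
    return results

print(sliding_window_sums([1, 2, 3, 4, 5, 6], window_size=3, slide=2))  # [3, 9, 15]
```

Each emitted result rescans the current window; incremental processing avoids exactly this recomputation.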
Now, stream processing engines typically provide low latency processing of data. However, they often lack the support
for expressing complex analytical queries; for instance, queries with nested aggregates. In addition,
since these engines work over overlapping windows of data, they're often based on the notion of incremental processing.
So the idea of incremental processing is the following. We are given a database and a stored, materialized representation
of some computation. Now, when the database changes, our goal is to update the final result. However, we don't want to
recompute everything from scratch, because that might be costly. Instead, what we would like to do is derive a so-called delta
computation, which is usually structurally simpler and works with less data, and which allows us to express what has to be changed
in the output given the current state of the database and the incoming update.
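A minimal sketch of this idea, using a SUM aggregate over a single relation (the class and method names are mine for illustration, not DBToaster's):

```python
class IncrementalSum:
    """Maintain Q = SUM(R.a) under changes to R by applying only
    the delta, instead of rescanning the whole relation."""
    def __init__(self, rows=()):
        self.total = sum(rows)   # initial materialization of the result
    def on_insert(self, a):
        self.total += a          # delta computation for an insertion: +a
    def on_delete(self, a):
        self.total -= a          # delta computation for a deletion: -a

q = IncrementalSum([10, 20, 30])
q.on_insert(5)
q.on_delete(20)
print(q.total)  # 45
```

Each update touches one value instead of rescanning R; that is the whole point of the delta.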
Now, the idea of incremental processing has been successfully applied in many commercial database systems in the form of
incremental view maintenance. Here, for a given materialized view defined as a SQL query over a set of base relations, the
system derives delta queries for updates to each base relation and executes those delta queries upon every change of the
database. And the execution of these delta queries is done using the classical query evaluation engine.
So both stream processing systems and database systems have some sort of incremental processing in them. So let's see how
well they perform in practice. We ran a benchmark here where we compare a commercial database system and a
commercial stream processing system. We compare them on a set of TPC-H queries; TPC-H is a well-known
benchmark for evaluating database systems. On the X axis we have a set of TPC-H queries, ranging from queries
that have only multiple joins all the way to queries that have nested aggregates in
them. What we measure here is the single-tuple performance, meaning that after each insertion or deletion into
a base relation, we're doing our best at keeping the result fresh. So we have a stream of events; in this case the
stream of events is generated from a TPC-H database and consists of interleaved insertions and deletions.
So the performance of these systems is not so great. Both the database system and the stream processing engine process
roughly up to 100 events per second, where after each event we want to maintain a fresh result. Are these numbers
good enough for realtime applications? Well, one can say probably not, because we are seeing much higher velocity in the
data nowadays.
So in the first part of this talk, I'm going to present a system called DBToaster, which combines
incremental processing with code generation to produce efficient, lightweight stream processing engines that are tailored for
the given query workload.
And as you can see from this performance graph, DBToaster can achieve up to five orders of magnitude better performance
than these commercial database and stream processing systems.
>>: [Indiscernible] it's only 10 to 100 per second. Why is it that [indiscernible].
Milos Nikokic: Sure. For database systems, basically, they're not made for this kind of workload, where you have
frequent insertions and deletions and you want to maintain some result. And in fact, in many cases incremental view
maintenance in database systems was often slower than complete re-evaluation, because of all sorts of
overheads that come from components that are not involved in query processing, like concurrency control,
[indiscernible] management, buffer management, things like that.
For the stream processing engine, the one that we used in our benchmark, basically we were not able to exploit the incremental
mechanism of this stream processing engine, because it is tied to the window semantics that it is meant for. But here
we have long-lived data that cannot be captured by a single window. So what we had
to do is internally maintain in-memory tables and then do incremental re-evaluation as much as we could.
>>: So how big a system were you running doing this test, these tests?
Milos Nikokic: So this was a single-node experiment. The test data was not that big. It was actually a stream of insertions
and deletions synthesized from a 1 gigabyte TPC-H database. And in many cases, the database and stream
processing systems were not able to process the whole stream within a one-hour time limit.
>>: [Indiscernible] what the hardware platform was; were you using a multicore, a NUMA machine?
Milos Nikokic: For DBToaster, it's single-core, single-threaded execution. For the database system, it was
multicore, I think, but I'm not sure that it actually was using multiple cores. So for DBToaster, it's for sure single-threaded,
single-node performance.
Okay. So the goal of my Ph.D. thesis was how to build efficient incremental processing engines for complex analytical
queries. And in this talk I'm going to focus on two types of analytical queries. First I'm going to talk about traditional SQL
analytics, which are much used in OLAP systems. These analytics typically involve aggregates over multiple dimensions
of the underlying data, or might involve queries with nested aggregates.
In the second part of this talk, I'm going to talk about more complex analytics which can be expressed as linear algebra
programs. These analytics allow us to actually discover insights from the data using algorithms and models
from machine learning, scientific computing, or data mining.
So I will start by talking about how to incrementalize SQL queries. So DBToaster is a compiler for database queries. It takes as
input a SQL query and produces a lightweight stream processing engine that efficiently maintains the query result
upon every change of the database. The compiler consists of two parts. The front-end part transforms SQL queries into
maintenance programs, where each maintenance program consists of trigger functions describing what needs to be done
on every update of the base relations.
Yes?
>>: So what class of queries can you handle?
Milos Nikokic: For incremental view maintenance? So, efficiently, all flat queries and nested queries where
the outer query has an equality correlation with the nested query. Right? Those we can handle efficiently. In all other cases, we
can fall back to re-evaluation. And DBToaster as a whole can support almost the full SQL.
We don't support nulls, and we don't support order by, sorting. But everything else should be in there.
>>: [Indiscernible].
Milos Nikokic: Yeah, we can handle that. I might defer the question of how to handle these nested aggregates to the Q&A
session. But we can do that.
>>: Those things you cannot handle, how many of them are intrinsic to your approach? It seems that nulls you can
probably add.
Milos Nikokic: Sorting, well, that's going to be costly, right. Obviously, there's one more thing: we cannot handle min
and max efficiently. For the moment we're rewriting those as nested queries, like "there is no other element that is greater
than this one." But in general, for min and max, we would have to maintain basically the whole base relation and then
incrementally maintain it. There might be some special cases if you have only insertions, but it's not that interesting.
Okay. So the front-end part is the one that transforms SQL queries into maintenance triggers. And the back-end part takes
those maintenance triggers and generates native code. For the moment we support C++ and Scala code generation for
single-node execution and Spark for distributed execution.
Okay. This generated code is self-contained. It doesn't require any database sitting behind it. It can be easily embedded
into these kinds of realtime applications, because we're just producing source code that consists of trigger functions. Or
it can be used in other ways, at the user's convenience.
So how do we go from queries to maintenance programs using classical incremental view maintenance? Well,
typically, we have the following steps. We're given a query, and the first thing we want to do is materialize
the result of that query. Then we derive delta queries for updates to the base relations. Once we have those delta queries
we can optimize them, and once we optimize them we can produce the trigger programs. And everything is done at compile
time.
>>: Question. You do this at [indiscernible] time or do you batch?
Milos Nikokic: We can do both. And I'll have some discussion about the tradeoffs.
So this is classical incremental view maintenance. Now, the problem is these delta queries. Although they are simpler,
and for single-tuple updates they have one join less and work with less data, they're not free. You still
need to execute them. Right? If you have an N-way join, a delta query still needs to join together N minus 1
relations. So the key insight of DBToaster is to repeat this procedure recursively.
Now, once we have these delta queries, we can materialize the update-independent part of each delta query, and then perform
incremental view maintenance for that delta query, creating higher-order delta queries. And with each iteration step,
these delta queries become simpler and simpler, until we get a delta that doesn't depend on the database
anymore, a zero delta. At that point we can stop.
So the overall idea of this recursive approach is to create a hierarchy of materialized views that maintain
each other. Right. At first it's going to appear counterintuitive: okay, we are storing more data, so how is that going to
help us achieve better view maintenance, right? But the key insight is that these additional auxiliary views are actually helping
us to significantly simplify the maintenance job that we have to do for each of these queries.
So how do we do that? We start from the top. We have our query, and then for each base relation we compute the
first-order deltas. Then for each delta we materialize the update-independent part of it. Then for each
materialized delta query, we compute the second-order deltas for its base relations. And we continue with this
procedure recursively until we reach the zero deltas at the end.
Now, we can use these deltas to maintain the views of their parents. So a second-order
delta might maintain a first-order delta, which might maintain the final query. That way we create a hierarchy of
dependencies between those materialized views. And all of those materialized views are self-maintainable, meaning there
is no need for some additional database system to help us maintain these queries.
So let's see how this compilation procedure works on a simple example. Here we have a simple two-way join query, which
joins together two relations, R and S, and has an aggregate on top of that. In this particular case, what we want to
produce are two trigger functions that will tell us how we're going to maintain the final result upon
every insertion into one of these base relations. For the moment we're going to consider just insertions, but we can
similarly construct delta programs for deletions or for arbitrary updates.
So let's see how we're going to do that. The result of this query is just a single value, so we're going to materialize this
single value using a variable Q as the first step of our procedure. And then we consider one new update to the relation R. The
new result for this query is exactly the old query where the relation R is replaced by a union of R and delta R. I'm slightly
abusing the SQL syntax here, but hopefully you understand the point.
However, we don't want to re-evaluate this query from scratch. We would like to derive a delta query which will help us
reuse the previous result. In this case the delta query is much simpler: it's the original query where R is replaced by delta
R. Now, we're going to use this delta query to incrementally update the final result.
Now, since this input change is very simple, we can actually exploit this simplicity to replace the variables A and B that
come from delta R with the actual constants delta A and delta B. Once we do that, we no longer have the join inside this delta
query.
Now, since delta A doesn't depend on S, we can use the distributivity of addition to pull this constant out in front of the
delta query.
What we have now is a delta query that depends only on the relation S, but it has this input variable delta B, which
we are going to get rid of by rewriting this query using [indiscernible] followed by a lookup into this materialized result.
Now we have reached the end of the first cycle, where we are going to recursively repeat this procedure. We have this delta query
which depends only on S, right, and we're going to materialize it, using a hash map for storing the materialized
contents.
>>: So if [indiscernible] is a key, wouldn't you be materializing a lot, because it's linear in the dataset size? If you were to
join on the keys, then this group-by would be on the order of the dataset.
Milos Nikokic: Yes.
>>: So you would still materialize it or you have some [indiscernible].
Milos Nikokic: Of course. Yes. And we'll see why. Once we have this materialized result, we can rewrite the delta query
as a simple lookup operation into this materialized view, this materialized map, followed by a multiplication. That is
the delta query that we are going to use to update the final result.
However, this new view that we just materialized needs to be maintained, right? So on every update to S, we need to
compute a delta query for this materialized view. And in this case, because we are considering only single-tuple updates, the delta query is just a simple constant, delta C, which is going to update only one element of the materialized view.
Okay. Now, we have created one additional materialized view, and we have created some statements in our trigger programs. Now
we can proceed again, starting from the top-level query and considering updates to relation S. If we do that, we are
going to, in a similar way, materialize one additional view over relation R and add two new
update statements. And that's it. That's the whole compilation procedure, and these are the trigger programs that we are
generating.
Notice the following. We materialize two additional views, right, but these views are not the whole base relations. These
are group-bys over certain columns that are used in the query, right. So we project away all unused columns and aggregate
over the ones that are used. So in case B is a primary key, right, we won't reduce the number of tuples, but we will
project away the A column because we don't need it for this maintenance.
So the important thing about these generated programs is that the triggers run in constant time. In this particular
example, if we have updates to relation R or S, we just need to perform a constant amount of work in order to update the
result Q and all our materialized views.
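The trigger programs for this example can be sketched as follows. I'm assuming the query is SUM(r.A * s.C) over R(A, B) joined with S(B, C) on B; the transcript doesn't spell out the exact schema, so treat the names as illustrative:

```python
class TwoWayJoinSum:
    """Incrementally maintain Q = SUM(r.A * s.C) for R(A,B) join S(B,C) on B.
    M_R[b] = SUM(A) over R grouped by B; M_S[b] = SUM(C) over S grouped by B.
    Each trigger performs a constant amount of work."""
    def __init__(self):
        self.Q = 0
        self.M_R = {}   # auxiliary view over R
        self.M_S = {}   # auxiliary view over S

    def on_insert_r(self, a, b):
        self.Q += a * self.M_S.get(b, 0)       # delta Q via a lookup into M_S
        self.M_R[b] = self.M_R.get(b, 0) + a   # maintain the view over R

    def on_insert_s(self, b, c):
        self.Q += c * self.M_R.get(b, 0)       # delta Q via a lookup into M_R
        self.M_S[b] = self.M_S.get(b, 0) + c   # maintain the view over S

v = TwoWayJoinSum()
v.on_insert_r(2, 1)    # R = {(2,1)}; no matching S tuples yet
v.on_insert_s(1, 10)   # S = {(1,10)}; Q += 10 * 2 = 20
v.on_insert_r(3, 1)    # Q += 3 * 10 = 30
print(v.Q)  # 50
```

Note there is no join anywhere at runtime: each trigger is one hash lookup plus one hash update.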
So in general, our trigger programs might not have a constant running time. But in many cases, this approach provides
benefits that cannot be achieved through other incremental view maintenance techniques. For example, classical
view maintenance techniques cannot incrementally maintain the result of this query in a constant amount of work.
Okay, once we have these trigger programs, the second phase of our compiler is the actual code generation phase. Here
we're going to use specialized data structures for storing the contents of these materialized views. In our case, since these
maps represent group-by queries, we're going to use a hash map with memory-pooled records for storing the contents of
the two additional views. And we are going to specialize the type of the final result to be a
primitive type.
Now, for generating the actual trigger programs, we just need to transform these maintenance statements into actual
code. In this particular case it might seem straightforward, because they are very simple. But in general, the code
generation might also involve some join processing.
Okay. So how does this perform in practice? Well, this is the graph that we have already seen, except that we have another
column here, which is DBToaster with classical incremental view maintenance. With that
additional figure we want to measure the impact of our recursive incremental compilation algorithm.
So as you can see, just by using DBToaster's code generation we can get one to two orders of magnitude better performance than
existing commercial database systems. But if we also use this recursive incremental view maintenance approach,
we can get up to five orders of magnitude better performance on average than these systems.
Yeah?
>>: [Indiscernible]?
Milos Nikokic: So using DBToaster obviously has some memory overhead. It pretty much depends on
the query that you are targeting. But considering a typical star schema query, all the materialized views
will not have more tuples than the fact table. That's for one. And in practice, they will usually
have much fewer tuples. Why? Because we are going to inline the static conditions for a given query whenever possible,
to filter out tuples that don't match those static conditions. That's one.
The other thing is, we are going to project away columns that are not used in that particular query workload. So we
don't need to store the whole base relation. And the third one: these materialized views are aggregate
representations of the base tables. So if we aggregate over some columns with a very small domain, right, that might have a
very small memory [indiscernible].
But in the case of, say, TPC-H queries, it's usually much lower than the size of the base relations.
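To illustrate why aggregating over a small-domain column bounds the view size (a toy example with an invented orders relation, not from the talk):

```python
# Hypothetical view: SELECT status, SUM(amount) FROM orders GROUP BY status.
# Its size is bounded by the domain of `status`, not by the number of orders.
orders = [("open", 10), ("closed", 5), ("open", 7), ("closed", 1), ("open", 2)]

view = {}
for status, amount in orders:
    view[status] = view.get(status, 0) + amount

print(view)       # {'open': 19, 'closed': 6}
print(len(view))  # 2 entries, regardless of how many orders arrive
```

However many tuples stream in, the view never grows past the number of distinct `status` values.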
>>: A couple of questions. So one is, the idea that you can materialize additional views for the sole purpose of speeding
up maintenance [indiscernible]. This is one idea that you have. Did I misunderstand?
Milos Nikokic: No.
>>: So a few years ago there was a series of work done by [indiscernible] and others which essentially proposed the
same idea. So can you comment on the [indiscernible].
Milos Nikokic: Sure. So the problem of view selection, and using these views to answer other queries, right, is
nothing new; it's not something that was invented by DBToaster. It has been done before. What's new in DBToaster is this
recursive incremental view maintenance, where we have a way of transforming one query into a hierarchy of
delta queries which maintain each other's results. Right. Yes, that might also have been possible to achieve before, right.
But there's also a nice theoretical result, which was not done by me, right, which says that if you use this recursive view
maintenance technique, then for every value inside every materialized view that you have in your system, you just
need to perform a constant amount of work to update that value. So that means that this
hierarchy of materialized views is essentially embarrassingly parallel. Right. Whether you can
exploit that in practice, that's another question. But it's a nice theoretical result that separates classical incremental view
maintenance from recursive incremental view maintenance.
>>: The question is, the speedup that you get, can you break down how much you get from the CodeGen versus --
Milos Nikokic: Yeah, so that's the DBToaster with IVM. So DBT-IVM, that's just code generation with classical
incremental view maintenance. And DBToaster is code generation with recursive incremental view maintenance. So as
you can see from this graph, using code generation and classical incremental view maintenance, you can get one to two
orders of magnitude better performance. But the real benefit comes from using this recursive incremental approach, where
you can get up to five.
>>: [Indiscernible]
Milos Nikokic: So, can we handle outer joins: we don't support nulls at the moment. So we actually had to change some
of these queries to replace outer joins with inner joins, and to remove order by clauses, because we don't
support sorting.
>>: [Indiscernible].
Milos Nikokic: It is a problem, because you would have to maintain the whole, it's like min and max: you would have
to maintain the whole history of these base values. Right. Which we don't support for the moment.
>>: [Indiscernible]
Milos Nikokic: We have the full TPC-H benchmark, although we modified some of the queries, of course, right, to avoid these
restrictions. But we don't support them directly. So we have results for TPC-H and also for some TPC-DS queries.
>>: So do you have a graph for the memory utilization for these systems?
Milos Nikokic: No. I don't have a slide. We have it in the paper.
>>: What does it look like?
Milos Nikokic: Well, it depends on the query, right. For some queries it can stabilize, because you materialize the view and
you have seen all distinct values of your keys, right. So whatever you aggregate on top of that won't increase your
memory consumption. But there are queries for which, as you keep seeing new data, new keys, new primary keys, the
memory consumption will grow, right. But that's the same case as with other systems. As for the relative ratio, I would say
that for this kind of workload, the size of these materialized views is usually much smaller than the size of the base
relations, just because we project away columns. We keep only the minimum amount of data that we need to maintain
those queries.
Okay. So, yeah, one can say this is for single-tuple processing. What about batch processing? Database systems are much
better at doing that. So we did an experiment where we used a commercial database system and we varied the size of the
input batch that we are processing. And we compared incremental view maintenance
[indiscernible]. We can see that as we increase the batch size, obviously, as expected, the number of processed tuples per
second also goes up. However, even for very large batch sizes, these performance numbers are still around 300X slower
than DBToaster with single-tuple processing for this particular query. Other queries might differ, but still there is at least
one or two orders of magnitude better performance with DBToaster.
Now, the question is, can we use batching inside DBToaster and get similar performance improvements? Well, the answer is
yes and no. There is a traditional belief that batching will always improve performance. And that can be true, because it
might improve cache locality and it might amortize the invocation cost of these trigger functions. And batching is also
crucial for distributed execution, because we want to amortize network communication costs.
So how do we support batch processing inside DBToaster? We're again considering the same
query, right, a simple two-way join with an aggregate on top. For that query, we're going to materialize the result in a
variable Q, and then we're going to derive delta queries for updates to relation R. Right, the delta query is simple enough: we
just replace R with delta R. Then we can do some traditional query optimizations, like pushing the
aggregation down the expression tree in order to minimize the amount of data being joined. And in this case we can
exploit the distributivity of the sum operation to push the sum down the expression tree and compute aggregates over delta
R and S.
And if you notice, the right upper end of this join tree is exactly the materialized view that we also had for single-tuple
execution. So we're going to materialize this query, and we're going to repeat the same procedure for updates to the other
relation, S. In this procedure we're going to materialize an aggregate over the relation R.
Now we have two additional maps, like before, right: one of them is over S, one of them is over R, and the final result is in Q.
So in a trigger update for R, we need to maintain Q and MR. The delta query for MR is a simple aggregate
over the delta R input batch. And if you notice, that is exactly the same expression that is used for maintaining Q. So we
can exploit commonalities between those delta expressions to actually reduce some computation and evaluate the
delta expressions efficiently.
So in general, if we were to write a trigger program for this query, it would consist of three
steps. First, we're going to pre-aggregate the input batch that we receive. Second, we are going to use that pre-aggregated
batch to compute the delta of Q. And finally, we are going to use the pre-aggregated batch to update the materialized view.
These are the three basic steps. You don't need to understand them in detail. But the main point here is that we can
pre-aggregate the input batch, exploiting the fact that we have a collection of tuples: we can try to reduce its size as a first
step, and then use this aggregate representation in the further computation of our delta programs.
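The three steps above can be sketched like this, reusing the hypothetical Q = SUM(r.A * s.C) query and view names from before (illustrative only, not DBToaster's generated code):

```python
from collections import defaultdict

def on_batch_insert_r(batch, Q, M_R, M_S):
    """Apply a batch of R(a, b) insertions in three steps."""
    # Step 1: pre-aggregate the input batch, grouping on the join key b.
    pre = defaultdict(int)
    for a, b in batch:
        pre[b] += a
    # Step 2: compute delta Q as a join of the pre-aggregated batch with M_S.
    for b, a_sum in pre.items():
        Q += a_sum * M_S.get(b, 0)
    # Step 3: fold the same pre-aggregated batch into the auxiliary view M_R.
    for b, a_sum in pre.items():
        M_R[b] = M_R.get(b, 0) + a_sum
    return Q

M_S = {1: 10}
M_R = {}
Q = on_batch_insert_r([(2, 1), (3, 1), (4, 2)], 0, M_R, M_S)
print(Q)    # 50
print(M_R)  # {1: 5, 2: 4}
```

Pre-aggregation shrinks the batch of three tuples to two groups before the join, which is exactly where the batch-mode savings come from.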
Now, how are we going to translate this program into C++? Well, for pre-aggregation, we have to use intermediate collections,
right. That involves looping over the input batch and aggregating the results. For updating Q, we're going to evaluate this join
as an in-memory hash join, where again we're going to loop over this aggregate representation of the batch. And for updating R,
it's just looping over this pre-aggregated batch and updating the materialized view. Again, there's no need to understand this code.
But this is what a trigger program would look like if you were to compile this input query.
Now, compare this batch version with the single-tuple version. Obviously, the single-tuple version is relatively
simple; it is just a few statements, whereas the batch version performs much more computation for
a given input size. Notice also the other differences here. First, the single-tuple version has specialized
primitive-type trigger parameters, whereas in the batch case we have a generic list of tuples. That is actually a
powerful optimization that allows the C++ compiler to treat delta A and delta B as constants and move them out of loops
during compilation to machine code.
Second of all, the single-tuple version doesn't have any intermediate results. We can eliminate those loops as well, because
we know that the input batch is of size 1. And we can also do partial evaluation, inlining, and things like that.
Right? So if we were to specialize this batch code, we would get the single-tuple code.
Now, the question is, which one is going to perform better? Right? The single-tuple code obviously has better instruction
cache locality, because it's much smaller. The batch version might have better data locality, because it's going to iterate over
these collections of data, with amortized loop costs. And the batch version can also profit from the pre-aggregation in the first
step, because for some queries we might reduce the domain to relatively few values and then make the
subsequent computation very cheap.
So we did an experiment comparing the single-tuple performance with the batching performance for the full TPC-H benchmark,
all 22 queries. What we did here is we measured the performance of single-tuple processing, and we measured the
performance of batch processing for different batch sizes; we varied batch sizes from 1 to 100,000. For each query we
picked the best batch size and the best throughput numbers that we got. Right? And then we normalized these numbers by
the single-tuple performance. That's what is shown on the X axis. And the Y axis shows the cumulative count of these
queries.
So let me explain this curve. What we can see from this graph is that for five queries, the relative throughput ratio is less than 1. That means single-tuple processing is better than batch processing for any batch size.
>>: For batch size 1, because you have [indiscernible].
Milos Nikokic: Yeah, for batch size 1 it's always better.
>>: Yeah.
Milos Nikokic: But even if you increase the batch size, that pre-aggregation step doesn't help you that much. You might be pre-aggregating over primary keys, so you won't reduce the size of the input batch, and you won't see any benefit. In fact, you will see worse performance.
For about half of our workload, we can see that single-tuple performance is at most 20 percent worse than batch processing; for other queries, the gap is more like 50 percent.
But notice also that this best batch throughput might come from a different batch size for each query. So if we had to fix just one batch size, these numbers might be even more in favor of single-tuple processing.
However, there are certain queries where batch processing can deliver very good performance. For four or five queries, we have seen improvements of over 4,000x. And this is mostly because of batch pre-aggregation, which allows us to reduce the domain to very few elements, such that the subsequent maintenance statements are relatively cheap.
Okay. What about distributed execution? Well, the goal of this part is to transform these local incremental programs into distributed programs that can run on large-scale processing platforms. And we do that for two reasons: we want to speed up the maintenance work, and we want to accommodate more data.
Now, one crucial difference between this work and classical query optimization is that we're trying to distribute these incremental programs. These programs consist of triggers, and each trigger consists of statements. And among these statements we have data-flow dependencies, like in any other program. Right? So for example, the delta in the first statement might be using views that are maintained in the second or the sixth statement. That means we cannot arbitrarily reorder them. So we have to be careful about how we interleave the execution of these statements in a distributed environment.
So our first attempt was actually to execute these trigger programs on a distributed system with an asynchronous execution model. And that turned out to be quite bad, because of the synchronization that we had to introduce for ensuring the consistency of these programs.
So our approach is to distribute these incremental programs using processing platforms with synchronous execution models; for example, to execute this program on top of Spark or Hadoop. So we assume we have one master node with a bunch of workers, and we represent the computation as a sequence of jobs. Now, in order to achieve this execution, we had to change our language and introduce location tags, to mark each expression with information about where it is located in the distributed environment.
So each expression can have one of three possible location tags. It can be marked as local, meaning that the expression is executed and the result is stored on the master node; it can be distributed by key, meaning that the whole materialized view is partitioned across all workers by a certain key; or it can be distributed randomly, which is usually the case with input batches.
We also introduce some communication primitives in order to achieve this distributed execution. These are well-known primitives that allow us to, for example, repartition one distributed collection, transform a distributed collection into a local one, or scatter a local collection across many workers.
So the way we do that, we start with a single local statement. For each materialized view in this expression tree, we annotate where its result is located. We assume that such information is provided from the outside; how it is chosen doesn't matter in our case.
So let's say that delta R, the input batch, is randomly partitioned across all workers, that the materialized view MS is partitioned by column B, and that the final result is stored on the master node. So how do we build this distributed statement? Well, we traverse this tree in a bottom-up fashion and introduce communication primitives where they are necessary in order to preserve the semantics of these query operations.
For instance, since we have to compute the join between those relations, they have to be partitioned on the same join key.
So that means that we have to repartition one of the input batches. And once they are partitioned by the same key, we can
perform local join operations followed by local aggregations, and then finally we can introduce one gather primitive to
collect the result on the master node.
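As an illustration, here is a toy Python model of executing one such statement as stages (the function names, the hash-partitioning scheme, and the four-worker "cluster" are my own assumptions, not DBToaster's implementation): repartition the randomly-distributed delta by the join key, do a per-worker join plus local aggregation against the co-partitioned view, then gather partials on the master.

```python
W = 4  # number of workers in the toy cluster

def repartition(parts, key_of):
    """Communication stage: repartition a randomly-distributed collection
    so that tuples with equal keys land on the same worker."""
    out = [[] for _ in range(W)]
    for part in parts:
        for t in part:
            out[hash(key_of(t)) % W].append(t)
    return out

def local_join_agg(delta_parts, view_parts):
    """Distributed stage: each worker joins its delta partition with its
    co-partitioned view partition and locally aggregates the products."""
    partials = []
    for part, view in zip(delta_parts, view_parts):
        s = 0
        for k, a in part:
            s += a * view.get(k, 0)
        partials.append(s)
    return partials

def gather(partials):
    """Communication stage: collect the per-worker partials on the master."""
    return sum(partials)
```

With the view partitioned by its key and the input batch repartitioned the same way, the three stages together compute the same aggregate a purely local statement would.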
So that way we can execute our local statement as a sequence of stages. These stages can be either distributed stages, happening in parallel on every worker, or local stages, happening only on the master node. We also have communication stages, where we need to ship data around in order to ensure the correct semantics of query execution.
Now, that is how we transform one statement. But these programs have multiple statements, so we can simply concatenate the transformed statements and call that the distributed program. From here it's relatively straightforward to generate code that would run on top of Spark. But this code is going to be very inefficient, because it requires all of this communication.
So in this phase, we try to reorder those stages and merge the stages that are compatible. For example, we can merge adjacent distributed stages, or we can merge local stages. And while reordering, we preserve the statement dependencies that exist in the input program. Once we have the optimized distributed program, we can generate code for any system that has a synchronous execution model. In our case we generate Spark code.
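The merging step can be sketched like this, under my own toy representation of a stage list (the real optimizer additionally reorders stages subject to the statement dependencies):

```python
def merge_stages(stages):
    """Merge adjacent compatible stages: distributed with distributed,
    local with local. Communication stages act as barriers and are
    never merged."""
    out = []
    for kind, ops in stages:
        if out and out[-1][0] == kind and kind != "comm":
            out[-1] = (kind, out[-1][1] + ops)   # fuse into previous stage
        else:
            out.append((kind, list(ops)))
    return out
```

Fusing two back-to-back distributed stages into one means one job instead of two on an engine like Spark, which is exactly where the scheduling and communication overhead comes from.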
So we did some experiments. In this case we have a stream synthesized from a 500-gigabyte TPC-H database, and the batch size is fixed at 200 million tuples. What we measured here is the time it takes to process one batch, with respect to different numbers of workers. And we have several queries here, starting from a very simple one, like query 6, which just requires a local pre-aggregation and one collection phase on the master node.
And as you can see, with our implementation on top of Spark, we can go as low as around 0.5 seconds for processing these 200 million tuples using, I think, 800 nodes.
Other queries, like query 3 and query 17, require two communication stages, and they also require a certain amount of data to be transferred. So in those two examples the latency is higher, around 4 seconds. And this mostly comes from the overhead of the underlying execution engine, in this case Spark. Since we have to execute several jobs in order to maintain this query, the internal overheads of Spark are actually causing this latency. And for some queries, like query 7, where we have more than two communication stages, and in those communication stages we shuffle a lot of data, the performance really suffers: it's around 32 seconds and flattens out at 200 nodes. And basically this is something that we cannot improve using Spark, because the latency is mostly affected by the garbage collection happening on these worker nodes, and we don't have a way to change that. So building a specialized system for these kinds of workloads would really help us here.
Okay. With that I would like to conclude this first part of the talk. And then I'll try to jump to the second part.
So in the second part, we're going to talk about incremental processing of machine learning and linear algebra programs. Many existing machine learning algorithms can be described as iterative linear algebra programs. For instance, if you take the ordinary least squares method, which is a common way to fit a linear model to data, it can be expressed as a sequence of operations over matrices.
Now, in this case, the operations involve classical matrix multiplication, inversion, and transpose. The problem with this kind of analytics is that we have to work with increasingly large datasets, which makes recomputation expensive. And this data is changing, usually through small changes compared to the overall dataset size.
So we built a system called LINVIEW that performs incremental processing of iterative linear algebra programs. As input, this system takes APL-style programs, like MATLAB or Octave programs, and transforms them into incremental engines that efficiently maintain the query result.
More precisely, we consider a language consisting of standard matrix operations: multiplication, addition, transpose, and matrix inversion. And this language is expressive enough to capture many machine learning algorithms.
So let's see what it means to perform incremental processing in the context of linear algebra. Here we have an example of matrix power computation, where we want to compute the fourth power of a given matrix A, which is a very common task in many realtime applications; for instance, in graph processing, to compute reachability between nodes, or in PageRank computation.
So we express this computation as a sequence of two statements, each involving a full matrix multiplication. In all of these cases we consider dense matrices, and the complexity of each matrix multiplication is, with some simplification, N cubed. Now, when the input matrix changes, let's say one element changes, we can recompute everything from scratch, but that might be costly, especially if the updates are small. What we want to do instead is to derive an incremental program which is hopefully simpler, and which tells us how to compute only the delta change in the result more efficiently.
Now, the incremental approach is beneficial for linear algebra programs that involve matrix-matrix operations, where the complexity of the original program is N cubed. For those kinds of programs, we're going to generate incremental programs that exploit the sparsity of the input to achieve better asymptotic behavior.
Now, the questions are: how are we going to derive these incremental programs, how are we going to evaluate these delta expressions, and how are we going to represent them?
Now, for delta derivation, we rely on the basic properties of matrix operations. For instance, if we want to incrementalize this matrix multiplication, we rely on the distributivity of matrix multiplication with respect to addition to derive a delta B that consists of the sum of three terms. This is relatively straightforward.
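For the matrix-power example B = A*A, distributivity gives delta B = dA*A + A*dA + dA*dA. A small pure-Python check of that identity (illustrative only; function names are mine):

```python
def matmul(X, Y):
    """Dense n x n matrix product."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matadd(*Ms):
    """Elementwise sum of several equally-sized matrices."""
    n = len(Ms[0])
    return [[sum(M[i][j] for M in Ms) for j in range(n)] for i in range(n)]

def delta_square(A, dA):
    """Delta of B = A*A under the update A -> A + dA:
    (A+dA)(A+dA) - A*A = dA*A + A*dA + dA*dA."""
    return matadd(matmul(dA, A), matmul(A, dA), matmul(dA, dA))
```

Adding delta_square(A, dA) to the old B reproduces the recomputed result exactly, which is what makes the derivation sound.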
Now, how are we going to evaluate those kinds of deltas? Suppose that the input delta consists of a single entry change. In order to compute delta B, we evaluate this sum of three expressions, and the final output will contain changes in one row and one column. With these blue markers, I denote entries that might potentially be nonzero entries in the delta matrices. So we can see that for a single entry change in the input matrix, the result, delta B, has potential changes in one row and one column.
Now, if we push this delta to the second computation, the one that computes delta C, we can see that the whole output is going to change. Well, that causes a problem. First, incremental evaluation now has a lower bound of N squared, since we have to change all the elements, so we cannot go below that. Second, if we want to use this delta C in subsequent delta evaluations, we're going to end up doing full matrix multiplications. So very quickly, in just two iteration steps, incremental evaluation loses its benefit over reevaluation. We call this the avalanche effect: a single local change has exploded in just two steps and contaminated the whole output.
Now, how are we going to deal with this? Well, we have just seen that representing deltas as plain matrices is not going to work. The main insight of our work is that although these delta matrices might have nonzero elements everywhere, they often have simple structure, meaning that they can be decomposed and represented as a product of two small matrices that have low rank. For instance, if you take this input delta, which has only one element changed, we can compress and represent the same change as a vector outer product, a product of two vectors.
And I'm going to argue that this factored representation of deltas can help us achieve efficient evaluation.
Let's see how we can do that. Suppose that the given input change delta A is an outer product of two vectors, a so-called rank-1 update. Now, when we compute delta B, which we do in the same way as before, instead of first computing the outer product of those two vectors and then multiplying with the matrix, which would cost us N-cubed complexity, we want to exploit the associativity of matrix operations and first perform the multiplication between the matrix and the vector. Once we have that, we can represent delta B as a sum of two outer products.
Now, if we just wanted to update B, we could apply this delta directly. But if we want to propagate delta B further down the expression, we always keep this factored representation as a sum of outer products, which allows us to express delta C as a sum of four outer products.
What that brings us is that we can now do delta evaluation using only N-squared operations. There are no matrix-matrix multiplications involved in this incremental program anymore.
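Here is a minimal Python sketch of this factored evaluation, in my own formulation of what the talk describes: for dA = u * v^T, associativity gives dB = u (v^T A) + (A u) v^T + (v^T u) u v^T, and folding the third term into the first yields a sum of two outer products, computed with only O(N^2) work.

```python
def matvec(A, x):
    """A x: matrix-vector product, O(n^2)."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

def vecmat(x, A):
    """x^T A, returned as a plain vector, O(n^2)."""
    return [sum(x[i] * A[i][j] for i in range(len(x)))
            for j in range(len(A[0]))]

def delta_square_rank1(A, u, v):
    """Factored delta of B = A*A for dA = u v^T, returned as a list of
    (column, row) outer-product factors. No matrix-matrix products."""
    vA = vecmat(v, A)                              # v^T A
    Au = matvec(A, u)                              # A u
    c = sum(vi * ui for vi, ui in zip(v, u))       # v^T u (scalar)
    row1 = [vA[j] + c * v[j] for j in range(len(v))]  # fold dA*dA into term 1
    return [(u, row1), (Au, list(v))]              # dB = u@row1 + Au@v
```

Expanding the two factors back to a dense matrix reproduces exactly the delta you would get from a full recomputation, but propagating them keeps every step quadratic.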
We can also support general iterative models. As you can see, with every delta propagation step, the number of outer products in the delta sum is actually increasing. But that's not a problem, because many practical algorithms converge in very few iterations. For example, [indiscernible] converges in less than thirty iterations.
So how do we incrementalize iterative models? Well, we start from an initial state, and we have an iterative function which creates a sequence of states. Now, when the input changes, we can recompute all the states again, or we can do incremental evaluation: we materialize every intermediate state, then derive delta queries and use them to update the intermediate states of this iterative program.
So just to give you a short view of our results. Here I'm showing different use cases of complex analytics and their complexity, in terms of big-O notation, for reevaluation and for incremental maintenance, which is our approach. As you can see, in all of these cases we managed to bring the complexity down from N cubed to N squared, just by using this factored representation and completely avoiding matrix-matrix multiplications.
The only case where we cannot do better than reevaluation is when the input program doesn't involve any matrix-matrix multiplication.
>>: [Indiscernible].
Milos Nikokic: Yes.
>>: So can you build a model and incrementally make it using the same [indiscernible]?
Milos Nikokic: Yes, this is a general model. For example, you can fit a linear regression model into this, express it as an iterative program; or any other program that you can compose using these basic matrix operations, as long as it doesn't involve only vectors. Because a matrix-vector operation is N squared and we cannot beat that; that's the lower bound. So if your input algorithm involves matrix-matrix operations, yes, you can incrementalize it in this way.
Okay. So how well does it perform in practice? Here we measure the performance on top of Spark for up to 100 nodes, for the matrix power computation, computing the 16th power of a given matrix A, where the size of A is fixed. And what we measure is how much time it takes to process one rank-1 update; that means, for example, changes in one row or changes in one column. As you can see, for reevaluation, as you add more nodes, the time it takes to reevaluate is obviously going down. But it's still an order of magnitude slower than our approach.
And in our case, it seems like incremental view maintenance doesn't scale, which is not true. For incremental view maintenance, the input problem size is actually too small, and all of the numbers that you see below are just communication overheads from the underlying engine, Spark.
We also fixed the number of workers at 100, varied the dimension size, and measured how much time it takes to process one update. As you can see, the cost of reevaluation grows very steeply, because it requires reshuffling the whole matrix. In our case we're just reshuffling vectors, which is a linear-time operation, so we can handle much larger matrices.
Okay. With that I would like to conclude. Today I presented two systems that I worked on. The first one is DBToaster, which is a SQL compiler that produces lightweight stream processing engines tailored for a given query workload. It is based on recursive compilation, and it supports single-tuple and batch execution in both local and distributed environments.
In practice, DBToaster can achieve up to five orders of magnitude better performance than other commercial systems. And DBToaster is actually released: you can go to our website, download it, and play with it.
The second part of this talk focused on incremental evaluation of linear algebra programs, where the main challenge is how to contain the avalanche effect, where a single entry change can quickly contaminate the whole output. To deal with this problem, we introduced the factored delta representation, which allows us to contain the avalanche effect and achieve asymptotic improvements compared to reevaluation.
Okay. Just a short word about some possible future work. We have seen that sensors are generating enormous amounts of data, and this data is usually shipped to Cloud storage, where it's processed by data practitioners. Soon enough this approach won't be able to scale and won't be able to support the amounts of data that are being increasingly generated by sensors. So what we want to achieve here is to push the computation that is happening right now inside the Cloud all the way to the edges of these sensor networks, and perform both relational operations and domain-specific operations inside the sensors, before shipping the aggregated result, the valuable insight, back to Cloud storage.
And the work that I've done together with Jonathan and [indiscernible] here is one step in that direction, where we tried to introduce DSP operations inside a stream processing engine. And it turns out that this can be very successful; we can leverage existing algorithms for implementing this system.
Now, the challenge is how we do similar things in other domains, like numerical computations, approximations, statistical models, and so on. So I believe this is one challenging direction that we can take.
Okay. With that, I would like to thank my collaborators from EPFL, with whom I worked on these two projects, and also Jonathan and Chris, with whom I worked on Trill.
And yeah, I'm glad to take any questions that you have.
[Applause]
>>: So coming back to the first part of the talk, DBToaster. So you were using [indiscernible] or processing there [indiscernible]?
Milos Nikokic: No, DBToaster is just in single node.
>>: I see. So what are the bottlenecks? Was it the memory, or algorithmic, or was it the hash table [indiscernible]?
Milos Nikokic: Yes.
>>: For the kind of [indiscernible] that you were looking at?
Milos Nikokic: Yeah, so it was the memory bottleneck. We don't have very good cache performance, mostly because it's a streaming workload and random lookups keep clearing out our cache. We do have good instruction cache locality, which is very important, because it's very hard to mask those kinds of cache misses. For data locality, in these experimental scenarios we are doing our best, and our data structures are optimized for this kind of workload. But we still have something like 40 percent cache misses, just because we have streaming data, which keeps clearing out our caches.
>>: So the numbers that you project for the [indiscernible], a million tuples per second, [indiscernible] does that translate directly to around a million hash lookups? Because presumably on a machine you can get about a million hash lookups per second.
Milos Nikokic: We do have even better results than that. For some queries I think we reach around thirty million tuples per second. Yes, that would translate into hash lookups; we're doing at least that many. And in some cases, because we have multiple trigger statements, we are doing even more. Right.
>>: So more joins means more hash lookups.
Milos Nikokic: Well, we try to avoid hash joins. We try to avoid joins, because this recursive compilation often enough completely eliminates joins from our query processing. All we're doing is iterating over some maps and updating the result. Often enough we don't do any join processing at all; not in all cases, but that's basically the idea: with this recursive compilation, you can get rid of these classical database operators.
>>: [Indiscernible] that you're looking at. So that's the only one [indiscernible].
Milos Nikokic: Yes. Yeah, and our internal implementation is an in-memory hash join.
>>: So what you have done is you create a [indiscernible] join and the application and [indiscernible]?
>>: [Indiscernible].
Milos Nikokic: Yeah, these views are already pre-aggregated. So in many of these cases, you don't need to do any pre-aggregation on top of that; you just do a lookup and update the result.
Jonathan Goldstein: Do we have any more questions for Milos?
Okay, then I would like to thank our speaker.
Milos Nikokic: Thank you.
[Applause]