Jonathan Goldstein: Hello. My name is Jonathan Goldstein, and I would like to introduce Milos Nikolic, who is visiting us from EPFL in Switzerland. He is going to be talking to us about high-performance incremental query processing. He was an intern here last summer. So without further ado, I'd like to introduce Milos. Milos Nikolic: Thank you, Jonathan. Okay. Before we start, let me just say a few words about me. I'm a sixth-year Ph.D. student in the [indiscernible] at EPFL working on ->>: [Indiscernible]. Milos Nikolic: It's on. It is on. Jonathan Goldstein: Oh, it is on. Milos Nikolic: Okay. Yeah, so I'm a sixth-year Ph.D. student at EPFL. I work in the data lab under the supervision of Christoph Koch. And as Jonathan said, I spent last summer here working on some exciting stuff: adding support for DSP operations in Trill. And, yeah, I would like to thank Jonathan and Chris for the great summer that I had, and I would also like to thank you for this opportunity to present my work. So my research focus is mostly on data management systems; more specifically, on stream processing and incremental computation. And today I'm going to talk about how to build efficient incremental query processing engines for complex analytical queries. So the world is increasingly going digital, with data being generated from many different sources: for example, sensor networks in the context of the Internet of Things, social networks, or platforms for mobile devices. And with such massive amounts of data being generated online, it is increasingly important for companies to react quickly to events that can affect their business decisions. So many of today's realtime applications require realtime analytics over large and dynamic datasets.
For instance, [indiscernible] analytics, sensor network analytics, or high-performance Cloud monitoring are all instances of applications that require some sort of realtime analysis. And what is important for these applications is that they monitor some sort of state of the underlying data. And this state is usually defined as views or aggregate representations of the underlying data. So at a very high level, the general model of these applications can be described as follows. As input, these systems receive a stream of continuously arriving data. This could be software updates, or this could be sensor readings or user clicks. These events are processed by a runtime engine that continuously evaluates certain views over this data and produces results for the decision support system, which takes corresponding actions. For these applications, data changes at very high rates. So it is of crucial importance for users to have fresh views of the data in order to be able to promptly react to certain conditions. So it is of key importance for these runtime engines to provide answers in a timely manner, with low latency. An additional requirement is that users should be able to express complex analytical queries, in order to discover insights from the data using complex analysis. And finally, big data applications require scalable processing, in order to speed up query evaluation or to handle growing memory requirements. So which of the existing systems can we use to build such runtime engines? When we talk about streams of data, the first thing that comes to mind is stream processing systems. So typical classical stream processing systems process a stream of events using continuous queries over windows of data, where a window is just a finite chunk of the stream.
As the windows advance over the input stream, the engine recomputes the results for each window. Now, stream processing engines typically provide low-latency processing of data. However, they often lack support for expressing complex analytical queries; for instance, queries with nested aggregates. In addition, since these engines work over overlapping subsets of data, they're often based on the notion of incremental processing. So the idea of incremental processing is the following. We are given a database and a stored materialized representation of some computation. Now, when the database changes, our goal is to update the final result. However, we don't want to recompute everything from scratch, because that might be costly. Instead, what we would like to do is derive a so-called delta computation, which is usually structurally simpler and works with less data, and which allows us to express what has to be changed in the output given the current state of the database and the incoming update. Now, the idea of incremental processing has been successfully applied in many commercial database systems in the form of incremental view maintenance. Here, for a given materialized view defined as a SQL query over a set of base relations, the systems derive delta queries for updates to each base relation and execute those delta queries upon every change of the database. And the execution of these delta queries is done using the classical query evaluation engine. So both stream processing systems and database systems have some sort of incremental processing in them. So let's see how well they perform in practice. So we ran a benchmark here where we compare a commercial database system and a commercial stream processing system on a set of TPC-H queries; TPC-H is a well-known benchmark for evaluating database systems.
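The delta idea described above can be sketched with a toy example. This is my own illustration, not DBToaster code: it maintains a SUM aggregate incrementally, contrasting constant-time delta updates with naive re-evaluation.

```python
# Toy sketch of incremental (delta) maintenance vs. re-evaluation for
# a materialized SUM(price) over a table of prices.

table = []          # the "database": a list of prices
total = 0.0         # materialized result of SUM(price)

def recompute():
    # Naive re-evaluation: cost grows with the size of the table.
    return sum(table)

def apply_delta(price, is_insert=True):
    # Delta computation: constant work per update, independent of |table|.
    global total
    if is_insert:
        table.append(price)
        total += price
    else:
        table.remove(price)
        total -= price

apply_delta(10.0)
apply_delta(25.0)
apply_delta(10.0, is_insert=False)   # a deletion
assert total == recompute() == 25.0
```

The point is that the delta (`total += price`) is structurally simpler than the query it maintains, which is exactly what incremental view maintenance exploits.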
And on the X axis we have the set of TPC-H queries, ranging from queries that have only multiple joins all the way to queries that have nested aggregates in them. What we measure here is the single-tuple performance, meaning that after each insertion or deletion into the base relations, we're doing our best at keeping the result fresh. So we have a stream of events; in this case the stream of events is generated from a TPC-H database and consists of interleaved insertions and deletions. So the performance of these systems is not so great: both the database system and the stream processing engine process roughly up to 100 events per second. That means after each event we maintain the fresh result. Are these numbers good enough for realtime applications? Well, one can say probably not, because we are seeing much higher velocity in the data nowadays. So in the first part of this talk, I'm going to present a system called DBToaster, which combines incremental processing with code generation to produce efficient, lightweight stream processing engines that are tailored for the given query workload. And as you can see from this performance graph, DBToaster can achieve up to five orders of magnitude better performance than these commercial database and stream processing systems. >>: [Indiscernible] it's only 10 to 100 per second. Why is it that [indiscernible]. Milos Nikolic: Sure. For database systems, basically, they're not made for this kind of workload where you have frequent insertions and deletions and you want to maintain a fresh result. And in fact, in many cases incremental view maintenance in database systems was often slower than complete re-evaluation.
Because of all sorts of overheads that come from components that are not involved in query processing, like concurrency control, [indiscernible] management, buffer management, things like that. For the stream processing engine, the one that we used in our benchmark, basically we were not able to exploit its incremental mechanisms, because they were tied to window semantics, which is not what this kind of application needs. Here we have long-lived data that cannot be captured by a single window. So what we had to do is internally maintain in-memory tables and then do incremental re-evaluation as much as we could. >>: So how big a system were you running these tests on? Milos Nikolic: So this was a single-node experiment. The test data was not that big. It was actually a stream of insertions and deletions synthesized from a 1-gigabyte TPC-H database. And in many cases, the DBX and stream processing systems were not able to process the whole stream within a one-hour time limit. >>: [Indiscernible] hardware platform was, were you using a multicore machine? Milos Nikolic: For DBToaster, it's single-core execution: single-threaded, single-core. For the database systems, it was multicore, I think, but I'm not sure it was actually using that. So for DBToaster, it's for sure single-threaded, single-node performance. Okay. So the goal of my Ph.D. thesis was how to build efficient incremental processing engines for complex analytical queries. And in this talk I'm going to focus on two types of analytical queries. First, I'm going to talk about traditional SQL analytics, which are widely used in OLAP systems. These analytics typically involve aggregates over multiple dimensions of the underlying data, or might involve queries with nested aggregates. In the second part of this talk, I'm going to talk about more complex analytics, which can be expressed as linear algebra programs.
And these analytics allow us to actually discover insights from the database using algorithms and models from machine learning, scientific computing, or data mining. So I will start by talking about how to incrementalize SQL queries. So DBToaster takes as input a SQL query and produces a lightweight stream processing engine that efficiently maintains the query result upon every change of the database. The compiler consists of two parts. The front-end part transforms SQL queries into maintenance programs, where each maintenance program consists of trigger functions describing what needs to be done in the database on every update of the base relations. Yes? >>: So what class of queries can you handle? Milos Nikolic: For incremental view maintenance? So, efficiently: all flat queries, and nested queries where the outer query has an equality correlation with the nested query, right? Those we can handle efficiently. In all other cases, we can fall back to re-evaluation. And DBToaster as a whole can support almost the full SQL. We don't support nulls, and we don't support order by, sorting. But everything else should be in there. >>: [Indiscernible]. Milos Nikolic: Yeah, we can handle that. I might defer the question of how to handle these nested aggregates to the Q&A session. But we can do that. >>: Those things you cannot handle, how many of them are intrinsic to your approach? Like it seems that the nulls you can probably add. Milos Nikolic: Sorting, well, that's going to be costly, right. Oh, there's one more thing: we cannot handle min and max efficiently. For the moment we're rewriting those as nested queries, like "there is no other element that is greater than this one". But in general, for min and max, we would have to maintain basically the whole base relation and then incrementally maintain it.
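The difficulty with min/max under deletions can be sketched as follows. This is my own illustration, not DBToaster code: when the current minimum is deleted, the new minimum can only be recovered from the remaining values, so the whole multiset of base values has to be retained.

```python
# Sketch of why MIN is hard to maintain incrementally under deletions:
# the full multiset of values must be stored, unlike SUM or COUNT.

from collections import Counter

values = Counter()   # multiset of base values

def insert(v):
    values[v] += 1

def delete(v):
    values[v] -= 1
    if values[v] == 0:
        del values[v]

def current_min():
    # O(n) scan over retained values. An insert-only workload could
    # instead keep a single running minimum in constant space.
    return min(values) if values else None

insert(5); insert(3); insert(7)
assert current_min() == 3
delete(3)                  # the minimum disappears...
assert current_min() == 5  # ...and must be recovered from stored values
```

This also illustrates the special case mentioned next: with insertions only, a constant-space running minimum suffices, which is why that case is "not that interesting".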
There might be some special cases if you have just insertions, but that's not that interesting. Okay. So the front-end part is the one that transforms SQL queries into maintenance triggers. And the back-end part takes those maintenance triggers and generates native code. For the moment we support C++ and Scala code generation for single-node execution, and Spark for distributed execution. Okay. This generated code is self-contained. It doesn't require any database sitting behind it. It can be easily embedded into these kinds of realtime applications, because we're just producing source code that consists of trigger functions. Or it can be used in other ways, at the user's convenience. So how do we go from queries to maintenance programs using classical incremental view maintenance? Well, typically, we have the following steps. We're given a query, and the first thing we want to do is materialize the result of that query. Then we can derive delta queries for updates to base relations. Once we have those delta queries we can optimize them, and once we optimize them we can produce trigger programs. And everything is done at compile time. >>: Question. Do you do this at [indiscernible] time or do you batch? Milos Nikolic: We can do both. And I'll have some discussion about the tradeoffs. So this is classical incremental view maintenance. Now, the problem is these delta queries. Although they are simpler, and for single-tuple updates they have one join less and work with less data, they're not free. You still need to execute them, right? If you have an n-way join, a delta query still needs to join together n-1 relations. So the key insight of DBToaster is to repeat this procedure recursively. Once we have these delta queries, we can materialize the update-independent part of each delta query, and then perform incremental view maintenance for that delta query, creating higher-order delta queries.
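The "delta queries are not free" point can be sketched concretely. Assuming a query of the shape SELECT SUM(R.A * S.C) FROM R, S WHERE R.B = S.B (a hypothetical instance matching the two-way join example used later), a classical first-order delta for an insert into R avoids re-joining R but still has to scan the stored copy of S. This is my own illustration, not DBToaster output:

```python
# Sketch of classical first-order IVM for
#   SELECT SUM(R.A * S.C) FROM R, S WHERE R.B = S.B
# The delta for an insert into R still pays a join against stored S.

S = []   # stored copy of relation S: list of (b, c) tuples
Q = 0    # materialized query result

def on_insert_R(a, b):
    # First-order delta: a * SUM(S.C WHERE S.B = b) -- O(|S|) work.
    global Q
    Q += a * sum(c for (b2, c) in S if b2 == b)

def on_insert_S(b, c):
    S.append((b, c))
    # The symmetric delta for Q would need a stored copy of R as well
    # (omitted here to keep the sketch short).

on_insert_S(1, 10)
on_insert_S(1, 20)
on_insert_R(2, 1)   # delta = 2 * (10 + 20) = 60
assert Q == 60
```

Recursive compilation, shown in the walkthrough that follows, replaces this linear scan with a lookup into a pre-aggregated auxiliary view.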
And with each iteration step, these delta queries become simpler and simpler, until we get a delta that doesn't depend on the database anymore; at that point we can stop. So the overall idea of this recursive approach is to create a hierarchy of these materialized views that maintain each other. Right. And at first it's going to appear counterintuitive: okay, we are storing more data; how is that going to help us achieve better view maintenance, right? But the key insight is that these additional auxiliary views are actually helping us to significantly simplify the maintenance job that we have to do for each of these queries. So how do we do that? We start from the top. We have our query, and then for each base relation we compute first-order deltas. And then for each delta we materialize the update-independent part of it. And then for each materialized delta query, we're going to compute the second-order deltas for its base relations. And we continue with this procedure recursively until we reach the zero deltas at the end. Now, we can use these deltas to maintain the parent views. So the second-order delta might maintain the first-order delta, which might maintain the final query. So that way we create a hierarchy of dependencies between those materialized views. And all of those materialized views are self-maintainable, meaning there is no need for some additional database system to help us maintain these queries. So let's see how this compilation procedure works on a simple example. So here we have a simple two-way join query, which joins together two relations, R and S, and has an aggregate on top of that. So in this particular case, what we want to produce are two trigger functions that will tell us how we're going to maintain the final result upon every insertion into one of these base relations.
For the moment we're going to consider just insertions, but we can similarly construct the delta programs for deletions or for arbitrary updates. So let's see how we're going to do that. The result of this query is just a single value, so we're going to materialize this single value using a variable Q as the first step of our procedure. And then we consider a new update to the relation R. The new result for this query is exactly the old query where the relation R is replaced by a union of R and delta R. I'm slightly abusing the SQL syntax here, but hopefully you understand the point. However, we don't want to re-evaluate this query from scratch. We would like to derive a delta query which will allow us to reuse the previous result. In this case the delta query is much simpler: it's the original query where R is replaced by delta R. Now, we're going to use this delta query to incrementally update the final result. Now, since this input change is very simple, we can actually exploit that simplicity to replace these variables A and B that come from delta R with the actual constants delta A and delta B. Once we do that, we don't have the join inside this delta query. Now, since delta A doesn't depend on S, we can use the distributivity of addition to pull this constant out in front of the delta query. What we have now is a delta query that depends only on the relation S. But it has this input variable delta B, which we are going to get rid of by rewriting this query using [indiscernible], followed by a lookup into this materialized result. Now we have reached the end of the first cycle, where we are going to recursively repeat this procedure. We have this delta query which depends only on S, right, and we're going to materialize it, using a hash map for storing the materialized content. >>: So if [indiscernible] key, wouldn't you be materializing a lot, because it's linear in the dataset?
If you were to join on the keys, then this group-by would be linear in the dataset. Milos Nikolic: Yes. >>: So you would still materialize it, or do you have some [indiscernible]. Milos Nikolic: Of course. Yes. And we'll see why. Once we materialize this, now, supposing we have this materialized result, we can rewrite the delta query as a simple lookup operation into this materialized view, this materialized map, followed by a multiplication. That is the delta query that we are going to use to update the final result. However, this new view that we just materialized needs to be maintained, right? So on every update to S, we need to compute a delta query for this materialized view. And in this case the delta query, because we are considering only single-tuple updates, is just a simple constant, delta C, which is going to update only one element of the materialized view. Okay. Now, we have created one additional materialized view, and we have created some statements in our trigger programs. Now we can proceed again from the top-level query, considering updates to relation S. If we do that, we are going to, in a similar way, materialize one additional view over relation R and add two new update statements. And that's it. That's the whole compilation procedure, and these are the trigger programs that we are generating. Notice the following. We materialized two additional views, right, but these views are not the whole base relations. These are group-bys over certain columns that are used in that query, right. So we project away all unused columns and aggregate over the ones that are used. So in case B is a primary key, right, we cannot reduce the number of tuples, but we can project away the A column because we don't need it for this maintenance. So the important thing about these generated programs is that the triggers run in constant time.
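The complete compiled program for this walkthrough can be sketched as follows. This assumes the running example is of the shape SELECT SUM(R.A * S.C) FROM R, S WHERE R.B = S.B (a hypothetical instance consistent with the delta A, delta B, delta C derivation above); MR and MS are the two auxiliary group-by-B views, and the sketch is my own, not actual generated code:

```python
# Sketch of constant-time triggers produced by recursive compilation for
#   SELECT SUM(R.A * S.C) FROM R, S WHERE R.B = S.B

from collections import defaultdict

Q = 0                    # materialized top-level result
MR = defaultdict(int)    # auxiliary view: SUM(A) of R, grouped by B
MS = defaultdict(int)    # auxiliary view: SUM(C) of S, grouped by B

def on_insert_R(a, b):
    # Constant work: one lookup into MS plus two in-place updates.
    global Q
    Q += a * MS[b]
    MR[b] += a

def on_insert_S(b, c):
    # Symmetric: one lookup into MR plus two in-place updates.
    global Q
    Q += c * MR[b]
    MS[b] += c

on_insert_S(1, 10)
on_insert_R(2, 1)    # delta = 2 * MS[1] = 20
on_insert_S(1, 5)    # delta = 5 * MR[1] = 10
assert Q == 30       # matches SUM over the join: 2*10 + 2*5
```

Notice that neither trigger touches a base relation: each view in the hierarchy is maintained by a constant-size delta, which is the constant-time property claimed above.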
In this particular example, if we have updates to relation R or S, we just need to perform a constant amount of work in order to update the result Q and all our materialized views. Now, in general, our trigger programs might not have a constant running time. But in many cases, this approach provides benefits that cannot be achieved through other incremental view maintenance techniques. For example, classical view maintenance techniques cannot incrementally maintain the result of this query with a constant amount of work. Okay, once we have these trigger programs, the second phase of our compiler is the actual code generation phase. And here we're going to use specialized data structures for storing the contents of these materialized views. In our case, since these maps represent group-by queries, we're going to use a hash map with memory-pooled records for storing the contents of these two additional views. And we are going to specialize the type of the final result to be a primitive type. Now, for generating the actual trigger programs, we just need to transform these maintenance statements into actual code. In this particular case it might seem straightforward, because they are very simple, but in general the actual code generation might also involve some join processing. Okay. So how does this perform in practice? Well, this is the graph that we have already seen, except that we have another column here, which is DBToaster with classical incremental view maintenance. So basically, with that additional figure we want to measure the impact of our recursive incremental compilation algorithm. So as you can see, just by using DBToaster's code generation we can get one to two orders of magnitude better performance than the existing commercial database systems. But if we also use this recursive incremental view maintenance approach, we can get up to five orders of magnitude better performance on average than these systems.
Yeah? >>: [Indiscernible]? Milos Nikolic: So, the memory overhead of using DBToaster. Obviously there is some overhead, and it pretty much depends on the query that you are targeting. But let's say, considering a typical star-schema query, none of the materialized views will have more tuples than the fact table. That's for one. And in practice, they will usually have much fewer tuples. Why? Because, first, we are going to inline the static conditions of a given query, whenever possible, to filter out tuples that don't match those static conditions. The other thing is, we are going to project away columns that are not used in that particular query workload, so we don't need to store the whole base relation. And the third one: these materialized views are aggregate representations of base tables. So if we aggregate over some columns with a very small domain, right, that might have a very small memory [indiscernible]. But in the case of, say, TPC-H queries, it's usually much lower than the size of the base relations. >>: Couple of questions. So one is, so the idea that you can materialize additional views for the sole purpose of speeding up and pretty much maintaining [indiscernible]. This is one idea that you have. Did I misunderstand? Milos Nikolic: No. >>: So a few years ago there was a series of work done by [indiscernible] and others which essentially puts forward the same idea. So can you comment on the [indiscernible]. Milos Nikolic: Sure. So the problem of view selection, and of using these views to answer some other queries, right, is nothing new; it's not something that was invented by DBToaster. It has been done before. What's new in DBToaster is this recursive incremental view maintenance, where we have a way of transforming one query into a hierarchy of delta queries which maintain each other's results. Right.
Yes, that might also have been possible to achieve before, right. But there's also a nice theoretical result, which was not done by me, right, which tells us that if you use this recursive view maintenance technique, then for every materialized view that you have in your system, for every value inside, you just need to perform a constant amount of work to update that value. So that means that this hierarchy of materialized views is essentially embarrassingly parallel. Right. Whether you can exploit that in practice, that's another question. But it's a nice theoretical result that separates classical incremental view maintenance from recursive incremental maintenance. >>: The question, the speedup that you get, can you break down how much you get from the CodeGen versus -- Milos Nikolic: Yeah, so that's the DBToaster with IVM. So DBT-IVM, that's just code generation with classical incremental view maintenance. And DBToaster is code generation with recursive incremental view maintenance. So as you can see from this graph, using code generation and classical incremental view maintenance, you can get one to two orders of magnitude better performance. But the real benefit comes from using this recursive incremental approach, where you can get up to five. >>: [Indiscernible] Milos Nikolic: So, can we use outer joins? We don't support nulls at the moment. So we actually had to change some of these queries to replace outer joins with inner joins, and to remove order-by clauses, because we don't support sorting. >>: [Indiscernible]. Milos Nikolic: It is a problem, because you would have to maintain the whole, it's like min and max: you would have to maintain the whole history of the base values. Right. Which we don't support for the moment. >>: [Indiscernible] Milos Nikolic: We have the full TPC-H benchmark, although we modified some of the queries, of course, right, to avoid these restrictions.
But we don't support them. So we have results for TPC-H and also for some TPC-DS queries. >>: So do you have a graph of the memory utilization for these systems? Milos Nikolic: No, I don't have a slide. We have it in the paper. >>: What does it look like? Milos Nikolic: Well, it depends on the query, right. For some queries it can stabilize, because you materialize the view and you have seen all distinct values of your keys, right. So whatever you aggregate on top of that, it won't increase your memory consumption. But there are queries for which, as you keep seeing new data, new keys, new primary keys, the memory consumption will grow, right. But that's the same case as with other systems. So as for the relative ratio, I would say that for this kind of workload, the size of these materialized views is usually much smaller than the size of the base relations, just because we project away columns. We keep only the minimum amount of data that we need to maintain those queries. Okay. So, yeah, one can see this is for single-tuple processing. What about batch processing? Database systems are much better at doing that. So we did an experiment where we used a commercial database system and varied the size of the input batch that we are processing, and we compared incremental view maintenance [indiscernible]. We can see that as we increase the batch size, obviously, as expected, the number of processed tuples per second also goes up. However, even for very large batch sizes, these performance numbers are still around 300X slower than DBToaster with single-tuple processing for this particular query. Other queries might differ, but there is still at least one or two orders of magnitude better performance with DBToaster. Now, the question is: can we use batching inside DBToaster and get similar performance improvements? Well, the answer is yes and no. So there is a traditional belief that batching will always improve performance.
And that's true in the sense that it might give better cache locality, and it might also amortize the invocation cost of these trigger functions. And batching is also crucial for distributed execution, because we want to amortize network communication costs. So how do we support batch processing inside DBToaster? So we're again considering the same query, right: a simple two-way join with an aggregate on top. For that query, we're going to materialize the result in a variable Q, and then we're going to derive delta queries for updates to relation R. Right, the delta query is simple enough: we just replace R with delta R. Then we can do some traditional query processing optimizations, like pushing the aggregation down the expression tree in order to minimize the amount of data being joined. And in this case we can exploit the distributivity of this sum operation to push the sum down the expression tree and compute aggregates over delta R and S. And if you notice, the upper right part of this join tree is exactly the materialized view that we also had for single-tuple execution. So we're going to materialize this query, and we're going to repeat the same procedure for updates to the other relation, S. And in this procedure we're going to materialize an aggregate over the relation R. Now we have two additional maps, like before, right: one of them is over S, one of them is over R, and the final result is in Q. So in a trigger for updates to R, we need to maintain Q and MR. The delta query for this MR is a simple aggregate over the delta R input batch. And if you notice, that is exactly the same expression that is used for maintaining Q. So we can exploit some commonalities between those delta expressions to actually share some computation and evaluate these delta expressions efficiently. So in general, if you were to write a trigger program for this query, it consists of three steps.
First, we're going to pre-aggregate the input batch that we receive. Second, we are going to use that pre-aggregated batch to compute the delta of Q. And finally, we are going to use the pre-aggregated batch to update the materialized view. These are the three basic steps. You don't need to understand them in detail, but the main point here is that we can pre-aggregate the input batch, exploiting the fact that we have a collection of tuples; we can try to reduce its size as a first step, and then use this aggregate representation in the further computation of our delta programs. Now, how are we going to translate this program into C++? Well, for the pre-aggregation, we have to use intermediate collections, right. That involves looping over the input batch and aggregating the results. For updating Q, we're going to evaluate this join as an in-memory hash join, where again we're going to loop over this aggregate representation of the batch. And for updating MR, it's just looping over this pre-aggregated batch and updating the materialized view. Again, there's no need to understand this code, but this is what a trigger program would look like in C++ if you were to compile this input query. Now, compare this batch version with the single-tuple version. Obviously, the single-tuple version is relatively simple; it would be [indiscernible] statement, whereas this batch version performs much more computation for the given input batch size. Notice a few other differences here. First, the single-tuple version has specialized primitive-type trigger parameters, whereas in the batch case we have a generic list of tuples. That is actually a powerful optimization that allows the C++ compiler to treat these delta A, delta B as constants and move them out of loops during the compilation to machine code. Second of all, the single-tuple version doesn't have any intermediate results. We can eliminate those loops as well, because we know that the input batch is of size 1.
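The three steps above can be sketched as a batched trigger. This again assumes the running example has the shape SELECT SUM(R.A * S.C) FROM R, S WHERE R.B = S.B, and the code is my own illustration rather than the generated C++:

```python
# Sketch of a batched trigger for insertions into R in the query
#   SELECT SUM(R.A * S.C) FROM R, S WHERE R.B = S.B

from collections import defaultdict

Q = 0
MR = defaultdict(int)    # SUM(A) of R, grouped by B
MS = defaultdict(int)    # SUM(C) of S, grouped by B

def on_batch_R(batch):
    """batch is a list of (a, b) insertions into R."""
    global Q
    # Step 1: pre-aggregate the input batch by the join key B,
    # possibly shrinking it to a few distinct keys.
    agg = defaultdict(int)
    for a, b in batch:
        agg[b] += a
    # Step 2: compute delta Q as an in-memory hash join against MS.
    for b, a_sum in agg.items():
        Q += a_sum * MS[b]
    # Step 3: fold the same pre-aggregated batch into the view MR.
    for b, a_sum in agg.items():
        MR[b] += a_sum

MS[1] = 10
on_batch_R([(2, 1), (3, 1), (4, 9)])   # pre-aggregates to {1: 5, 9: 4}
assert Q == 50 and MR[1] == 5 and MR[9] == 4
```

Note how the same pre-aggregated `agg` is reused in steps 2 and 3, which is the sharing of common delta expressions mentioned above.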
And we can also do partial evaluation, inlining, and things like that. Right? So if we were to specialize this code, we'd get the single-tuple code. Now, the question is, which one is going to perform better? Right? The single-tuple code obviously has better instruction cache locality, because it's much smaller. The batch version might have better data locality, because it's going to iterate over these collections of data in amortized loops. And the batch version can also profit from this pre-aggregation, from this first step, because in some queries this step might reduce the domain to relatively few values and then make subsequent computation very cheap. So we did an experiment comparing the single-tuple performance with the batching performance for the full TPC-H benchmark, all 22 queries. So what we did here is we measured the performance of single-tuple processing, and we measured the performance of batch processing for different batch sizes. We varied batch sizes from 1 to 100,000. And for each query we picked the best batch size and the best throughput numbers that we got, right? And then we normalized these numbers by the single-tuple performance. That's what is shown on the X axis. And the Y axis shows the cumulative count of these queries. So let me explain this curve. What we can see from this graph is that for five queries, the relative throughput ratio is less than 1. That means single-tuple processing is better than batch processing for any batch size. >>: For batch size 1, because you have [indiscernible]. Milos Nikolic: Yeah, for batch size 1 it's always better. >>: Yeah. Milos Nikolic: But even if you increase the batch size, that pre-aggregation step doesn't help you that much. You might be pre-aggregating over primary keys when you want to reduce the size of the input batch, and you won't see any benefit. In fact, you will see worse performance.
For about half of our workload, we can see that single-tuple performance is at most 20 percent worse than batch processing. For a few more queries, it's around 50 percent. But notice also that this best batch throughput may correspond to different batch sizes for different queries. So if we had to fix just one batch size, these numbers might shift even further in favor of single-tuple processing. However, there are certain queries where batch processing can deliver very good performance. For four or five queries, we have seen improvements of over 4,000x. And this is mostly because of batch pre-aggregation, which allows us to reduce the domain to very few elements, such that the subsequent maintenance statements are relatively cheap. Okay. What about distributed execution? Well, the goal of this part is to transform these local incremental programs into distributed programs that can run on large-scale processing platforms. And we do that for two reasons: we want to speed up the maintenance work, and we want to accommodate more data. Now, one crucial difference between this work and classical query optimization is that we're trying to distribute these incremental programs. These programs consist of triggers, and each trigger consists of statements. And between these statements we have data flow dependencies, like in any other program. Right? So, for example, the delta in the first statement might be using views that are maintained in the second statement or in the sixth statement. That means we cannot arbitrarily reorder them. So we have to be careful about how we interleave the execution of these statements in a distributed environment. Our first attempt was actually to execute these trigger programs on a distributed system with an asynchronous execution model.
And that turned out to be quite bad, because of the synchronization overhead we had to introduce to ensure the consistency of these programs. So our approach is to distribute these incremental programs using processing platforms with synchronous execution models; for example, to execute the program on top of Spark or Hadoop. So we assume we have one master node with a bunch of workers, and we represent the computation as a sequence of jobs. Now, in order to achieve this execution, we had to change our language and introduce location tags, to mark each expression with information about where it is located in the distributed environment. Each expression can take one of three possible values. It can be marked as local, meaning that the expression is executed and the result is stored on the master node; it can be distributed by key, meaning that the whole materialized view is partitioned across all workers by a certain key; or it can be distributed randomly, which is usually the case with input batches. We also introduce some communication primitives in order to achieve this distributed execution. These are well-known primitives that allow us to, for example, repartition one distributed collection, transform a distributed collection into a local collection, or scatter a local collection across many workers. So here is how we do it. We start with a single local statement. For each materialized view in this expression graph, we annotate where the location of the result is. We assume that such information is provided from the outside; how doesn't matter in our case. So let's say that delta R, the input batch, is randomly partitioned across all workers, that the materialized view M_S is partitioned by column B, and that the final result is stored on the master node. So how do we build the distributed statement?
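The three location tags can be written down as a small annotation scheme. This is an illustrative Python sketch, not the actual compiler's representation; the tag names and the `tags` table are hypothetical, loosely following the delta R / M_S / Q example from the talk:

```python
from enum import Enum

class Loc(Enum):
    LOCAL = "local"           # executed on, and stored at, the master node
    DIST_BY_KEY = "by_key"    # partitioned across all workers by some key
    DIST_RANDOM = "random"    # randomly partitioned (typical for input batches)

# Each expression carries a (tag, partitioning key) annotation:
tags = {
    "delta_R": (Loc.DIST_RANDOM, None),   # the input batch
    "M_S":     (Loc.DIST_BY_KEY, "B"),    # view partitioned by column B
    "Q":       (Loc.LOCAL, None),         # final result on the master
}
```

Given such annotations on the leaves and the root, the compiler can decide bottom-up where communication is unavoidable.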
Well, we traverse this expression tree in a bottom-up fashion and introduce communication primitives where they are necessary in order to preserve the semantics of the query operations. For instance, since we have to compute the join between those relations, they have to be partitioned on the same join key. That means we have to repartition one of the inputs. And once they are partitioned by the same key, we can perform local join operations followed by local aggregations, and then finally we can introduce one gather primitive to collect the result on the master node. That way we can execute our local statement as a sequence of stages. These stages can be either distributed stages, happening in parallel on every worker, or local stages, happening only on the master node. We also have communication stages where we need to send data around in order to ensure correct semantics of the query execution. Now, that is how we transform one statement. These programs have multiple statements, and we can simply concatenate the translated statements and say this is the distributed program. From here it's relatively straightforward to generate code that would run on top of Spark. But this code is going to be very inefficient, because it requires a lot of communication. So in the next phase, we try to reorder those stages and merge the stages that are compatible. For example, we can merge distributed stages, or we can merge local stages. And while reordering, we try to preserve the statement dependencies that exist in the input program. Once we have that optimized distributed program, we can generate code for any system that has a synchronous execution model. In our case, we generate Spark code. So we did some experiments. In this case, we have a stream synthesized from a 500-gigabyte TPC-H database.
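The bottom-up insertion of communication primitives can be mimicked with a tiny planner. This is a hedged sketch; the function and stage names are made up for illustration, not the actual compiler's API:

```python
def plan_join(left_loc, right_loc, join_key):
    """Decide which stages a distributed join needs before it can run.
    A location is ("local",), ("random",), or ("by_key", column)."""
    stages = []
    # Both inputs must be partitioned on the join key for a local join.
    if left_loc != ("by_key", join_key):
        stages.append(("repartition", "left", join_key))
    if right_loc != ("by_key", join_key):
        stages.append(("repartition", "right", join_key))
    stages.append(("local_join", join_key))       # parallel on every worker
    stages.append(("local_aggregate", join_key))  # parallel on every worker
    stages.append(("gather",))                    # collect on the master node
    return stages

# delta R is randomly partitioned; M_S is already partitioned by B,
# so only the input batch needs repartitioning:
plan = plan_join(("random",), ("by_key", "B"), "B")
```

In the real system the resulting stage sequences are then reordered and merged across statements, subject to the data flow dependencies.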
And the batch size is fixed at 200 million tuples. What we measured here is the time it takes to process one batch with respect to different numbers of workers. And we have several queries here, starting from a very simple one, like query 6, which just requires a local pre-aggregation and one collection phase on the master node. As you can see, with our implementation on top of Spark, we can go as low as around 0.5 seconds for processing these 200 million tuples using, I think, 800 nodes. Other queries, like query 3 and query 17, require two communication stages, and they also require a certain amount of data to be transferred. So in those two examples, the latency is higher, around 4 seconds. And this mostly comes from the overhead of the underlying execution engine, in this case Spark. Since we have to execute several jobs in order to maintain this query, the internal overheads of Spark are actually causing this latency. And for some queries, like query 7, where we have more than two communication stages, and in those communication stages we shuffle a lot of data, the performance really suffers: it's around 32 seconds and flattens out at 200 nodes. And basically this is something that we cannot improve using Spark, because the latency is mostly affected by the garbage collection that is happening on the worker nodes, and we don't have a way to change that. So building a specialized system for these kinds of workloads would really help us here. Okay. With that I would like to conclude the first part of the talk and jump to the second part. In the second part, we're going to talk about incremental processing of machine learning and linear algebra programs. Many existing machine learning algorithms can be described as iterative linear algebra programs.
For instance, if you take the ordinary least squares method, which is a common way to fit a model to data, it can be expressed as a sequence of operations over matrices. In this case, the operations involved are classical matrix multiplication, inversion, and transpose. Now, the problem with this kind of analytics is that we have to work with increasingly large datasets, which is becoming expensive. And this data is changing, usually through small changes compared to the overall dataset size. So we built a system called LINVIEW that performs incremental processing of iterative linear algebra programs. As input, this system takes APL-style programs, like MATLAB or Octave programs, and transforms them into incremental engines that efficiently maintain the query result. The input language consists of standard matrix operations like multiplication, addition, transpose, and matrix inversion. And this language is sufficient to express many machine learning algorithms. So let's see what it means to perform incremental processing in the context of linear algebra. Here we have an example of matrix power computation, where we want to compute the fourth power of a given matrix A, which is a very common task in many realtime applications. For instance, in graph processing, to compute reachability between nodes, or in PageRank computation. We can express this computation as a sequence of two statements, and these statements involve full matrix multiplications. In all of these cases we consider dense matrices. And the complexity of such a matrix multiplication, without specialization, is N cubed. Now, when the input matrix changes, let's say just one element changes, we can recompute everything from scratch, but that might be costly, especially if the updates are small.
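The two-statement matrix power program can be written down directly. A plain-Python sketch (no libraries) of computing A to the fourth power with two full multiplications, each O(n^3) for dense n-by-n matrices:

```python
def matmul(X, Y):
    # Textbook dense matrix multiplication: O(n^3).
    n, m, p = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

A = [[1, 1],
     [0, 1]]
B = matmul(A, A)   # statement 1: B = A^2
C = matmul(B, B)   # statement 2: C = B^2 = A^4
```

The question the talk asks next is what happens to B and C when a single entry of A changes.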
What we want to do instead is to derive an incremental program which is hopefully simple enough and will tell us how to compute only the delta change in the result more efficiently. Now, the incremental approach is beneficial for linear algebra programs that involve matrix-matrix operations, where the complexity of the original program is N cubed. For those kinds of programs, we're going to generate incremental programs that exploit the sparsity of the input to achieve better asymptotic behavior. Now, the questions are: how are we going to derive these incremental programs, how are we going to evaluate these delta expressions, and how are we going to represent them? For delta derivation, we rely on the basic properties of matrix operations. For instance, if we want to incrementalize this matrix multiplication, we rely on the distributivity of matrix multiplication with respect to addition to derive a delta of B that consists of the sum of three terms. This is relatively straightforward. Now, how are we going to evaluate those kinds of deltas? Suppose that the input delta consists of a single entry change. In order to compute delta B, we evaluate this sum of three expressions, and the final output will contain changes in one row and one column. With these blue markers, I denote entries that might be potentially non-zero in the delta matrices. So we can see that for a single entry change in the input matrix, the result, delta B, has potential changes in one row and one column. Now, if we push this delta to the second computation, the one that computes delta C, we can see that the whole output is going to change. And that causes a problem. First, incremental evaluation then has a lower bound of N squared: we have to change all the elements, so we cannot go below that.
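The three-term delta rule for a product can be checked numerically. A sketch in plain Python, assuming B = A·A as in the matrix power example (helper names are mine, not LINVIEW's):

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def madd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

A  = [[2, 0], [1, 3]]
dA = [[0, 0], [5, 0]]           # a single-entry change in A

# By distributivity: (A + dA)(A + dA) = A·A + dA·A + A·dA + dA·dA,
# so the delta of B = A·A is the sum of three terms:
dB = madd(madd(matmul(dA, A), matmul(A, dA)), matmul(dA, dA))
```

With dA non-zero in a single entry, dB is non-zero only in one row and one column, exactly as in the blue-marker pictures from the talk.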
The second thing is that if we want to use this delta C in subsequent delta evaluations, we're going to end up doing full matrix multiplications. So very quickly, in just two iteration steps, incremental evaluation loses its benefit over reevaluation. We call this the avalanche effect: a single local change has exploded in just two steps and contaminated the whole output. Now, how are we going to deal with this? Well, we've just seen that representing deltas as plain matrices is not going to work. The main insight of our work is that although these delta matrices might have non-zero elements everywhere, they often have a simple structure, meaning that they can be decomposed and represented as a product of two small matrices that have low rank. For instance, if we take this input delta, which has only one element changed, we can compress and represent the same change as a vector outer product, a product of two vectors. And I'm going to argue that this factored representation of deltas can help us achieve efficient evaluation. Let's see how. Suppose that delta A, the given input change, is an outer product of two vectors; this is called a rank-1 update. Now, when we want to compute delta B, which we do in the same way as before, instead of computing the outer product of those two vectors and then multiplying the matrices, which would cost us N cubed and which we want to avoid, we exploit the associativity of matrix operations to first perform the multiplication between the matrix and a vector. And once we have that, we can represent delta B as a sum of two outer products. Now, if we want to update B, we can simply apply this delta to the materialized result.
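The factored evaluation can be sketched as follows; an illustrative plain-Python version, not LINVIEW's implementation. With dA = u·vᵀ and B = A·A, regrouping by associativity gives dB = u·(vᵀA + (v·u)vᵀ) + (Au)·vᵀ, a sum of two outer products computed with only matrix-vector work, O(n²) total:

```python
def matvec(A, x):
    # A @ x
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def vecmat(x, A):
    # xᵀ @ A
    return [sum(x[i] * A[i][j] for i in range(len(A)))
            for j in range(len(A[0]))]

def delta_B_factored(A, u, v):
    # dB = dA·A + A·dA + dA·dA with dA = u vᵀ, kept factored.
    vA = vecmat(v, A)                              # O(n^2)
    Au = matvec(A, u)                              # O(n^2)
    c = sum(a * b for a, b in zip(v, u))           # vᵀu
    row1 = [a + c * b for a, b in zip(vA, v)]
    return [(u, row1), (Au, v)]                    # list of (column, row)

def expand(factors, n):
    # Materialize the sum of outer products (for checking only).
    M = [[0] * n for _ in range(n)]
    for col, row in factors:
        for i in range(n):
            for j in range(n):
                M[i][j] += col[i] * row[j]
    return M
```

Downstream deltas keep this `[(column, row), ...]` form instead of ever materializing an n-by-n delta matrix.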
But if we want to propagate delta B further down the expression, we always use this factored representation as a sum of outer products, which allows us to express delta C as a sum of four outer products. What that brings us is that we can now do the whole delta evaluation using only N squared operations; there are no matrix-matrix multiplications involved in this incremental program anymore. We can also support general iterative models. As you can see, with every delta propagation step, the number of outer products in the sum is actually increasing. But that's not a problem, because many practical algorithms converge in very few iterations; for example, some converge in fewer than thirty iterations. So how do we incrementalize iterative models? Well, we start from an initial state, and we have an iterative function which creates a sequence of states. Now, when the input changes, we can recompute all the states again, or we can do incremental evaluation: we materialize every intermediate state, then we derive delta queries and use them to update the intermediate states of this iterative program. So, to give you a short overview of our results: here I'm showing different use cases of complex analytics, and their complexity in terms of big-O notation for reevaluation and for incremental maintenance, which is our approach. As you can see, in all of these cases we managed to bring down the complexity from N cubed to N squared, just by using this factored representation and completely avoiding matrix-matrix multiplications. The only case where we cannot do better than reevaluation is when the input program doesn't involve any matrix-matrix multiplication. >>: [Indiscernible]. Milos Nikokic: Yes. >>: So can you build a model and incrementally make it using the same [indiscernible]? Milos Nikokic: Yes, this is a general model.
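The scheme of materializing every intermediate state and updating each one by a delta can be sketched like this (plain Python, hypothetical helper names; the real system additionally keeps the deltas factored rather than as full matrices):

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def madd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def materialize(A, k):
    # Reevaluation: compute and keep every state S_i = A^(i+1).
    states = [A]
    for _ in range(k - 1):
        states.append(matmul(states[-1], A))
    return states

def incremental_update(states, A, dA):
    # S_{i+1} = S_i · A, so in terms of the OLD S_i and OLD A:
    #   dS_{i+1} = dS_i·A + S_i·dA + dS_i·dA
    dS = dA
    new_states = [madd(states[0], dA)]
    for i in range(1, len(states)):
        dS = madd(madd(matmul(dS, A), matmul(states[i - 1], dA)),
                  matmul(dS, dA))
        new_states.append(madd(states[i], dS))
    return new_states
```

Each materialized state is patched by its delta instead of being recomputed from scratch; combined with the factored representation, this is where the N cubed to N squared improvement comes from.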
For example, you can fit a linear regression model into this, expressing it as an iterative program, or any other program that you can compose using these basic matrix operations, as long as it doesn't involve only vectors. Because a matrix-vector operation is already N squared, and we cannot beat that; that's the lower bound. So if your input algorithm involves matrix-matrix operations, yes, you can incrementalize it in this way. Okay. So how well does it perform in practice? Here we measure the performance on top of Spark for up to 100 nodes. We measure the performance of computing the 16th power of a given matrix A, where the size of A is fixed. And what we measure is how much time it takes to process one rank-1 update; that means, for example, changes in one row or in one column. As you can see, as you add more nodes, the time it takes to reevaluate is obviously going down, but it's still an order of magnitude slower than our approach. And in our case, it might look like incremental view maintenance doesn't scale. Which is not true: for incremental view maintenance, the input problem size is actually too small, and all of the numbers you can see below are just the communication overheads of the underlying engine, Spark. We also tried fixing the number of workers to 100, varying the dimension size, and measuring how much time it takes to process one update. As you can see, the cost of reevaluation is growing very quickly, because it requires reshuffling of the whole matrix. In our case, we're just reshuffling vectors, which is a linear-time operation, and we can handle much larger matrices. Okay. With that I would like to conclude. Today I presented two systems that I worked on.
The first one is DBToaster, a SQL compiler that produces lightweight stream processing engines tailored for a given query workload. It is based on recursive compilation, and it supports single-tuple and batch execution in both local and distributed environments. In practice, DBToaster can achieve up to five orders of magnitude better performance than other commercial systems. DBToaster is released: you can go to our website, download it, and play with it. The second part of this talk was focused on incremental evaluation of linear algebra programs, where the main problem and challenge is how to contain the avalanche effect, where a single entry change can quickly contaminate the whole output. To deal with this problem, we introduced the factored delta representation, which allows us to contain the avalanche effect and achieve asymptotic improvements compared to reevaluation. Okay, just a short note about some possible future work. Right now, sensors are generating enormous amounts of data, and this data is usually shipped to Cloud storage, where it's processed by data practitioners. Soon enough, this approach won't be able to scale and won't be able to support the large amounts of data that are being increasingly generated by sensors. So what we want to achieve here is to take the computation that is happening right now inside the Cloud and push it all the way to the edges of these sensor networks, performing both relational operations and domain-specific operations inside the sensors before shipping the aggregated result, or the valuable insight, back to Cloud storage. And the work that I've done together with Jonathan and [indiscernible] here is one step in that direction, where we tried to introduce DSP operations inside a stream processing engine. And it turned out to be very successful.
We can leverage existing algorithms for implementing this system. Now, the challenge is: how do we do similar things in other domains, like numerical computations, approximations, statistical models, and so on. I believe this is one challenging direction that we can take. Okay. With that, I would like to thank my collaborators from EPFL, with whom I worked on these two projects, and also Jonathan and Chris, with whom I worked on Trill. And, yeah, I'm glad to take any questions that you have. [Applause] >>: So coming back to the first part of the talk, on DBToaster. Were you using [indiscernible] for processing there [indiscernible]? Milos Nikokic: No, DBToaster is just a single node. >>: I see. So what are the bottlenecks? Was it the memory, or algorithmic, or was it the hash table [indiscernible]? Milos Nikokic: Yes. >>: For the kind of [indiscernible] that you were looking at? Milos Nikokic: Yeah, so it was the memory bottleneck. We don't have very good cache performance, mostly because it's a streaming workload and we need to do random lookups that clear out our cache. We do have good instruction cache locality, which is very important, because it's very hard to mask those kinds of cache misses. And for data locality, in these experimental scenarios we are doing our best, and our data structures are optimized for this kind of workload. But we still have something like 40 percent cache misses, just because we have streaming data, which is clearing out our caches. >>: So the numbers that you project, [indiscernible] a million tuples per second — does that translate directly to around a million cache lookups? Because presumably, on the machine you have, you can get about a million hash lookups per second. Milos Nikokic: We actually have better results than that. For some queries, I think we are achieving around thirty million tuples per second.
Yes, that would translate into hash lookups; we're doing at least that many. And in some cases, because we have multiple trigger statements, we are doing even more. >>: So more joins means more hash lookups. Milos Nikokic: Well, we try to avoid hash joins. We try to avoid joins, because this recursive compilation will often completely eliminate joins from our query processing. All we're doing is iterating over some maps and updating the result. Often enough, we don't do any join processing. Not in all cases, but that's basically one idea: with this recursive compilation, you can get rid of the classical database operators. >>: [Indiscernible] that you're looking at. So that's the only one [indiscernible]. Milos Nikokic: Yes. And our internal implementation is an in-memory hash join. >>: So what you have done is you create a [indiscernible] join and the application and [indiscernible]? >>: [Indiscernible]. Milos Nikokic: Yeah, these views are already pre-aggregated. So in many of these cases, you don't need to do any pre-aggregation on top of that; you just do a lookup and update the result. Jonathan Goldstein: Do we have any more questions for Milos? Okay, then I would like to thank our speaker. Milos Nikokic: Thank you. [Applause]