
>>Paul Larson: We are pleased to welcome Volker Markl from the Technical University of Berlin. Most of you know Volker quite well. He did his PhD with Rudolf Bayer many years ago, then he got distracted and actually worked for IBM, but finally saw the light and returned to Germany, to the Technical University of Berlin. We are pleased to have him here, and he is bringing several students with him. He must be pushing them hard, because one of them got a cold and couldn't show up. But two survived, so they're going to give talks, and Volker is going to say a few words about what's included in this presentation.

>> Volker Markl: So thank you for the introduction, Paul, and thanks for the opportunity for all of us to be here, Kostas and Stephan and myself. I will not talk too much in the beginning, because this is really about Kostas and Stephan presenting some of the work; let me just briefly give the larger context that all of this work operates in. At TU Berlin, we are conducting research in one major research area, which is large-scale data management and massively parallel data processing. We have a relatively large research project, on which now on the order of 12 PhD students are working, actually building a system, in a project called Stratosphere, which you also see here. The idea is to build a query processor that is parallelized automatically on a large cluster. You'll hear a lot about that from Kostas and Stephan.

What I just want to briefly mention, though: the original plan was to have two more students of mine here, but we had some trouble, as Paul already mentioned. One of my students, Thomas Bodner, got the flu or a cold or something, just on the plane or whatnot, so he is in Seattle, but he is sick today. And the other one, Alexander Alexandrov, had some visa issues and will only arrive tonight because his visa was delayed. So it's a bit unfortunate; there are two talks which we'll probably cover another time, or I'll give a brief summary of them after the talks. One of them is on big data benchmarking: a massively parallel data generator, where there are some challenges. How can you generate terabytes and petabytes of data fast? You have to do it in parallel. And how can you do that while maintaining correlations? So that's one aspect. And the other student, Thomas, his work is on a use case repository for big data, so actually collecting some use cases; we're doing that together with Mike Carey's group at UC Irvine. So that's the other aspect.

So those are two things I'll be happy to talk about after the presentations, but without much further ado, I would hand over now to Kostas, to give a brief overview of Stratosphere and then focus on query optimization and MapReduce-style user-defined functions, and then Stephan. I should also mention this briefly: many of you may know Kostas; he was already interning at Microsoft two years ago, and he is a postdoc researcher in my group and really driving a lot of the effort in this Stratosphere project. Stephan is now a fourth-year PhD student who will soon graduate, and he's doing a lot of work in query optimization and in particular parallelization of query processing. Kostas got his PhD with Christian Jensen in Aalborg and was at Microsoft when I was working here, and now he's working with me; he's been working with me for a little bit more than a year. And Stephan actually did summer internships at Almaden already and will probably, hopefully, graduate next year. So with that, I'll hand over to Kostas.

>> Kostas Tzoumas: Thank you, Volker, for the introduction, and thank you all for being here. So I will give a brief overview of the Stratosphere system. I'll talk a little bit about the architecture, and then I will dive into some specific aspects of the query optimizer that are quite novel, quite novel in the context of this kind of system. Good. So the motivation of the project as a whole is big data: we're seeing nowadays huge amounts of data, human generated, machine generated, and at the same time we also see a need for more complex and deeper queries.

So, queries that go beyond the traditional data warehousing queries. And this is needed in a couple of contexts: in big science, in business, all kinds of people are doing these things. As a result, over the last few years, you have seen a bunch of completely new systems that are not traditional relational DBMSs, but are used to do this kind of complex analysis. I'm going to argue that this really started from Google's MapReduce system, and then we have Hadoop, the open-source implementation; a couple more systems in this category are the Asterix/Hyracks system of the University of California, Irvine, so Mike Carey's group, and the SCOPE system, which is used here in Bing.

Stratosphere is also one of these systems, and it's the one I will talk about today. Stratosphere stands for the layer of the atmosphere above the clouds, so above the cloud. Good. So, the architecture looks like this. First of all, Stratosphere is a layered system; there are a few components in the stack, and it actually exposes programming APIs at a few of these layers. On the bottom level you have a parallel dataflow engine, which is called Nephele; this engine essentially operates on [inaudible] parallel dataflow programs consisting of tasks and [inaudible]. Above that we have a layer that does all the query processing, so join algorithms, sorting and hashing and so on. And on top of that we have an optimizer. The input to the optimizer, the second entry point into the system, is what we call a PACT program, and then on top of that one can actually implement languages, so one can implement query languages. We have a query language called Meteor, and in addition there is some effort to support a Scala [inaudible].

A few words about each component. The Meteor language is a query language for semi-structured data; the data model here is JSON-like. It contains operators to do classic relational processing, and in addition to those it contains operators to do information extraction from [inaudible]. This is actually developed by some partners of ours at the [inaudible] Institute, and they're very interested in doing data cleansing and data mining and so on.

Now, the PACT programming model is essentially a new programming model that is based on, and actually extends, MapReduce as a programming abstraction. If you think about MapReduce, there are essentially two second-order functions, and we extend that to provide additional [inaudible]. I'll talk in detail about that later.

The runtime operators: the runtime essentially manages memory, manages IO, and does all the [inaudible] processing stuff. The dataflow engine gets as an abstraction a DAG, a parallel DAG, and it does things like resource allocation, scheduling, communication and so on. And finally, Stratosphere has a cost-based optimizer. This cost-based optimizer optimizes a PACT program, or, after a few [inaudible] steps, a Meteor script or a [inaudible] Scala program. It picks things like the physical [inaudible] strategies, so for example [inaudible] or [inaudible]; it picks things like partitioning strategies, so how will I partition my data; and finally, it picks the operator order. The focus of my talk will be on the Stratosphere optimizer.

Before I get there, I have a few examples of the queries that you can run. As I said, you have this semi-structured data model and you can write small scripts, really for rapid processing. In the top example we're reading a JSON file and filtering; in the bottom example we're transforming some rows; you can do things like [inaudible], so down here we're joining JSON files; and in addition to those, you can actually have user-defined functions. So you can call a user-defined function [inaudible]; that's the top example over there. And actually, for most of my talk I will focus on how to integrate these Java user-defined functions into the optimizer.

>>: [inaudible]

>> Kostas Tzoumas: So it's very much like Jaql. It's actually very much Jaql, with a few extensions for data cleansing and so on. Now, this language is compiled into an intermediate representation called Sopremo, which stands for Stratosphere Operator and Data Model; it is essentially an operator library with 35 operators that you can actually extend as a user, and these Sopremo programs are then translated into PACT programs. One optimization that you can do there is how to efficiently pack the nested data into records so you can process them efficiently in the runtime, since the runtime works on a record data model.

Now, the PACT programming model. The basic idea here is that we have a programming model built around a PACT, which is short for parallelization contract: essentially a first-order function, written by the user in Java, wrapped inside a second-order function signature, which is provided by the system and picked from a fixed collection of second-order functions. The basic observation is that the second-order function describes how the data can be grouped into independent subsets which can be processed together, and then the user-defined function will be called once for each of these groups. The observation was that in the MapReduce paradigm, what we have essentially is a second-order function that we call Map, right, and a first-order function that operates on the record. What the Map tells us is that each of these records is an independent subset, and the UDF is called once per [inaudible], so once per record. The Reduce essentially tells us that all records that share the same value of the key (here, the two records with the blue key) form an independent group, and the UDF operates on these groups. You can actually generalize that, and we generalized it to capture more relational-style processing. You can, for example, define a Cross PACT, which is essentially a Cartesian product: it takes two inputs, and its second-order function takes every possible pair of records and creates an independent subset for each of those. You can have something like an equi-join, which we call Match: here you have an independent subset for every two records whose keys match, and the UDF is called on that pair of records. And we have a CoGroup, which essentially generalizes Reduce to two dimensions: for two data sets, you have a Reduce over both data sets. Yeah. So these are fixed, these are hard-coded in the system; these are the five that we have up to now, but we are actually interested in looking at others, so one can actually [inaudible] more PACTs to do things like stream processing, things like similarity joins, and so on.
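
To make the shape of such a contract concrete, here is a minimal sketch of a Match-style UDF: the second-order function pairs up records from the two inputs that agree on the key, and calls the user's first-order function once per pair. The class and method names below are illustrative stand-ins, not the actual Stratosphere API.

    // Illustrative only; names do not correspond to the real Stratosphere API.
    // Match PACT: the system forms one independent subset per pair of
    // key-matching records and invokes the first-order function on each pair.
    public class JoinCustomerOrders extends MatchFunction {   // second-order function: Match
        @Override
        public void match(Record customer, Record order, Collector out) {
            Record result = new Record();                     // first-order function body
            result.setField(0, customer.getField(1));         // e.g. carry over the customer name
            result.setField(1, order.getField(2));            // and the order total
            out.collect(result);                              // emit one record per matched pair
        }
    }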

So a PACT program, then, is essentially a dataflow graph composed of these PACTs. We have a directed acyclic graph with input sources, output sinks, and these PACT operators, which contain a second-order function and a first-order function. Here, for example, you have two sources that extract two attributes from two text files; they are connected to two Map operators that do some kind of processing, and then those go to a Match PACT, which goes to a [inaudible] and to a [inaudible]. One thing to note here is that you are not restricted, as in MapReduce, to having an input source, a Map, a Reduce, and an output; you can connect these in any form you want, as long as it's a DAG. Another fine point is that this is the programming model exposed to the user, so you essentially use it to write your UDFs, but it's also, currently, the intermediate representation used by the optimizer to enumerate plans. Now, these PACT programs, passing through the optimizer, produce a plan, a parallel plan, that is given to this Nephele dataflow [inaudible]. What this looks like is basically a DAG of tasks connected by channels. The interesting thing here is that you have three kinds of channels: in-memory channels, network channels, and [inaudible] channels. And what that allows for is pipelining.

So you don't always have to materialize data and then read it back from [inaudible]. And this gets expanded into a parallel program, depending on the degree of parallelism, and is executed on a cluster. This [inaudible] gives a sort of overview of what's happening: we have the PACT program on the left and some user code; this gets compiled to this DAG plus code, so here a [inaudible] is wrapped around the user code, then some network communication code goes around that, and then it is given as a task to be executed in parallel.

Good. Now we move to the main part of my talk: the Stratosphere optimizer. I will describe some aspects of the [inaudible] paper from 2012 that we called "opening the black boxes"; you will probably see why. The Stratosphere optimizer is a cost-based query optimizer that enumerates plans, currently [inaudible], and that passes interesting properties top-down. Basically, you are given a PACT program and you generate an equivalent PACT program that has a lower cost. What I will focus on today is another aspect, which is how to integrate these MapReduce-style user-defined functions fully into the query optimizer. The problem is that you don't have semantics for the UDFs. So what we did is to analyze the user code of the UDFs using a compiler technique to extract a few properties that we can use, and based on these properties, apply transformations [inaudible]. The motivation for doing that is that this programming model I described is actually quite general: it's not only used in Stratosphere; a variation of this model, directed acyclic graphs of operators that contain this style of UDFs, is also used, for example, in SCOPE, and it's also getting traction in relational DBMSs. Both Greenplum and Aster are trying to integrate, and have to some extent integrated, these MapReduce functions. So it's a timely problem. The problem, as I said, is that you don't know what's inside these functions, so you have to do something about that. And the novel thing about Stratosphere is how to deeply embed those in the optimizer, including reordering them in a plan.

So the first step is that you take each one of those UDFs and run a static code analysis over them. We use a Java tool called Soot that operates on a kind of intermediate representation of the Java code. There are also tools for byte code, so if, for example, you don't have the original Java source, you can do it on byte code. What we're going to find is four properties of the user function. The first is the output schema: assuming that we know the schemas of the [inaudible], we infer the schema of the output. The second is the read set, which we loosely define here as the attributes of the inputs that may somehow influence the output of the operator. The third is the write set, which is again loosely defined as the attributes of the output that might be different from the corresponding input attribute. Yeah? And the final one is the emit cardinality, which says, per call, how many records are you going to emit. Yeah? The way we do this is a static code analysis. Here we see an example user function for a Match PACT, which basically copies the left input record, filters on an attribute of the left record, and based on that condition either does something using some attribute of the right record, then does some more things, and then emits a record. Keep in mind that this is a Match, right; this was modeling a [inaudible], so the second-order function essentially takes every two records that match on the key. However, you can put anything you want in there, right? It's not a [inaudible] in the sense that you just concatenate the records and emit; you can really write [inaudible]. So the important part of this particular Java code is that it's not [inaudible].

What we have is essentially a record data model that we expose to the programmer, and the programmer has a fixed API for manipulating these records. For example, in our case, we have an API to create a blank record, to copy a record, to get and set fields based on a field number, and so on. This is very similar to the Pig API, and, I believe, also to SCOPE's, with some differences, I guess. And the final thing is that, remember, a PACT program is a dataflow: there's really no control flow between two operators. Two operators can only communicate by exchanging data sets, so you can localize your analysis to each operator. Given these properties, what you actually can do is this: to approximate the read set, you go and scan all the get statements, right; from the statements that get an attribute from a given record, you can find out which attributes you're reading. To approximate the write set, you go and look at the set statements, right? Whenever you set something on your output record, you need to check what that something is: you basically follow the dataflow graph of the program backwards and ask, is this something that could have been computed, or is it just something that I copied from my input? If it's something [inaudible], then you add it to your write set. So in this example, if I know that my input one had attributes A, B, C and my input two had attributes D, E, F, I know my output, and I can approximate the read set from the green statements here and the write set from the orange statements.
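
As a rough illustration of what the analysis looks for, consider a sketch like the following, again with hypothetical record-API names; the comments mark how each get and set statement feeds the read set, the write set, and the emit cardinality.

    // Hypothetical Match UDF, annotated with what the static analysis would extract.
    public void match(Record left, Record right, Collector out) {
        Record result = left.createCopy();                  // output derived by copying the left input
        int status = left.getField(2, Integer.class);       // get(): left field 2 joins the read set
        if (status > 0) {
            String name = right.getField(1, String.class);  // get(): right field 1 joins the read set
            result.setField(3, name.toUpperCase());         // set() of a computed value:
        }                                                   //   output field 3 joins the write set
        out.collect(result);                                // single emit outside any loop:
    }                                                       //   emit cardinality is exactly 1 per call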

And finally, the emit cardinality is approximated by looking at the control flow graph: you look at the final emit statement on the [inaudible] record and ask how many times it is going to be called, right? Is it going to be called zero or one times inside a conditional, is it in a loop, and so on.

Now, what are the challenges here? The challenge is how to do this, first of all, correctly, so you don't want to make errors, and how to do this [inaudible]. The challenge comes from conditionals and loops, so basically from having different code paths. And there's a very simple solution to that. In the previous example, we had these statements in lines six and seven, and they're executed only if the condition in line five is true, right? However, doing static code analysis, you don't know whether this code path is actually going to be followed. You have to be safe, and to be safe you can be conservative. In our case, we can prove that if these read sets and write sets are bigger than what they actually are, then we get a subset of the correct plans, but all of these plans are correct. So we may lose optimization opportunities, but we're guaranteed to be correct. The safe strategy here is: when in doubt, always add the attribute to the read or write set, and we can prove we get correct transformations.

Now, having done that, you can do a couple of things. First of all, you can pick [inaudible] strategies. This is the PACT program I had before; the optimizer knows how to parallelize an operator from the second-order function signature. It knows that a Map operator, for example, is embarrassingly parallel; it can run [inaudible] on any node. For a Reduce operator, you need to partition the data on the key. For a Match operator, you can do anything that a parallel database does for [inaudible]: you can, for example, do a repartition join, you can broadcast one side, [inaudible] and so on. So based on the second-order function signature, the optimizer can pick [inaudible] strategies. Then, based on the read and write sets, the optimizer can see whether keys are preserved through functions; it can identify, for example, that a function preserves the key, so it will not have to do double work and partition twice. Here, for example, we have done a partitioning on the same key needed by the Match and the Reduce, so by the time we reach the Reduce, we don't need to do anything. Contrary to plain MapReduce, here when we run the Reduce, it does not imply a physical shuffle; a Reduce can come for free.
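
The first half of that, picking physical strategies from the second-order function signature alone, can be rendered as a toy lookup; the enum and method names here are mine, not the system's.

    // Toy rendering: each contract type admits a fixed menu of shipping strategies.
    enum Pact { MAP, REDUCE, MATCH, CROSS, COGROUP }
    enum Shipping { NONE, PARTITION_BY_KEY, BROADCAST_ONE_SIDE }

    final class Parallelizer {
        static Shipping[] candidates(Pact pact) {
            switch (pact) {
                case MAP:       // embarrassingly parallel
                    return new Shipping[] { Shipping.NONE };
                case REDUCE:    // group by key; may come for free if a
                case COGROUP:   // key-preserving upstream step already partitioned
                    return new Shipping[] { Shipping.PARTITION_BY_KEY };
                case MATCH:     // repartition join or broadcast join, as in a parallel DB
                    return new Shipping[] { Shipping.PARTITION_BY_KEY,
                                            Shipping.BROADCAST_ONE_SIDE };
                case CROSS:     // Cartesian product: ship one side to the other
                    return new Shipping[] { Shipping.BROADCAST_ONE_SIDE };
                default:
                    throw new IllegalArgumentException();
            }
        }
    }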

The second thing you can do is actually reorder operators. Here, compared to the previous plan, I have reordered the Reduce and the Match PACTs. Basically, you can reorder two Map operators if you don't have any read-write or write-write conflicts; I will talk about that a little later. The point is that you can do these things correctly, right? And these things are very important: reordering, apart from decreasing the data volume, can actually give you more opportunities for parallelization, right? Or different options for parallelization. In this example, by pushing the Reduce down, the Reduce really decreases the data volume, and you can get plans that, for example, broadcast the [inaudible].

Very briefly, we can formally prove that we can reorder two Map functions if we don't have any write-write or read-write conflicts on their columns. Yeah? Only if all overlaps are read-read can they be safely reordered. You can reorder a Map and a Reduce function if that condition is true and, in addition, the Map function preserves the key groups. The challenge there is that a Reducer expects a certain set of records to be in the same group, and that must be preserved. For example, if you're trying to reorder a Reduce with a filter, the filter must either eliminate a whole key group or preserve the whole group; it cannot eliminate some records inside the group. Yeah? In the paper we actually have these kinds of theorems for every pair of PACTs. Basically, Map and Reduce are the base cases, and the rest of the PACTs are derivatives of Map and Reduce plus Cartesian product and union, so you can do it very easily for the rest. Based on this, you can actually do things like pushing down selections; you can do join reordering, so [inaudible] reordering; you can do limited forms of [inaudible], since you don't have any semantics about the aggregation in the Reducer (you don't know whether it's a sum or something different); but if you have some information about keys and [inaudible] keys, you can reorder joins, and in addition you can actually reorder Reducers with joins, for Reducers that you would not normally or easily write in SQL. So for programs that you would typically use SQL for, you can emulate most of the reorderings that a relational optimizer does. Good.
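
Once the read and write sets are known, the reordering test itself is simple set arithmetic. Here is a minimal sketch of the Map-Map case, with integer field indices standing in for attributes; the class and parameter names are mine.

    import java.util.Collections;
    import java.util.Set;

    // Two successive Map-style UDFs may be swapped if neither writes a field
    // the other reads or writes (only read-read overlap is harmless).
    final class ReorderCheck {
        static boolean conflictFree(Set<Integer> read1, Set<Integer> write1,
                                    Set<Integer> read2, Set<Integer> write2) {
            return Collections.disjoint(write1, read2)    // no write-read conflict
                && Collections.disjoint(read1, write2)    // no read-write conflict
                && Collections.disjoint(write1, write2);  // no write-write conflict
        }
    }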

The state of the optimizer right now is that it does the automatic parallelization. The reordering is in a prototypical state; it was a prototype for the paper. Basically, what it does is enumerate all possible plan reorderings, and for each of these it finds the best physical plan. For the evaluation, we implemented four tasks in this PACT programming model, so we wrote the UDFs ourselves. Two of these are TPC-H queries, the third is a pipeline that does text mining, and the fourth is a simple query that does clickstream processing and has, basically, a Reducer that is not really a [inaudible]. The first thing we saw is that the static code analysis finds the full, correct read and write sets for three of the programs; basically, using the static code analysis and enumerating all the plans, we covered one hundred percent of the search space for three of those tasks, and seventy-five percent of the search space for the other task. So it is a lossy analysis, but it mostly has good accuracy.

>>: Kostas? Can you clarify a little what you mean by accuracy? When you say 75 percent, it's still only producing correct results, it's just not covering the search space?

>> Kostas Tzoumas: We only get correct results, right? There's no way you're going to get a wrong result. But what we did was run the static code analysis, and we also hand-coded what the correct read and write sets would be.

>>: So the problem is that it-

>> Kostas Tzoumas: [inaudible].

>>: And that's because why? Because it computes too large a read set?

>> Kostas Tzoumas: We found a superset of the absolutely correct read and write sets. Right. What we also found is that, you know, if you have an optimizer that can reorder operators, you can get benefits in execution time. So in our setting, this is on a very small cluster of [inaudible]: we got over seven times difference for [inaudible], a two times better plan for clickstream processing, and a ten times better plan for text mining.

>>: Now, query optimization is of course generating the plan space, but the other part is selecting the best plan, right? So do you have some [inaudible] cost model, or do you do only conservative reorderings, or what?

>> Kostas Tzoumas: At this stage, we have a cost model that measures disk IO and network IO; what we do not have is a good selectivity estimation component. Okay? So we're working on that. We have a couple of thoughts on how to do that, and this is the reason also that this is not [inaudible] in the released version. Okay.

Some related work. In the bigger context of big data analytics and query optimization, there are a few systems out there; I talked about a few of them. The techniques that we use to do this static code analysis and reorder operators somewhat resemble the techniques that were used in [inaudible] compilers to do loop optimizations, so there's a small resemblance there. Concurrently with ours, there were a few papers out: there was a paper from Wisconsin on using static code analysis for Reduce functions, basically to find whether something is a filter, and then it recommends a [inaudible] index. And we recently found out that people here at Microsoft have been doing similar things, so SCOPE actually does a static analysis of these functions. The difference from our work is that you folks focus more, I guess, on the partitioning side and not on the reordering side, although I may be wrong here, so you know. [inaudible]. But it's interesting. And there are a few things they do that we would love to do: you can actually do a deeper kind of code analysis, since this read and write set is a very rough thing. You can check, for example, whether the UDF preserves the sort order of the data set, and so on.

So, basically, to summarize: I talked a little bit about the architecture of Stratosphere. Stratosphere is a completely new system for big data; it's written from scratch, and it has the complete stack from processing to query language. It's open source, [inaudible] license, and it's available at Stratosphere [inaudible]. What I focused on in this talk was basically how you can deeply embed these MapReduce-style user-defined functions in your query optimizer. The [inaudible] that we saw is, first of all, that you don't need full semantics; a few properties of the UDF are enough to do very powerful transformations. Second, you can accurately and safely estimate these using static code analysis, and, as I said, we are always correct. And you can emulate most of the transformation power of a relational query optimizer.

A few words on future work. You can do, as I said, deeper code analysis; you can do transformations that actually go inside the UDF code, so that, for example, split a function into two, or things like that. We are working, in the context of [inaudible], on writing our optimizer in a slightly different way: we are working on an optimizer that does not try to find the best plan, but tries to find a plan that is good enough under a wide variety of circumstances. The problem there is that in this cloud setting, you cannot possibly have accurate estimates. Even in the traditional setting your estimates are usually very off, and here things are even worse, so what we're working on is basically trying to find a plan that works well in a large part of this uncertain space. So, robust query optimization. There are a few people who are working on adapting the plan [inaudible], so as the characteristics of the [inaudible], as you get more information, adapting the plan. And Stephan, after me, will talk about how to handle iterative programs, so not just DAGs but cyclic graphs, and how to embed them into the query optimizer. Some other interests that I have, and that I won't talk about today, are how to integrate all these big data platforms with the new stuff that is happening at the networking layer. There's actually a big revolution going on in networks right now with software-defined networking and so on, and this is actually very relevant to us as database researchers: how to, you know, build for that space. Good. So these are the papers, and thank you very much. I'll be available for questions.

>>: We will take one or two specific questions here.

>>: So what version of MapReduce is this working for?

>> Kostas Tzoumas: So this is not built on MapReduce; it's actually a new code base. It uses [inaudible], the Hadoop file system, to store the data and retrieve the input data, so you can use that, but it doesn't share any code with Hadoop.

>>: [inaudible] have you two [inaudible], or is it just going to allow different types of programming frameworks? Are you thinking about-

>> Kostas Tzoumas: Yes, we are thinking about that. We are thinking about [inaudible] so that people can try it out very easily.

>>: So, if I can follow up on that? One key aspect of that [inaudible] at the bottom was that it has a very flexible scheduling model: it was designed to run on different infrastructures as a [inaudible], so it is actually able to have just a single master and then allocate workers on the fly for its tasks, and deallocate them, depending on what parts of the data it was running. So that was a good core design, and at the moment our [inaudible] that's trying to replace that connection to the infrastructure is a service controller with an interface to the [inaudible] manager, so it's allocating that system, because it's [inaudible].

>>: So when you did the Java code analysis, did you use existing Java tools to do that, or did you write your own program?

>> Kostas Tzoumas: We used a tool; we didn't write our own static code analyzer. The tool is called Soot. It operates on a three-address code abstraction.

>>: Okay. Thank you.

>> Kostas Tzoumas: Thank you.

>> Stephan Ewen: Okay. So, yeah, this talk is more or less going to follow up seamlessly on what Kostas did; it's in the context of the same system, and as Kostas mentioned, it's basically about the techniques we included to make the system an efficient runtime for algorithms of a different nature. And yeah, I don't think I have to do a big motivation for that, because iterative algorithms are everywhere when it comes to machine learning or data mining. The people who have played with Hadoop or with MapReduce and iterative programs probably know that, I mean, you can do it; you can always write a driver around a MapReduce program and invoke it time after time, but it's not doing great, and there have been modifications that try to overcome that. They've been doing a little better, but still not great, so we are presenting the techniques we've developed that actually make it quite a bit better. And the motivation for the whole setup is the following. With Hadoop and a lot of those other general-purpose big data analysis systems not being great at running those machine learning algorithms, iterative algorithms in general, there has actually been a variety of dedicated, specialized systems for iterative data processing. Google has published a paper about Pregel, which is this bulk synchronous processing adaptation for graph analysis; its open-source adoption is, of course, Apache Giraph. There's been the GraphLab system, by people from, I think, [inaudible] Berkeley and Carnegie Mellon. These are dedicated systems that try to overcome the problem that the general-purpose big data analysis systems are not really doing great at those iterative algorithms.

The problem that a lot of people are seeing today is if you're running that in a production setting; assume you're doing things like training a spam detector or a recommender on a periodic basis. You typically have this: the data is not in a format where you can just throw it into your graph analysis algorithm. You typically have something like Hadoop (you can replace Hadoop with Stratosphere, or maybe your favorite data warehouse) to actually extract the features of the data from your sources, transform it into a format, probably normalize it; then you throw it into this specialized system for iterative data processing; then you take the result, you actually post-process it, maybe map it back to the keys you used in your production system, and so on. So you actually have a rather complex processing [inaudible] plan. And this is basically all because the systems used for the first and the last step, which are typically general-purpose dataflow systems, or MapReduce, or a parallel database, are not really good at those iterative algorithms. So the idea was, if we could just make the dataflow systems good at this stuff in the middle, then you could actually simplify the whole process by a lot. And for a lot of reasons we think that just making a dataflow API that is able to process iterative algorithms efficiently is a very worthwhile thing.

First of all, you actually see that a lot of those specialized systems are tailored to very specific use cases, whereas a dataflow API with some extensions is the basis for a very flexible model. As you will see in the later part of the talk, we can layer custom APIs on top of it very easily; they typically correspond to just a very specific dataflow, so Pregel is a special case of iterative dataflow as we're going to present it later. You can also compile DSL scripts to it very easily, or you can program directly against it, like people program against MapReduce.

So, I mean, if you think about making dataflows iterative, the first thing you need is actually not so complicated. Let me motivate that with an example. This is an implementation of PageRank as it looks in Stratosphere; you can think of it as a join and a reduce, because that's more commonly understood, actually. PageRank is just an iterative matrix-vector multiplication, where the vector is actually a vector of ranks, one for each page, and the matrix is, if you wish, the topology of the network with transition probabilities on the edges. What you do in each step is join the current rank vector with the matrix and then aggregate the rank per page in the reducer. So, to represent that as a dataflow: the rank vector is a set of pairs of page IDs and ranks, so for each page, the current rank; the transition matrix is a set of triples. The matrix is actually very sparse; a lot of pairs of pages don't have edges between them, so the transition probability is zero and we just skip the entry from the matrix, representing it in a sparse way. So the matrix here is just a set of triples: source page ID, target page ID, and a transition probability. The algorithm just joins them and aggregates.
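
A rough sketch of the per-iteration logic just described, in plain Java over in-memory collections standing in for the join and reduce operators (the damping factor is left out for brevity, and every source page is assumed to have a rank):

    import java.util.HashMap;
    import java.util.Map;

    // One PageRank step: join ranks with the sparse transition matrix on the
    // source page ID, then sum the contributions per target page (the reduce).
    final class PageRankStep {
        // ranks: pageId -> rank; edge i goes from src[i] to dst[i] with probability prob[i]
        static Map<Long, Double> step(Map<Long, Double> ranks,
                                      long[] src, long[] dst, double[] prob) {
            Map<Long, Double> next = new HashMap<>();
            for (int i = 0; i < src.length; i++) {
                double contribution = ranks.get(src[i]) * prob[i]; // the "join" on page ID
                next.merge(dst[i], contribution, Double::sum);     // the "reduce": sum per page
            }
            return next;
        }
    }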

Okay. So if you want to make that iterative, you can always say, okay, let's just run it again and make sure that the next iteration consumes the output of the previous iteration. You can emulate that by just renaming the input files or so, or you can just say, okay, let's include a [inaudible], a physical feedback edge, in the system. That's actually a pretty simple idea, and it helps a bit, but if you run that repeatedly, you'll actually see that it does a lot of redundant work. For example, it's going to do all the full processing of the join input, which it actually doesn't need to. That is something that has motivated some of the specializations of Hadoop for iterative processing. For those of you who are familiar with, for example, the HaLoop or the Twister system: what they did was, for example, give you the ability to say, okay, here's a loop-invariant data set for the reducer, which would then be partitioned and sorted only once and then cached as invariant.

That is actually very neat, but of course we have a general-purpose dataflow system, we have an optimizer, so let's just get the optimizer to do that. And what it takes to do that is actually very simple. You just have to identify, in this dataflow, the difference between the dynamic path, which is basically everything that is a successor of the feedback edge in the cyclic flow, and the static path, which is basically everything that's not a successor. We enable the runtime to cache the data set at the point where the constant and the dynamic data paths meet. Caching here is something like placing a temp table; actually, it doesn't really need to be a temporary table, but if that join is, for example, using a hash table, you can just make the hash table persistent across iterations, or something like that. And we just gave the optimizer a notion of this constant and dynamic data path and gave the costs a slightly different weight: the costs on the dynamic data path are weighted more than those on the constant data path. And what comes out of that is immediately very nice; you see that it comes up with plans that just try to push work into the constant data path and cache the result. So those are two different [inaudible] execution plans for this PageRank algorithm; the [inaudible] people will immediately see the difference between a broadcast join and a partition join. The interesting thing is, if you look at those two variants, they are actually two very famous hand-optimized variants. The left one, which has been in Mahout for quite a while, is optimized for computing PageRank in setups where the rank vector can be broadcast, and the right one is close to the execution that Google described in its MapReduce paper. So the optimizer just gives you those two variants, depending on the size of the rank vector. So that's all fine. That's an easy win.
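
The weighting just described fits in one line; the multiplier below is my simplification of whatever weight the optimizer actually uses, but it captures why work migrates toward the constant path.

    // Toy cost model for iterative plans: the dynamic path is paid every round,
    // the static (constant) path only once, so plans that push work onto the
    // static path and cache the result win.
    final class IterativeCost {
        static double planCost(double staticPathCost, double dynamicPathCost,
                               int expectedIterations) {
            return staticPathCost + expectedIterations * dynamicPathCost;
        }
    }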

Okay. That's the summary: all you need to do is make the optimizer aware of the feedback edge, of the constant data path and the dynamic data path, and of the different weights. Still, there's a class of algorithms that does kind of suboptimally in that setup. And yeah, let me motivate that with a very simple algorithm. This is an algorithm taken from the Pegasus paper, Faloutsos's paper on graph analysis. It's an algorithm that finds the connected components of a graph, and it does that in a very simple way: each vertex is assigned a component ID, and after the algorithm has run, all vertices that are in the same connected component should end up with the same component ID. The algorithm iteratively takes the component ID from each vertex and tells it to all of the vertex's neighbors, and each of the neighbors just says, okay, if I see an ID that is smaller than my own current one, then I'm going to take that one, and it goes on. You can actually see that after a few iterations, all vertices in the same connected component end up with the ID of the vertex with the smallest ID. And I've colored the vertices in the example a little differently; the reason is the following: not in every step will every vertex end up with a new ID. And if a vertex doesn't end up with a new ID, then basically nothing changed for that vertex, so there's no need to take the information of that vertex and incorporate it into the next iteration, right? This is a phenomenon that is often called sparse computational dependencies; it's especially present in graphs, but also in a lot of other problems that can be modeled as graphs, so you really want to take it into account.
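
A compact sketch of one propagation round, which also records which vertices actually changed, the red ones in the example; the names and the edge representation are mine.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // One round: every vertex offers its component ID to its neighbors; a vertex
    // adopts the smallest ID it is offered. Returns the set of changed vertices.
    final class ComponentsStep {
        static Set<Long> step(Map<Long, Long> componentIds, long[][] undirectedEdges) {
            Map<Long, Long> candidates = new HashMap<>(componentIds);
            for (long[] e : undirectedEdges) {
                candidates.merge(e[1], componentIds.get(e[0]), Math::min); // offer ID along edge
                candidates.merge(e[0], componentIds.get(e[1]), Math::min); // and in reverse
            }
            Set<Long> changed = new HashSet<>();
            for (Map.Entry<Long, Long> c : candidates.entrySet()) {
                if (c.getValue() < componentIds.get(c.getKey())) {
                    componentIds.put(c.getKey(), c.getValue()); // adopt the smaller ID
                    changed.add(c.getKey());                    // only these matter next round
                }
            }
            return changed;
        }
    }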

You can always run an algorithm like that in the bulk fashion as we had it before. You can model this connected components algorithm actually in a very similar way; I've modeled it the same way, with a join and a reduce. The join is again between the vector that represents the state of the vertices, in this case just vertex ID and component ID, and the matrix, which is now even simpler: it's just the edges. You join again, where joining means every vertex tells the endpoints of its outgoing edges about its component ID, and then you do a minimum aggregation over all the candidates. If you run the algorithm like that, you'll actually see that it does a constant, large amount of work in every iteration, right? Because it fully recomputes the entire next state: it consumes the previous model entirely and recomputes a complete new version of it. And it will, in many cases, end up recomputing just the same state that was the input to the iteration.

If you look at the green curve, this is the number of vertices that really changed in an iteration, so those that contributed something, or that would potentially contribute some new information to the next iteration; these are the ones colored red in the previous example. It is true for many algorithms that the number of vertices that contribute goes down after a few iterations. That is not only true for something like connected components or finding shortest paths; it's also true for modified versions of PageRank that try to adaptively schedule the computation. The difference in the work you do per iteration is, of course, huge, so exploiting that is very desirable.

So how do we actually do that in our system? Because it's a property of dataflow processing that each operator's result depends only on its inputs, right? Dataflows are kind of a natural match for the bulk version of iterative processing, and this was one of the main motivators, actually, for Google's Pregel system, because there they can really keep the state of a vertex across iterations. How can we encapsulate that, still in a very generic fashion, and embed it into a dataflow? One thing was that we said, okay, we need two representations of iterations in dataflows. We have the bulk representation, which is the one I just showed you, the rather straightforward one, and we have an incremental type of iteration, where we try to encapsulate exactly the nature that only parts of the model change. You'll later see that there's actually a special case of those incremental iterations that you can even run without the typical superstep synchronization which you have in most systems after each iteration.

Okay. So the abstraction that we picked for representing that in a dataflow is the workset algorithm abstraction. Workset algorithms are a technique actually from compiler construction, where they are used for optimizing loops, so that was intuitively a good match; they are used to make loops incrementally compute the next element they work on. And you can think of it like this: if you know semi-naive evaluation in recursive computation, then workset algorithms are on the same level for iterative computation.

It's doing a very similar thing. What we are giving the system here is an abstraction that is not working on the solution set, the single set that is recomputed every time, but on a work set. Instead of computing a new solution set each time, it computes, out of the work set, a delta for the solution set, and then uses this delta and the previous work set to compute a new work set. You typically model the algorithm such that the work set holds candidate elements that may go into the delta, and from the delta and the previous work set, you compute new candidates.

If you want, here's a brief illustration of this connected components algorithm as a workset algorithm. This is our solution set, really the state of the vertices, and the work set holds the candidate component IDs. I've modeled it like this: it's basically grouped, as it goes into the reducer, by the target vertex. Vertex one here gets from vertex two the candidate component ID two, and from vertex three the candidate component ID three; similarly, vertex six gets candidate five from vertex five. What you do with that is you compute a delta for the solution set, which you merge into the solution set, and then you take those two to compute a new work set; with that you get a new delta, which you merge into the solution set, and so on. And you can actually see that the work set typically shrinks, so the termination criterion for workset algorithms is really the empty work set, just as for recursive algorithms under semi-naive evaluation. [inaudible].
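
Putting the pieces together, the driver loop of such a workset iteration might be sketched as below; the functional interfaces are my stand-ins for what the system expresses as dataflow operators.

    import java.util.Set;
    import java.util.function.BiFunction;

    // Workset-iteration skeleton: derive a delta from the work set, merge it into
    // the solution set, compute the next work set, stop when the work set is empty.
    final class WorksetIteration<S, W> {
        S run(S solution, Set<W> workset,
              BiFunction<S, Set<W>, S> computeDelta,       // candidates -> delta of the solution
              BiFunction<S, S, S> merge,                   // modified union of solution and delta
              BiFunction<S, Set<W>, Set<W>> nextWorkset) { // delta + old work set -> new candidates
            while (!workset.isEmpty()) {                   // termination: empty work set
                S delta = computeDelta.apply(solution, workset);
                Set<W> next = nextWorkset.apply(delta, workset);
                solution = merge.apply(solution, delta);
                workset = next;
            }
            return solution;
        }
    }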

Okay. And this really encapsulates the dynamic scheduling of the computation. If we represent this as an iterative dataflow, we have a feedback edge from the next iteration's work set back into the system, and a delta, which goes back into the solution set not really as a union but as a modified union operation. And this dataflow is now very similar: the join is basically still the join you had in the algorithm before, between the edges and, in this case, not the entire solution but what came out of that operation, which also goes into the delta. And this is the reducer, so the CoGroup (which, as Kostas mentioned, is a [inaudible] reducer), which in this case is not reducing only over the work set, but also comparing against the solution set to compute a proper delta. And yeah, interestingly, you can more or less take this plan as a template to implement any [inaudible] algorithm. So, as I said before, if you use such a more flexible abstraction, you'll actually see that a lot of the specialized systems are just special-case dataflows on top of that abstraction. This is basically the abstraction of Pregel on top of it.

There's one thing I mentioned earlier which might be interesting to note: in some cases you can eliminate the superstep barriers here. This is the same plan as before for the incremental algorithm, but the difference now is that we don't use a CoGroup here; remember, a CoGroup, like Reduce, is a group-at-a-time operation, and we replace it by a tuple-at-a-time join operation. So what we're doing in this case is, rather than taking out the entire work set and combining it with the solution set, we take one element at a time out of the work set, if you wish, take this candidate, and see whether it actually should affect the solution set. And the reason we can do that is that in this special algorithm we have an idempotent update operation in here: basically selecting the minimum, the smaller one of the current state in the solution set and the new candidate ID. Because of that, you don't need to operate on the whole group at a time; you can just process the elements individually. So if the algorithm's update actually is idempotent, or, it doesn't even necessarily have to be idempotent, there are also weaker conditions which suffice, if the algorithm fulfills the condition, you can often transform the group-at-a-time operations into tuple-at-a-time operations, and if you have only tuple-at-a-time operations, you can eliminate the superstep boundaries and let the algorithm run in a basically unsynchronized fashion, more or less.
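
Why the min update tolerates this is easy to see in code: applying candidates one at a time, in any order, even repeatedly, converges to the same result. A sketch with illustrative names:

    import java.util.concurrent.ConcurrentHashMap;

    // Tuple-at-a-time update: min is idempotent and order-insensitive, so each
    // candidate can be applied on its own, without waiting for a superstep barrier.
    final class AsyncMinUpdate {
        private final ConcurrentHashMap<Long, Long> componentIds = new ConcurrentHashMap<>();

        // Apply one (vertex, candidate ID) pair; report whether the state shrank.
        boolean offer(long vertex, long candidateId) {
            Long before = componentIds.getOrDefault(vertex, Long.MAX_VALUE);
            Long after = componentIds.merge(vertex, candidateId, Math::min); // atomic min
            return after < before;
        }
    }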

Okay. One last interesting thing I want to point out here is that with this abstraction, you can actually express algorithms that, for example, Pregel cannot. I said earlier that there's an interesting variant of the PageRank algorithm, called adaptive PageRank, which tries to exploit dynamic computation and builds on the fact that the pages that end up with a very small rank typically converge very fast to that small rank, and most of the later iterations actually work only on the high-rank pages and on the pages that take special roles, like hubs or so, between densely connected components. There is an implementation of that on Pregel, but only in a very hard-coded fashion, whereas in this case it's just another special case of a dataflow. So it's a pretty versatile abstraction which allows you to do quite a bit.

A few performance numbers on how we did with that. We compared our implementation of the bulk algorithms, which is the first version, just with the naive single feedback edge, with an implementation of the incremental algorithm, and also here with the elimination of the superstep boundaries. We compared against Giraph, the specialized system for graph analysis, and Spark. Spark is a pretty neat implementation, actually, of a dataflow system that is very good at those bulk iterations. We ran that on four different graphs, and one thing that's directly visible is that the systems that can exploit these sparse computational dependencies are always faster than the other ones. It seems that our bulk processing and Spark's bulk processing are, let's say, in the same order of speed, and the incremental processing is actually competitive with the specialized implementation on Giraph. On the larger data sets, we could actually only run a comparison between our own different modes of execution. The problem is the following: our cluster is actually not that big, and some of the data sets are pretty big, so with the generated candidates and so on, you actually create intermediate data volumes that exceed the size of main memory. Because this is really just a special case of a dataflow, with all the algorithms implemented in such a way that they operate well under memory pressure (I mean, the join just spills to disk and becomes external, and so on), it just worked. Whereas both Giraph and Spark, being specially tailored and not building on top of previously existing implementations of those algorithms as they're known from the literature, were actually not able to handle that, at least at their current state of implementation. They were not able to handle those larger data sets.

>>: Why is it that the bigger versions are more expensive than the purple? What is the purple one?

>> Stephan Ewen: This one here?

>>: Yeah.

>> Stephan Ewen: This one is the one without the superstep boundaries; this is the one with superstep boundaries. So, okay, why that one is more expensive. Let me go back to the plan here. Okay. If you look at the way this one executes, it's basically taking the smallest of the candidates, comparing it against the current component ID of that vertex, and only if that one is smaller does it put it into the delta and into the join that creates the new candidates. If you look at this one here, maybe it is at first only taking the second smallest one, updating the state with the second smallest one, and then creating candidates based on the second smallest value, and only then updating the state with the smallest value, so it's creating new candidates based on the smallest component ID. The candidates based on the second smallest component ID will have no effect in the end, but they're still processed in the system. So, depending on the structure of the graph, it may sometimes be more expensive, it may sometimes be cheaper. This is something we actually saw here that can make a difference sometimes.

In general, this is some of the follow-up work; we have to investigate that a little more. The ability to run algorithms asynchronously actually opens a very different realm of doing it, which we have not yet explored. We assume, actually, that they are more robust to processing in the presence of, let's say, stragglers and slowed-down nodes, because we don't have to globally synchronize at each step, but that is something we are currently exploring. At this point, I just want to point out that it's actually possible to run that model. The true benefit of it is something that we are currently investigating. Okay.

A bit of related work on that. I mentioned before that there were adaptations of Hadoop, and MapReduce in general, to make it run iterative algorithms efficiently, and the optimizations they put in there are actually mostly subsumed by the optimizations the optimizer does here. There's the bulk synchronous parallel processing model; its adoptions are Pregel and Giraph, which are actually special cases of the incremental iterations that we showed here. An interesting parallel work on that was by people from the former Yahoo group, the [inaudible] group, which is now here, and UCI, Mike Carey's group; they stumbled onto very similar observations of the problems, and what they did is they tried to model it exactly in the recursive way. Recursion and iteration are theoretically equally powerful, so you can actually build an abstraction for this from both. They compile recursive definitions of the problems down to dataflows, which is actually possible. At the moment, I think the abstraction that we have here might even be somewhat simpler, because in the recursive [inaudible] expression you have to include certain tricks, like temporary variables, to structurally say, okay, this derivation is actually a version of the old derivation, and so on. But it's, in general, equally powerful. And then there's the specialized system GraphLab, which I think is, of all the specialized systems, the most interesting one. You can actually model the GraphLab processing almost as a special case of the parallel dataflow. The only thing GraphLab does at the moment that we don't: GraphLab allows you to work asynchronously, which we also do, but they have a special way of being able to relax consistency constraints, because a lot of the algorithms are numerically stable enough to still work under relaxed consistency, which we don't allow at the moment. But we are actually looking at how it's possible to tell the system to relax consistency only where that is acceptable.

Okay. So, yeah, that's the second part of the talk. To conclude: integrating iterative algorithms into dataflows is a very worthwhile thing, because it can actually subsume a lot of the specialized systems and can lead to homogeneous data processing pipelines. Exploiting sparse computational dependencies and dynamic scheduling is something that is very important to do, and workset algorithms, the workset abstraction, are actually a way to do that without violating the principles of dataflow programming, like side-effect-free operators. And if you integrate that properly, you can actually end up being very competitive with the specialized systems. Okay. So that's it. Thank you.

>>: Two questions. What is the optimizer's role across iterations? Do you optimize different iterations [inaudible] with different plans, or is it a static plan?

>> Stephan Ewen: Yeah. Okay. That's very interesting. The feedback edge in the model doesn't necessarily have to be a physical feedback channel; it can just be logical, and you really unroll the loop, basically lazily, in the processing, and that would actually enable you to use different plans in each iteration. We're not doing this at the moment; we are currently optimizing the expected costs for the first iteration. There are, of course, cases where the size of the partial solution changes, especially if you have something like inflationary fixed points or so, and it's definitely worthwhile to look at that. We have not done that at this point. So the system is theoretically able to do that, but we've not yet done the incremental optimization aspect of it.

>>: [inaudible] setting of the [inaudible] and you cache the [inaudible]? So how do you revoke, how do you restart the computation that is in the cache?

>> Stephan Ewen: So the cache is basically at the end of the static data path. Okay, at the moment, what we're doing is: when a node goes down, we're assuming that we can actually allocate a backup node for it. What we are not doing is, if we started on ten nodes and we hashed everything into ten partitions, rehashing into nine partitions or so, because that would actually mess up the cache. That is true. As soon as you can allocate a new node, which takes over that partition, then you do fault tolerance recovery just as in the non-iterative part, because it's part of the static data path; you redo some work. Actually, doing recovery and fault tolerance in a system where the operators do not have the blocking property is a very interesting problem of its own, on which the people in our research group who are building the dataflow engine are working, so I can tell you a little about that offline if you're interested, but actually, this works very transparently for this part. With the limitation that you need a backup node; you don't redistribute among a smaller set of nodes, at the moment.

>>: [inaudible] I'm trying to understand what the role of the optimizer is in this case. It looks like the optimizer just places the cache [inaudible]?

>> Stephan Ewen: The optimizer does basically exactly what Kostas described, and in addition it uses a heuristic to place the cache. As I said, the optimizer doesn't really need to be changed all that much, except that you need to tell it to identify the dynamic and constant paths and use the different cost weights. Everything else comes almost for free, because it's the regular operation of what the optimizer does anyway. There is, I think, a slight difference in the way the optimizer works at the moment in our system compared to a classic [inaudible]-style optimizer, and that is that interesting properties, when they're propagated top-down, are not only used in the pruning phase, where you say, okay, I keep a plan that is likely more expensive [inaudible] if it fulfills an interesting property; they're also used to generate candidates that actually realize strategies further down the plan than where they're actually needed. This is the way the work actually ends up further down, on the constant data path. But that's the only difference, I think, to the classic Volcano model.

>>: [inaudible] based on a heuristic, I assume? So you probably have to identify the group and then have some heuristic to decide whether to cache or not to cache?

>> Stephan Ewen: At the moment we always cache at the point where the constant and dynamic data paths meet, and we place the cache, yeah, you see in that case, actually, after the built hash table, so it caches the hash table as a whole; the hash table just becomes persistent across iterations, and yeah, this is a probing operation. I mean, this is not really an operation; it's just added in this visualization to illustrate what the plan does. The cache is actually after the latest operator that really changes the data in this one, too.

>>: [inaudible] the optimizer plans as it normally would? So you have to give it a little bit more weight [inaudible]? But in principle you can do things like pulling things out of the iteration [inaudible]?

>>: [inaudible] however, there's a heuristic that is used for generating the plan candidates, and those heuristics [inaudible] the cache operation. I think Stephan just gave the simple heuristic, which is: where the dynamic data path comes in and meets the static data path, that's where the cache operators are-

>> Stephan Ewen: So I'm not sure if you're maybe referring to optimizations like the magic-sets transformation, which tries to actually push selections from a somehow successive iteration to a previous iteration; I mean, this is something that you could add. This has been added to the optimizers of relational databases, and a relational database optimizer is, if you wish, similar to this one as well. The problem that we're always having is that we're operating on a smaller set of semantics, so sometimes those optimizations are a little harder to apply. But if you follow the project, you'll actually see that we're moving to a semantically richer model next, so I think then it would be possible.

>>: Iteration brings up quite a lot of interesting optimization issues. Lots of things that you can do.

>>: All right. Any other questions? Not related to optimization? All right. Thank you, Stephan.

>> Stephan Ewen: Thanks for listening.
