>> Rich Draves: All right. So pleased to see a big crowd. Here to welcome
Matei from Berkeley. So Matei, of course, is a candidate, interviewing with a
number of different groups here today and tomorrow at Microsoft. Matei is
well-known for his work in sort of big data or cloud computing or cluster
computing, depending on how you look at it. We're building systems, ranging
from ESOS to spark and systems which have actually achieved, already achieved a
significant adoption industry, which is unusual for many graduate students.
Matei has also had a lot of success on the academic side. Last year, he was
fortunate to win two different best paper awards, Sigcomm and NSDI. His
interests are wide-ranging. In addition to the sort of big data, cloud
computing area, for example, he's also worked with folks here at Microsoft in
things like gene sequencing, made some significant advances there.
So let's welcome Matei.
Look forward to a good talk.
>> Matei Zaharia: Thanks for the introduction, Rich. So I'm going to talk
today about big data analytics, and basically bringing that to new types of
applications that need to process data faster than before.
And so feel free to ask questions and stuff throughout the talk. I think I
don't need to talk really about the big data problem at Microsoft. Everyone
here is probably very familiar with it. But basically, the problem is in a lot
of domains, not just web applications, but also things now like scientific
instruments or gene sequencing or things like that, data is growing faster than
computation speeds.
And basically what we have are these data sources that keep producing it.
Either you have many more users using your application on mobile devices or on
the web, or you have scientific instruments like gene sequencing instruments or
telescopes that are speeding up faster than [indiscernible] allow. And you
have cheap storage so you don't ever throw away the data. You just buy more
disks and store it. But you have stalling clock rates and it's getting harder
and harder to work with it.
So as a result, people are now running on a very different type of
infrastructure. They're running applications on these very large clusters.
And that's kind of the only way to actually deal with this data. And because
of that, people are also adopting a new class of systems to work with these
clusters. So just as an example, Hadoop MapReduce, the open source
implementation of MapReduce started out at a bunch of web companies, but it's
now used at places like Visa and Bank of America and a whole bunch of places in
more traditional enterprises. And it's a growing market. It's projected to
reach $1 billion by 2016.
So for people doing research in systems, this space is both exciting and
challenging. So first of all, it's exciting because these large clusters are a
new kind of hardware platform with new requirements and basically whole new
software stack is emerging for them. And there's a chance to actually
influence this software stack.
Often in most systems, maybe less so at a company like Microsoft, but in
general in many areas of systems, it's hard to go from, you know, doing some
research to seeing it actually being out there and seeing what happens if
people try to use it, how it actually works in practice.
But in this space, this is a space where people are adopting new algorithms,
new systems, even new programming languages to deal with this kind of data. So
as a researcher, there's a chance I can actually get things tried out and see how
they work.
At the same time, though, it's also challenging. This is because apart from
just the large scale of these systems, which has its own problems, the demands
on them are growing. So in particular, users have growing demands about
performance, flexibility, and availability.
By performance, I just mean they want to get answers faster, even though the
amount of data is growing, or they want to move from a batch computing model
to streaming and closer to real time. By flexibility, I just mean more types
of applications that they want to run in parallel with different requirements.
And by availability, I mean basically high availability. So when you're
running MapReduce every night to build a web index, it's okay if you miss it
one night. But when you're running it to do fraud detection in close to real
time and that breaks, you're actually losing money.
So people want all these things out of big data systems. So my work has been on a
software stack that addresses two of the core problems in this field. And
these are programming models and multi-tenancy. So for programming models,
there's been a lot of work out there on batch systems, which are really great
for making this data accessible. But users also wanted to support interactive
[indiscernible], complex applications, things like machine learning algorithms
and streaming computation. And these are the programming models I worked on.
For multi-tenancy, one of the major problems that happens when you have these
clusters is they're larger and they're shared across many users. So many of
the problems that happen in a traditional operating system, with multi-tenancy,
you know, kind of the mainframe type operating system are happening again here
and you need algorithms to share these clusters efficiently between users.
So this is the stack of systems I worked on. Basically, at the top, the things
in blue are parallel execution engines. I'm going to go into more detail, but
Spark is the underlying engine. And on top of that, we built a streaming
engine, Spark Streaming, and also something called Shark, which does SQL.
In the middle, Mesos and Orchestra are two systems for resource sharing. Mesos
is for sharing resources on the machines, like CPU and memory. Orchestra is for
sharing the network among parallel applications. And the ones at the bottom,
these are a bunch of scheduling algorithms I worked on that tackle different
problems that happen in these datacenters, like fairness for multiple resource
types, or data locality or stragglers. And these are both about determining
the right policy and also coming up with efficient algorithms to deal with
these things.
So the work at the top addresses the first problem of programming models. The
work at the bottom is about multi-tenancy.
In this talk, I'm going to focus mostly on the top part, but then I'm going to
also come back at the end and talk a little about one problem here, because I
think there are some cool problems with algorithms and policies as well.
So let me just start with some really basic background on this. I think people
here probably mostly know this. But basically, when you're running in these
large datacenters, there are really two things that make it hard and that are
different from the previous parallel environments people have considered. And
these are failures and stragglers.
So the problem with failures is that anything that can go wrong, you know,
fairly rarely on a single machine will start happening a lot more often on
a thousand machines or ten thousand. So if you have a server, for example, the
mean time between failures on a typical server might be three years, and you
put a thousand of those, now your mean time between failures is a day. And if
you put, you know, 10,000, something's going to fail every couple of hours. So
that's one problem.
Stragglers are actually an even more common thing, which is a machine hasn't
just outright failed but for some reason it's slow. Maybe there's a component
that's dying, but it's not actually failed yet, like a disk, and it's reading
really slowly. Maybe there's contention with other processes. Maybe there's a
bug in the operating system or the application.
And the problem is you're doing this parallel computation on a thousand nodes,
but if one of them is slow and everyone waits for that, you know, you're losing
all the benefits of that.
So Google's MapReduce is one of the first systems to handle this automatically.
And the thing that was interesting about MapReduce was not really the
programming model, but just the fact that it did these things automatically.
And basically, you know, if you haven't seen it, the point of MapReduce is just
that there are these two phases of map and reduce tasks. You read from a
replicated file system, and you build up this graph of processes that talk to
each other. And if any of them goes away, the system knows how to launch a new
copy of that and splice it into the graph. Or if one of them is slow, it can
launch a second copy and splice it in.
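[Editor's note: as a concrete illustration of the two-phase model being described, here is a minimal word-count sketch written with plain Scala collections rather than an actual MapReduce framework; the data is made up, and a real Hadoop or Google MapReduce job would use that framework's mapper and reducer APIs instead.]

    // Map phase: emit (word, 1) pairs from each input line.
    val lines = Seq("to be or not to be", "to do or not to do")
    val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))
    // Shuffle + reduce phase: group by word and sum the counts.
    val counts = mapped.groupBy(_._1).map { case (w, ones) => (w, ones.map(_._2).sum) }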
So MapReduce was really great for batch computations, but it's only this one
pass batch computing model. And what I found in talking to users is that users who
put their data into these systems very quickly need it to do more. And in particular,
users need it to do three things. They need it to run more complex algorithms
that are multi-pass. Things like machine learning or graph computation that go
over the data multiple times.
They need it to do more interactive queries. So it's great that you can, you
know, sort of cull the web and build an index in a few hours every night, but
now if I have a new question, can you answer that question in two seconds or do
I have to wait two hours again, or do I have to build something? They also
want it to do real time stream processing. So you build, say, a spam detection
classifier and you train that every night; now can you train it, you know, in
real time as new spam messages come up? Move that application into something
you can run in real time.
So one reaction to these needs is that people build specialized models for some
of these applications. So, for example, Google's Pregel is a model for graph
processing. There have been a lot of iterative MapReduce systems. There's a system
from Twitter called Storm that's very popular for streaming.
But there are two problems with this. First of all, these specialized systems
only cover one use case at a time. So if you have a use case that is still a
complex application, but maybe it's not graph processing, the systems out there
might not be good for it. And the second one is that even if you have models
for all the things you want to do, it's hard to compose them into a single
application. And in sort of the real world, many users want to start by doing a
SQL-like query. Now you want a graph algorithm on the result. Now you want a
MapReduce or, sorry, machine learning
algorithm on the result of that. And with separate systems, it becomes hard.
So our observation behind all this work is that these complex, streaming and
interactive apps actually all have a common need, and there's one thing they
need that MapReduce lacks, and that thing is efficient primitives for data
sharing. So these are all applications that actually perform data sharing
between different parallel steps, and that's the thing that would make them
work better.
I'll show a couple of examples.
This one here is an iterative algorithm, and this is a common pattern in many
algorithms. You can take that. That's actually my coat. Yeah. So okay. So
yeah, so this is a common thing. For example, if you imagine something like
page rank, page rank is basically a sequence of MapReduce jobs. If you run
this on something like just Google's MapReduce or Hadoop, the problem is that
between each job, you're storing your state in the distributed file system.
And just reading and writing to the file system is slow, because of data
replication across the network and also because of disk IO. So it just makes
it slow.
Another case is interactive queries. So interactive queries, you often select
some subset of the input that you're going to ask a bunch of questions about,
and so all these queries share a common data source. And I'm going to come
back to streaming later, but streaming also involves a lot of state sharing as
well, because you maintain state across time in the computation.
So these things, if you just run them with MapReduce and sort of the Google
like stack, they're slow because of the replication and disk IO that happens in
the storage system. But those two aspects are also necessary for fault
tolerance. So that's why the file system replicates data.
>>: [indiscernible] isn't a constant. You see systems that use single
replicated files.
>> Matei Zaharia: Yeah, that is true. You can have intermediate files that
are single replicated, but if you have -- so, for example, in this case, it's
hard to do that because you don't even know which queries you're going to ask
in the future. So if you have a thing like [indiscernible] where you submit a
whole graph at once, you can do it. But here, the abstraction that you see as
a user is just, you know, I can make files and then I can run MapReduce on
them.
>>: So you're saying the system doesn't know if it's a [indiscernible] file or
not?
>> Matei Zaharia: Yeah, exactly. It's just that there's no -- yeah, there's
no explicit abstraction across parallel jobs, yeah, for data sharing.
>>:
Okay.
>> Matei Zaharia: But what I'll talk about is definitely, you know, based on
what people do when they do know the future graph. Okay. So our goal with
this was to -- can we do this sharing at the speed of memory, and the reason to
do it is really simple, is because memory is easily 10 to 100 times faster than
the network or the disk. And if you think about it, even if you have a very
fast, full bisection kind of network, say you have 10 gigabit or 20 gigabit
ethernet such as in [indiscernible] datacenter storage, that's actually still
about a factor of 20 or 30 slower than the memory bandwidth in a machine. In a
machine, you can easily get about 400 gigabits per second of memory bandwidth.
So that's why we wanted to do this at the speed of memory. But the challenge
there is how do we actually make it fault tolerant if we just said that the
disk and the network are things we can't push data over.
So there have been a bunch of existing storage systems that put data in memory,
but the problem is neither of -- none of them actually have this property of
doing everything at memory speed. And the reason why is because they're based
on this very general shared memory abstraction. So basically, these systems
give you abstraction of a mutable state that's sitting out there and that you
can do fine-grained operations on, like reads and writes to a cell in a table.
And these include things like databases, key value stores, ram cloud, file
systems, all these kinds of things. That's the abstraction they provide.
And these all require replicating the data or, you know, things like update
logs about what you did over the network for fault tolerance. And we just said
that replicating it is much slower than the speed of writing to memory.
So the problem we looked at, then, to deal with these is can we provide fault
tolerance without doing replication? And we came up with a solution to this
called resilient distributed data sets or RDDs. And basically, RDDs are a
restricted form of shared memory that makes this possible. So RDDs are
restricted in two ways. First of all, they're immutable once you create them.
So they're just partitioned collections of records; once you write them,
they're [indiscernible] immutable.
And second, you can only build them through coarse-grained, deterministic
operations. So instead of building these by reading and writing cells in a
table, you do something like apply a map function to a dataset or apply a
filter or do a join. And there's all kinds of operations you can do in that.
Now, what this enables is to do fault recovery using lineage instead of
replication. So instead of logging the data to another machine, we're going to
just log the operation we did on it, and then if something fails, we're going
to recompute just the lost partitions of the dataset.
So just to give you an example of what this looks like, here's some operations
you might do with RDDs. So maybe you start with an input file that's spread
across three blocks, three different machines. And maybe you start by doing a
map function. So you give a function F that you're going to apply to every
element.
So now you're going to build a dataset, this is an RDD. And basically, the
circles there are partitions and the whole thing, you know, is an RDD, is a
dataset. And this is not going to be replicated. There's just one partition
sitting on each machine.
You might then do, for example, a group-by. So [indiscernible] function G and
you do another operation with that function on this data. And we might do a
filter where you pass it to function H. This is how you've built your dataset.
You've done these parallel operations.

>>: [indiscernible] deterministic?
>> Matei Zaharia: Yeah, we do require them to be deterministic. That's an
assumption we're making, yeah. Okay. So that's what you get. And now, if
something goes missing, you can look at this dependency graph to rebuild
things. So, for example, if this guy goes missing here, we can rebuild it by
just applying H to this partition of the parent dataset. And we can get it
back.
Even if multiple chunks go missing, you can go ahead and build them again in a
topological order. The other thing this doesn't show but it actually matters a
lot, in practice, is in practice, on each machine, you're going to have many
different data partitions. And when a machine fails, you can rebuild the
different partitions in parallel. So the recovery process can often be a lot
faster than the initial process of computing this thing. And that's what makes
this recovery quick as well.
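[Editor's note: to make the graph just described concrete, here is a hedged sketch in Spark-style Scala; F, G, and H are placeholder functions invented for the example, and it assumes a SparkContext named sc. Recovery itself is done by the engine from the recorded lineage, not by user code.]

    def F(line: String): String = line.toUpperCase                   // stand-in for the map function F
    def G(rec: String): String = rec.take(1)                         // stand-in for the grouping key G
    def H(kv: (String, Iterable[String])): Boolean = kv._2.size > 1  // stand-in for the filter H

    val input   = sc.textFile("hdfs://...")   // input file spread across three blocks
    val mapped  = input.map(F)                // apply F to every element
    val grouped = mapped.groupBy(G)           // shuffle records by key G
    val result  = grouped.filter(H)           // keep only the groups passing H
    // None of these datasets is replicated. If a partition of `result` is lost,
    // the engine reapplies H (and upstream steps if needed) to just the parent
    // partitions required, in parallel across the surviving machines.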
So next question with this is how general is it? So we just said we're going
to limit shared memory to these coarse-grained operations. And we found out
that actually, despite the restrictions, RDDs can express a lot of different
parallel algorithms that people want to do in practice. And this is because
just by nature, data parallel algorithms, the algorithms apply the same
operation to many data items at the same time.
So this strategy of logging the one operation that you're going to do to, you
know, a billion items, instead of logging the billion results, makes a lot of
sense in that setting.
And, in fact, we showed that using RDDs, we can express and unify many of the
existing programming models out there. So we can express the data flow models
like MapReduce and Dryad that kind of build a single graph like this, but we
also found some of these specialized models people propose, such as Pregel or
PowerGraph from CMU or iterative MapReduce can be expressed using RDD
operations. By this I don't mean just kind of the incompleteness argument of
like we'll get the same result, but we're also expressing the same data
partitioning across nodes and controlling what's in memory and what isn't. So
it's really going to execute in the same way with the same optimizations.
And we also found that we could do new applications that some of these models
couldn't. So if you look at this in kind of a trade-off space of parallel
storage abstractions, this is what it would look like. So you can have the
trade-off between the granularity of updates the system allows and the write
throughput.
And basically, things like key-value stores, in-memory databases, RAMCloud, allow
very fine-grained updates, but their throughput is limited by network
bandwidth, because they replicate the data. Things like the Google File System
actually, they're not really designed for fine-grained updates. And despite
that, people have run a lot of algorithms on them because of the data parallel
nature of the algorithms. But they're still limited by network throughput.
And RDDs are instead limited by memory bandwidth.
>>: So one thing that this is not showing is the cost -- or the speed of
recovery, right?
>> Matei Zaharia: The speed of recovery, yes.

>>: So if you had [indiscernible] data, you would recover instantaneous?
>> Matei Zaharia: That is true. So here, there will be a cost to recovery.
I'll talk a bunch about speed of recovery later on, though, yeah.
>>: [indiscernible].
>> Matei Zaharia: So you don't have to have RDDs in memory if you don't want.
You can have them on disk. It's still -- or on SSDs, you know, and it will
still save you from sending stuff over the network. So in our system,
actually, the system's designed to spill gracefully to disk and to keep doing
sequential operations if you do that.
>>: I'm a little confused by [indiscernible] by keeping things, it seems like
you're [indiscernible] disk versus keeping things in memory. That's one thing,
versus doing things locally on a single machine versus reducing from many
machines.
>> Matei Zaharia: Yeah.

>>: So you're limited by the network, not by the speed.

>> Matei Zaharia: That is true, yes. You're right. It's actually, it's
mainly for the network that we're doing this. Yeah. I mean --
>>: Seems like that significantly changes the semantics of the computation.
In other words, you're saying it would be faster by not running a distributed
algorithm, just by running an algorithm that's --

>> Matei Zaharia: But it's not going to be local, so we still have operations
across nodes. If you go back to this guy, so [indiscernible] the group-by is an
operation across nodes. What I'm saying is just when you write this, like if you created
this dataset with a MapReduce, for example, you would be writing this to a
distributed file system, and then the next job, like say you didn't know you
want to do a group-by next. You save this result, you wrote it out to the file
system and then you come in and do a group-by.
>>: In the graphics in the next slide, you showed that the network bandwidth was
limiting where your [indiscernible] if you're doing a group-by or something --

>> Matei Zaharia: Then you are, that's absolutely right. This is after you've
got the data grouped the way you want when you actually are doing the write to
this storage system, yeah. So the applications, of course, so yeah, this is
just, you know, after they've computed the reduce function or whatever, they
are writing it out. Applications can definitely still be network bound if they
can communicate. Yeah.
>>: [indiscernible] it would be reconstructing a failed node. Seems like I'm
trying to understand what MapReduce does. Does MapReduce, as you play it out,
also --

>> Matei Zaharia: Yeah, yeah, it's similar. So basically, what we took is we
took that kind of reconstruction and put it in a storage abstraction. So it
persists across things -- across parallel jobs you do on that, rather than
being just on one job. But it's definitely inspired by the same thing.
And I think the cool thing is actually like how we do this with streaming,
which I'll show later, which is -- yeah.
>>: So for jobs that are actually network-bound, does this approach only
produce a latency? Seems like if a job is network-bound, you're reading to a
disk, then you get to a point where you're saturating the network, say.
>> Matei Zaharia: Yeah, it depends on how much intermediate data you have. So
if your network -- if you don't have a lot of intermediate data, you're just
doing a big shuffle, then this isn't going to matter. But we found in a lot of
jobs, even in things like page rank that do a significant amount of shuffling,
and have a lot of state, this can help. Yeah.
>>: So if I understand this correctly, where you're really saving on network
[indiscernible] for intermediate state is by not replicating it [indiscernible]
and over three times?
>> Matei Zaharia: Exactly, yeah. And yeah, and also that can be -- yeah, it
can be a significant fraction of the job running time, yeah.
>>: Can't you, with systems like Hadoop, control the [indiscernible] at the
end? You can say that you only want --

>> Matei Zaharia: You can definitely control it, but then the problem is if
something fails, you're lost. Like Hadoop doesn't keep track of oh, I did this
map function before you rebuild it. So what we're doing is pushing that
information in the storage abstraction, yeah.
>>: You're basically using some form of -- a form of logging to rebuild as
opposed to --

>> Matei Zaharia: Yes, exactly.

>>: And paying more for it [indiscernible].
>> Matei Zaharia: Yeah, yeah, you may have -- although it actually, well, it
depends on what you're computing. But if you compare it to the cost of like
having to always replicate the thing, that's a fixed cost. So yeah, depends a
lot on your failure assumptions, yeah.
>>: A really quick note. There's another metric there, which is failure,
things like mean time to failure. [indiscernible].
>> Matei Zaharia: So there are cases where you want to do replication instead.
Yeah, it's very true. And actually, I'll talk a bit about this in streaming
also. You can still combine this with replication sometimes if you want. One
of the cool things is you can also [indiscernible] replication asynchronously,
because now if you didn't do it right away, you have some way out to recover.
So there's different ways, yeah.
>>: What are the scenarios that you cannot recover from? Give us a scenario
where you cannot recover.
>> Matei Zaharia: Actually, we can lose all the nodes in our system. As long
as the input data on the original file system, like the ones I showed, you
know, this file here is still available, we can recompute everything. So it's
designed so any subset of the nodes can fail.
>>: So data [indiscernible] I see this as incremental computation.
>> Matei Zaharia: Uh-huh, yeah. There's a lot of -- it's definitely, it's
definitely inspired by lots of systems have done this kind of logging. I think
the thing -- I mean, honestly, the thing that's interesting about this is the
applications we applied it to. So when Google wrote the Pregel paper -- there's
a whole paper -- it says they actually don't even have this kind of
fine-grained recovery. They just take checkpoints and they say we're working
on a thing where we think we can do fine-grained recovery.
We implemented Pregel with fine-grained recovery in 200 lines of code. So it's
just, yeah.
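[Editor's note: a hedged sketch of why that can be compact -- one Pregel-style superstep can be written as ordinary RDD operations. The types, vertex program, and import path here are invented or modernized for illustration; this is not the actual 200-line implementation.]

    import org.apache.spark.rdd.RDD

    // vertices: (vertexId, state); messages: (vertexId, incoming value)
    def superstep(vertices: RDD[(Long, Double)],
                  messages: RDD[(Long, Double)]): RDD[(Long, Double)] = {
      vertices
        .leftOuterJoin(messages.reduceByKey(_ + _))                     // combine and deliver messages
        .mapValues { case (state, msg) => state + msg.getOrElse(0.0) }  // run the vertex program
    }
    // Each superstep is a deterministic, coarse-grained transformation, so a
    // lost partition of the new vertex RDD can be recomputed from its lineage.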
>>: But the obvious follow-on is once you have incremental computation, you
can use it for lots of other --

>> Matei Zaharia: That's true, yeah.

>>: So you can change part of the data and --
>> Matei Zaharia: Yeah, we actually haven't done that in our system, but it's
an interesting thing to try to do next, definitely, yeah.
>>: You can't do that [indiscernible].
>> Matei Zaharia: In databases, there are lots of things they can do, okay.
You should never tell database people they can't do something. That's a lesson
I've learned, with all due respect.
Okay. So that's kind of the abstraction. Let me also tell you a little bit
about the system, and then I'll go to some of the things we did next with it.
So we built this system called Spark that implements this, and I just wanted to
show a little bit of how it works. Basically, so Spark exposes RDDs through
this nice and simple interface in the Scala language, which is kind of Java
with functional programming. As Bill describes it, it's kind of like C-Sharp.
So there it is. It's like C-Sharp with stranger syntax. And we didn't -- I'm
not saying we invented this model. So the model is very much inspired by the
API of DryadLINQ, but it lets you write applications in a very concise way.
And one of the cool things we did that I think is unique to our system is we
also allow you to use it interactively from the Scala shell, and it makes for
like, you know, it makes it very easy to explore data.
So this is, you know, kind of some of the syntax where basically, you create
your dataset, you apply transformations like filter, and this funny looking
stuff in red here is Scala's syntax for a function literal or closure, so it's
like lambda X [indiscernible] and then you can keep doing operations on it and
keep building a lineage graph and computing things.
So I wanted to show you this on an actual learning system just so you can see
the kind of things it does. So basically, in this -- I've set up Spark cluster
on Amazon EC2. Let's check that everything is still there. It is. Okay. And
I have 20 nodes and I have a Wikipedia dataset I loaded on this. It's just a
plain text dump of all of Wikipedia, that's 60 gigabytes. So it's not huge,
but it's a thing that would take a while to actually look at on a single
machine.
And I'm just going to show you how you can use this interactively to do things.
So this is the Spark shell. You can do your standard Scala stuff in there.
And you have this special variable, SC, or Spark [indiscernible] that lets you
access the cluster functionality.
So first thing I'm going to do is represent the text file I have sitting in the
Hadoop file system. And this is going to give us back an RDD of strings so
it's a distributed collection of strings. And so we can actually like start
looking at it even without doing stuff in parallel. So there's a few
operations you can do that will just peek at the beginning of the file. So if
I do file dot first, that gives me the first string, and you can see what the
format is like.
So this is a tab separated file. You have article ID. You have the title, and
yet is maybe the first thing alphabetically in this Wikipedia. You have date
modified. You have an XML version, and you don't see the last field, but
there's a plain text field at the end as well. So what you can do is you can
take this and convert it into a form that's easier to work with.
So, for example, I'm going to define a class to represent articles and I'm
going to pull out the title and the text from it. And now I'm going to do some
map functions to turn these lines of text into article objects. So first I'm
going to take [indiscernible] and split it by tabs, and that syntax again is
the same as doing this. So it's like a lambda syntax, basically, it's the
shorthand form.
And then I'm going to filter. So some of these things actually don't have the
last field, the plain text, because they're things like images, so I'm going to
just filter out the ones with exactly five fields, and I'm going to map, I have
this array of fields. New article. And I'll take F0. Say F1 is the title,
and F4. You can't see that, but it exists somewhere there.
So now I have this article object. So all these things happen lazily. It
doesn't actually compute it until it needs to. But I can do stuff like this to
see articles, you know, first article is still and yet and yet. So the last
thing I'm going to do is tell it I want the articles to persist in memory
across the cluster. You can choose which data sits in memory, which one is
just computed ephemerally as you go along. So I'll mark it that way.
So now I'm going to do a question on the whole dataset, and I'm going to count,
for example, how many of these contain Berkeley. And so actually, this needs
to be that the text contains Berkeley, the plain text of the article. And so now
it's actually submitting these tasks to the cluster, and it's going to go on
HDFS. So scheduling the tasks according to where the data is placed and doing
that stuff.
And it goes along, so basically the class article I typed in, the functions I
typed in get shipped to the worker nodes and they got [indiscernible] and this
is kind of the straggler problem, but hopefully it will finish. Yeah, there
you go. So there it is. It's live, it's happening on Amazon. I'm sure the
Microsoft cloud never has stragglers.
Okay. So we scanned this thing and there were 15,000 articles. But it took 27
seconds. Not exactly interactive. So let's try to do it again now. And now
the data will be in memory, because we called persist on it. So if we do it again,
we get back the same thing in 0.6 seconds. And we can ask other questions
now.
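[Editor's note: for reference, a hedged reconstruction of roughly what was typed in the shell; the class name, field positions, and file path are inferred from the narration and may not match the demo exactly.]

    case class Article(id: String, title: String, text: String)

    val file = sc.textFile("hdfs://.../wikipedia.txt")        // tab-separated dump, ~60 GB
    file.first                                                 // peek at the first line

    val articles = file.map(line => line.split("\t"))
                       .filter(fields => fields.length == 5)   // drop rows missing the plain text
                       .map(fields => Article(fields(0), fields(1), fields(4)))
    articles.persist()                                         // keep the parsed articles in memory

    articles.filter(a => a.text.contains("Berkeley")).count()  // first run scans HDFS: ~27 s
    articles.filter(a => a.text.contains("Berkeley")).count()  // second run hits memory: ~0.6 s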
So, for example, the one I like to ask is Stanford. So Berkeley was 15,000.
Let's try this one. And this is only 13,000. There you go. Okay. And so now
I hope no one's from Stanford. Last thing I want to show, this is sort of the
risky part of the demo. So we have 20 nodes in this cluster. So let's try to
get rid of one of them.
So these are the ones. I'm just going to pick a random one, and see a weird
Firefox menu and just kill it. So there you go. So it takes a little bit of
time to shut down, but once we look at it here, eventually it will drop out.
So you can see now there are only 19. And you can see this guy was also
notified that it's lost, and we lost this one. So let's try to do this again
and see whether we get the same answer. And now, at the end, you know, there
are a few -- yeah so you can see at the end it went kind of quickly, but there
are a few -- like the last 30 tasks or whatever on that node were lost, and
those were rebuilt across the cluster.
So this is where I'm saying you can recover pretty quickly, even if a couple of
failures happen, because you do this in parallel. So that's kind of it. And,
of course, now that it's actually in memory again so if we do this again, it's
back to its usual self.
So that's kind of what the system looks like. Okay. So let me see. So apart
from doing kind of searching of Wikipedia, this is also good for things like
machine learning algorithms. So we took a couple of, you know, very simple
algorithms that we run on this, and basically these iterative algorithms
are running a bunch of MapReduce jobs on the same data, and if you share that data
using RDDs, you can go a lot faster.
So depending on how much computing it does -- a bit more computing, that was 30
times faster and this is about a hundred times faster. And other people have
built in memory engines for these algorithms, Piccolo is one, but most of the
engines out there either don't provide fault tolerance or do it using
checkpointing, where you have to periodically save your state out, and that
costs something. So we got similar speedups to what they got, but we have this
fine-grained fault tolerance model as well.
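[Editor's note: the iterative pattern being described is the shape of Spark's own logistic regression example; below is a hedged sketch of it, with an invented data path, parser, and a fixed number of iterations.]

    case class Point(x: Array[Double], y: Double)
    def parse(line: String): Point = {
      val nums = line.split(' ').map(_.toDouble)
      Point(nums.tail, nums.head)
    }
    def dot(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (u, v) => u * v }.sum

    val points = sc.textFile("hdfs://.../points").map(parse).persist()  // cached once, reused every pass
    var w = Array.fill(points.first().x.length)(0.0)
    for (_ <- 1 to 10) {
      // Each iteration is one MapReduce over the same in-memory data.
      val gradient = points.map { p =>
        val s = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        p.x.map(_ * s)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }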
>>: Do you have some results where you have failures during the run?

>> Matei Zaharia: No. That would actually be cool, yeah. I mean, we did -- in
our paper, we have some results with failures. And the same thing happens.
Like the iteration or something fails, takes longer to recover. But I don't
have them on the slide here.
>>: So there's a dissonance between this slide and part of the motivation for
your talk. When you were reviewing your motivation, you were saying that
computation speed, CPU speeds are slowing down and can't keep up with big data.
What this slide would seem to say is the CPU speeds are just fine. It's the
communication and the algorithm that --

>> Matei Zaharia: That's true, yeah. I guess what I meant to say there is
the capabilities of a single machine, yeah. But it's true, many of these
things are not CPU bound. Actually, the thing that is really causing clusters
to become bigger is actually disk bandwidth. Disk bandwidth hasn't gotten very
fast, and so you can buy these disks, they're huge, but it takes like, you
know, to read a terabyte off a disk, it will take you many hours. So you need
to put thousands of disks in parallel. That's actually the thing I think that
really causes this.
So that's kind of the system, and one other thing I want to say is as I said at
the beginning, we wanted to show this as pretty general so we implemented a
bunch of these other models on it as well that people have proposed. We have
these iterative ones that we implemented. Graphlab, if you're familiar with
it, we can only do the synchronous version because that version is
deterministic. And another cool thing we implemented, which actually appears at
this year's SIGMOD, is a SQL engine called Shark. And the story there is, at least in
the database community, there was this kind of debate between databases and
MapReduce. People thought that, okay, well, MapReduce adds fault tolerance
during query execution. Most parallel databases don't have that. But the cost
of the fault tolerance is so high that it's not worth it.
So Shark actually gets similar speedups over Hadoop to what the parallel databases
do. So it can run these queries ten, a hundred times faster, and it
simultaneously has the fault tolerance that you saw before. And the thing
about this also is it's not just a matter of saying, you know, our thing is
more general. It also means applications can now intermix these models.
So, for example, one of the things we're doing in Shark is letting you call
into machine learning algorithms that are written in Spark, and data never has to
be written to some intermediate file system in between. It just runs in the
same engine.
And the final thing, we've been lucky in doing this to also have a growing community
of actual users. So we open sourced Spark in 2010 and in the past few years,
we've really seen a lot of growth in who's doing things with it. So just some
quick stats on that. We held a training camp on Spark in August, and 3,000
people watched online to learn how to use it. We have an in-person meetup in
the Bay Area and we have over 500 members that come there. And in the past
year, 14 companies have contributed code to Spark.
There's some of the companies and universities that have done things with it at
the bottom. So you can find more about that on the website.
>>: Stanford doesn't seem --

>> Matei Zaharia: Yeah, I've shown the demo to some Stanford people in the
past. So that was kind of the Spark part. I also want to talk a little bit --
so that covered the interactive queries and iterative algorithms.
I also want to talk about streaming and this is a systems bit that we did next.
I think we're actually still working on it a bit now. So the question here was
just how do we perform fault tolerant streaming computation at scale. And the
motivation for this is that a lot of big data applications we have today
receive data in real time and see sort of real value from acting on it quickly.
So things like fraud detection, spam filtering, even understanding statistics
about what's happening on a website after you make a change to it or you launch
an ad campaign. And for example, Twitter and Google have hundreds of nodes
that are doing streaming computations in various ways to try to deal with this
data.
So our goal was to look at applications with latency needs between half a
second and two seconds. So we're not looking at, like, millisecond quantitative
trading stuff, but we think this is still pretty good. But we want to be able
to run them on hundreds of nodes.
The problem, though, is that stream processing at scale is pretty hard. It's
harder than batch processing, because these issues of failures and stragglers
can really break the application. So the fault recovery, fast fault recovery
was kind of a nice thing to have in the interactive case. But here, if you
don't do it quickly, you might just fall behind and you've suddenly lost the
whole point of doing a real time computation.
Same thing with stragglers. If a node goes slowly, rather than making a joke
about it in the talk, you're now seven seconds behind where you were supposed
to be in this stream. So there's been a lot of work on streaming systems, but
traditional streaming system designs don't deal well with these problems.
So traditional streaming systems are based on this -- we're calling it the continuous
processing model, and it's a very natural one. But it becomes tricky to scale.
So in this model, you have a bunch of nodes in a graph, and each node has a
long-lived mutable state. And for each record, you update your state and you
push out new records to the other nodes.
So state, by the way, is the main thing that makes streaming tricky. So an
example of state is you want to count clicks by URL to see what percent clicked. So
you have this big table, maybe you partition it across nodes, and everyone keeps
track of counts for a slice of the URLs. And this is how these systems are
set up.
So when you have this model and you want to add fault tolerance, there's two
ways that people have explored, replication and upstream backup. So the most
common one that's done in basically most of the parallel database work and
systems like Borealis and Flux, is replication.
In replication, you send a copy of the input to -- you send the input to two
copies of the processing graph, and each copy does the message passing and
state updating in parallel. There's also a subtle thing in replication,
though, which is that you need to synchronize the copies. And that's due to
non-determinism in a message [indiscernible] across the network.
So, for example, in this one, imagine node one and node two are both sending a
message to node three. Now, at roughly the same time. Now, which of those
gets there first will depend on what happens on the network. But node 3's
state might be different if it got this one before that one. So the copy here
of node 3 needs to know that. And so these protocols, Borealis and Flux, do a
lot of fairly complicated stuff to actually keep these in synch and keep them
in synch even if a node fails and another one comes back and stuff like that.
But even discounting the cost of that, you know, basically, this model gives
you fast recovery from faults, near instantaneous, but you pay at least 2X the
hardware cost.
Okay. The upstream backup model is another one that's been proposed. In that
one, you don't have extra copies. Instead, nodes checkpoint periodically and
they buffer the messages they send since the checkpoint. If a node fails, you have
to bring up another copy of it and splice it into the graph. And this model
has less hardware cost. But it's also slower to recover. In particular, at
high load, the new node needs to not just recover from a checkpoint but also
keep up with the arriving stream. So it can take a pretty long time to
actually catch up with the rest of the system.
And a bigger problem is that neither of these approaches handles stragglers
very well. So in the replication approach, because of the need to keep the
replicas in synch, if one of the nodes is slow, you end up slowing both replicas.
And in the upstream backup approach, you don't really have anything you could
do, except maybe treat the slow node as a failure and then it's expensive to
recover.
So we wanted to design a streaming system that met a set of fairly ambitious
goals to actually be able to do this at scale. So we wanted to have a system
that can scale to hundreds of nodes and has minimal cost beyond just the basic
processing. So no 2X replication or anything like that.
We wanted to tolerate both crashes and stragglers. And we wanted to be able to
attain sub-second latency and sub-second fault recovery.
So the way we did this is by starting with an observation about the batch
processing models, like MapReduce and Spark. So these models actually manage
to provide fault tolerance in a very efficient way without replicating a lot of
stuff because of this deterministic recomputation that we saw. So they divide
the work into small, deterministic tasks, and then if a node fails, they rerun
those in parallel on other nodes.
So our idea was can we just run streaming computations as a series of very
short but deterministic batch-like jobs and then we can apply the same recovery
models but at a smaller time scale. So, you know, our tasks, instead of being
several seconds in length, might be like several hundred milliseconds.
And so it kind of becomes this system optimization problem of just making a
system that does that quickly. And to store this state between time steps,
we're going to use RDDs, which we came up with as a way to store state in
memory.
So that's what we ended up doing, and we call this model discretized stream
processing. And basically, the idea of the model is we'll divide time into
small steps, and the data that arrives in each step is put into a dataset and
basically it's an immutable dataset. And the input data, we have to store
reliably, because if we lose that, we can't go back and recompute stuff. So
this will be replicated. But you probably want to store that data anyway.
After that, you do a batch operation that's deterministic, and you produce new
datasets, and these can either be output or they can be state that you're
going to use on your next time step. And these are stored in memory,
unreplicated, as RDDs, and we can reconstruct them with lineage.
On the next time step, you take your new data, you take your old state, and you
do a MapReduce like computation.
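[Editor's note: a hedged sketch of this model using the Spark Streaming API that grew out of this work; the operator names shown are from the released system, and the source and host names are invented.]

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))         // one-second batches
    ssc.checkpoint("hdfs://.../checkpoints")               // periodic, asynchronous state checkpoints

    val pageViews = ssc.socketTextStream("collector-host", 9999)   // input is replicated on arrival
    val counts = pageViews.map(url => (url, 1L))
      .updateStateByKey[Long] { (newHits: Seq[Long], total: Option[Long]) =>
        Some(total.getOrElse(0L) + newHits.sum)            // old state RDD + new batch = new state RDD
      }
    counts.print()
    ssc.start()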
>>: Seems to me like this state either becomes -- if you lose this state, you
have to reconstruct it, you really haven't gained anything, because you're
either going to have to go back -- suppose it's a one-week window. You're
going to have to construct it from a week's worth of activity, or you'll have
had to squirrel it away somewhere so you can reload it.
>> Matei Zaharia: Yeah. So the way you do it is you do periodic
checkpointing. But the key is you don't need to checkpoint everything. Like,
for example, say you're doing this every second and then every ten seconds, you
store that dataset reliably. You just like asynchronously write it out to
another copy. So that's what we're going to do.

>>: Using the checkpoint to truncate --

>> Matei Zaharia: Yeah, to truncate the lineage.

>>: Is that one of the strategies that you talked about earlier?
>> Matei Zaharia: Yeah, it is, but the difference -- yeah, I mean, the
difference is you don't hold back the whole application when it fails. So
checkpointing in the parallel computing systems usually means if a node fails,
I just kill everything to recover and I go back to the previous
checkpoint. Here, I just recompute the lost stuff and that can be
significantly faster.
>>: So it's almost like you're applying the technique from a streaming setting
[indiscernible].
>> Matei Zaharia: Yeah.
>>: Okay. I'm sorry. I misunderstood. I thought you were doing something
different than streaming recovery. But you're applying streaming recovery to a
batch setting in order to improve the batch recovery?
>> Matei Zaharia: You mean I'm applying the batch recovery to a streaming
setting?

>>: No, the other way, right?

>> Matei Zaharia: I'm not doing -- you mean the application way. Yeah,
well --
>>: You're basically doing a checkpoint to truncate how far back in the
logging --

>> Matei Zaharia: Yeah, absolutely. But checkpoint is different from what I
showed as replication, because checkpointing just means -- so the replication
approach has to keep the state in synch, had to do this synchronization
protocol. Checkpointing just means I have a copy in memory, I'm going to also
send it to another guy and maybe I'm also going to write it to disk. So it's
an asynchronous thing and doesn't require them to be in any way, you know --
yeah.
>>: I had a quick question, it's sort of [indiscernible] the size of the
dataset is going to be for iteration. So can you do something --

>> Matei Zaharia: The size of the state --
>>: Distributed parallelism if the [indiscernible] is small in most cases and
deal with local parallelism in [indiscernible].
>> Matei Zaharia: Oh, I see. So if the stream fits on a single machine, you
can do it. We're specifically targeting things that need to have high degrees
of parallelism, and there's two reasons why. Like either the stream might be
big, for example, you're collecting logs from all the machines in your
datacenter, and each one's logging many, whatever kilobytes per second or
something.
Or the computation might be big. So one of the applications we did was online
machine learning algorithm. It's very CPU-heavy, and it needs to run on many
nodes to actually handle real data. So that's what we're targeting.
>>: What if the computation depends on a series of time steps
[indiscernible] T1 and T2?
>> Matei Zaharia: Yeah, so we don't have -- so I don't have slides on this,
but we do have operators that do that, and basically an operator can go back
and take data from farther back time steps. It's not just the previous one,
yeah. And we do incremental sliding windows where you add current data and
subtract stuff from ten seconds ago. So we do stuff like that. We implemented
-- basically, a lot of the optimizations people did for stream processing and
databases, you can express them this way, because all you're doing is doing the
same thing, but in these little batches, right. Like it's an algorithmic
optimization you can still use, yeah.
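[Editor's note: the incremental sliding window just mentioned looks roughly like this in the released API; a hedged sketch reusing the pageViews stream from the earlier note, where the second function undoes counts that fall out of the window.]

    // 10-second sliding count, advancing every second.
    val windowedCounts = pageViews.map(url => (url, 1L))
      .reduceByKeyAndWindow(
        (a: Long, b: Long) => a + b,   // add counts entering the window
        (a: Long, b: Long) => a - b,   // subtract counts leaving the window
        Seconds(10),                   // window length
        Seconds(1))                    // slide interval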
Okay. So that's -- thanks, these are good questions about the model. So
basically, I talked about this already. So the way we do fault recovery is we
have to do this checkpoint periodically, but it's asynchronous. It just means
that everyone has to write their stuff out to another node, and you don't need
to do it that often because recovery is parallel. We have the same story
as before. Something goes away, now other nodes can work in parallel to
rebuild that.
And so if you look at it compared to previous recovery approaches, we have
faster recovery than upstream backup, but without the 2X cost of replication.
So that's kind of the model.
So the question is we broke this thing into these little batch jobs. How fast
can we actually make it go? And we found that compared to other systems out
there, it can actually go pretty fast. So we were able to process up to 60
million records per second, or six gigabytes per second of data, on 100 nodes at
sub-second latency. So this graph here is showing two applications. This is
just like searching for regular expression, and top K is a sliding window word
count followed by top K.
And they both scale pretty much linearly to 100 nodes and the lines here are
showing if we allow a latency target of one second versus two seconds, how --
you know, how much throughput can we get. And even with sort of sub-second
latency, I think these were around 5 or 6 hundred milliseconds, you can still
get pretty good throughput.
We compared it with a few existing systems.
>>: So in those -- [indiscernible] checkpoint?
>> Matei Zaharia: It's impacted by the size of the windows, the little
windows, and that -- because there's some communication and scheduling costs to
launch these things, yeah.
>>: Checkpoint frequency?
>> Matei Zaharia: The checkpoint frequency is not a huge deal. That affects
recovery time, but the checkpoint frequency can be pretty low, like even if you
checkpoint every ten seconds, like only one-tenth of the windows, you can
recover quickly. I'll show about that later.
>>: Do you have an idea like what the difference would be compared to
[indiscernible] ten seconds?
>> Matei Zaharia: Actually, I don't know. I think we tried a few bigger
windows, but it was a while back so I'm not sure with the current system. I
don't think it's huge. I think maybe there's a difference like maybe up to 50
percent, something like that, but it's not huge. Because when you're getting
to a ten-second window, you're getting into the realm of normal Spark jobs.
Like the machine learning ones I showed, you know, they were doing one
iteration in like one second or four seconds. So it's not a big deal.
Where this is more interesting is where we push it to, like, a multi -- like
this is a three-stage MapReduce job that's happening in 600 milliseconds.
That's where it becomes really interesting.
>>: Once you bring [indiscernible], you tend to need more resources to finish
the same job because your breakdown becomes less efficient?
>> Matei Zaharia: Yeah, exactly. That's what it is. There is this
[indiscernible]. But it becomes an engineering kind of problem, like let's
make a fast scheduler for these things, which is a thing we're happy to deal
with, yeah.
So just to get through what we did here, we compared this to a few existing
systems. So in the open source world, probably the most commonly used system
is Storm from Twitter, which is a message-passing type of system. In
general, we didn't expect to be faster for any reason, but we just wanted to
show that we're in the same ballpark. And compared to Storm, we were actually
faster, depending on what we're doing, between two and four times. But it's
mostly, I think, slightly better engineering than what we did. The point
is it's comparable performance to sort of this real system.
And commercial streaming database systems, they don't have very specific
numbers, but they say, you know, we can do maybe 500,000 or a million records
per second in total for the whole system. And this is -- usually, they don't
really scale out across nodes. But what we did is we did about this many
records per second per node and the results I showed before. But we also
scaled linearly to 100 nodes.
And the other thing is the way these guys do fault tolerance. So Storm doesn't
actually have fault tolerance for state the way we do. It only just ensures
each message will be seen at least once, and these systems use replication. So
we are able to do this while also providing nicer recovery mechanisms.
And apart from speed of computation, there's also speed of recovery, and we
found that even with pretty big checkpoint intervals, we were able to recover
quite quickly. So often, we could do it in less than a second. So this one
here is showing the sliding, like, word count with ten-second checkpoint
interval, and this is just the processing time of each batch of data. And when
we get rid of a node here, that little chunk of data takes about another extra
second to process.
And then there's this window here that because we're doing a sliding window, we
keep going along and taking new data and subtracting data from ten seconds ago,
so there's this window of vulnerability, where we may have to recompute other
stuff. And after that, we're back to normal operation, basically.
One other thing I want to show here is how this varies with the checkpoint
interval and cluster size. So one of the things -- so, for example, we tried
doing 30-second checkpoints on 20 nodes instead of the ten-second that I showed
before, and even with 30-second checkpoints, you can recover in about three or
four extra seconds from what you were normally doing.
And the other cool thing is as you add more nodes to the system, recovery gets
faster. So when we add 40 nodes instead of 20, we're actually recovering about
twice as fast.
So what's cool about this is this is a recovery mechanism where scale is an
advantage, whereas in the previous ones, the synchronization, all that stuff
gets harder with scale.
>>: [indiscernible].
>> Matei Zaharia: It does go up, yes. So that's true. Actually, it's a good
question, on average will it help or not. Maybe on average, you're still
breaking even, actually. Because twice as many faults, but you recover twice
as quickly. That's a good point. Okay. Cool.
Okay. So I have -- so I've had, you know, a decent amount of questions, but if
you guys want to stick around for five minutes, I can talk about this stuff
too. What do you think, Rich? Do you think it's a good idea?
>> Rich Draves: Yes.
>> Matei Zaharia: Okay. So I wanted to also talk very briefly about this, and
we can chat about it after in person. One of the things we did here. So apart
from looking at systems, I do like to look at scheduling algorithms and
policies and try to analyze things there.
So one of the cool problems we looked at here is multi-resource fairness. So
let me just set that up real quick. So basically, lots of computer systems
need to do -- need to divide resources across users. And the most common way
they've done it is weighted sharing, proportional sharing in the operating
system world.
Examples of that are fair queueing on network links or lottery scheduling for
the CPU.
So fair sharing basically divides one resource, like the CPU cycles you have or
the link bandwidth, according to the weights for each user. So, for example,
if the users all have equal weights, it's going to split this -- you know, they
each get a third of it. But the problem we saw, as we were building these
cluster applications, is that cluster applications have very different demands
in terms of multiple types of resources.
So some applications might compete on CPU. Other applications -- I've just been
talking about how it's important to use memory. Some applications might be
bottlenecked on IO bandwidth and so on.
So you can't really do a scheduler for these systems that only looks at one of
these resources or tries to split them up in a fixed ratio. So the question we
had is how can we generalize fair sharing to multiple resources?
Just as an example of this, what you're going to see is say you have a cluster
and you have equal amounts of CPU and memory, 100 CPUs, 100 gigabytes, and you
have -- one user has a bag of tasks they want to run that each use six CPUs and
one gigabyte of RAM. The other user needs three CPUs and four gigabytes. How
much should you give to each one?
So we tried a few policies here, a few of the natural ones, and found a bunch
of interesting problems that don't happen with a single resource. So the
first thing we tried is something we called asset fairness, and the idea there
was just to treat the resources as kind of the same, let's say, currency.
So having one percent of the CPU is the same value as one percent of memory.
And let's try to equalize the users' overall shares.
So with the numbers in the example we had before, you end up with this. The first
user gets six-ninths of the CPU and one-ninth of the memory, the second user gets
three-ninths of the CPU and four-ninths of the RAM, and in total, they each have
seven-ninths.
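As a quick check of those numbers (the algebra here is filled in for illustration; only the demands and capacities come from the example above): if user 1 runs $x$ tasks and user 2 runs $y$ tasks, asset fairness equalizes their combined shares,

$$ \frac{6x}{100} + \frac{1x}{100} = \frac{3y}{100} + \frac{4y}{100} \;\Rightarrow\; x = y, $$

and the CPU capacity constraint $6x + 3y \le 100$ then gives $x = y \approx 11.1$ tasks each, which is the $6/9$ of CPU and $1/9$ of memory for user 1, the $3/9$ of CPU and $4/9$ of memory for user 2, and $7/9$ of total assets apiece.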
But with this policy, even though it's really natural to do it, there's
actually a problem. So the problem is that one of the users gets less than
half of both resources. And we say that this violates what we call the sharing
incentive property. And by sharing incentive, we mean
that, you know, if users contribute, say, equal amounts to the cluster so they
have equal weights, one user should be able to get at least half of -- you
know, of one resource. So they should be at least as well off as if they had
just gone off and built two separate smaller clusters.
And now this guy here, because, you know, the top guy is not using memory very
much, the one at the bottom is getting less than half of both.
One thing we tried in order to fix this, which has its own problems, is called bottleneck
fairness. This is another really natural thing. So you might say let's take
the resource that's most contended and split that equally. So here you're not
contending for memory, so we'll give the second person, we'll let them get more
memory. This, again, looks like a pretty natural thing to do, but
there's actually another problem here, which is that users can start to game
the system. So it's not strategy-proof.
So for example, here the bottleneck was CPU, and user one only got half the
CPU, but he really wants to use CPUs. What that user can do is change the
demand he gave to the system: instead of saying I need six CPUs and one
gigabyte of RAM, he says I actually need five gigabytes of RAM, and he'll get them and
not use them. That shifts the bottleneck to memory, and now this user gets
more of the CPU they actually wanted as well.
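To make the gaming concrete with the example's numbers (this particular arithmetic is filled in to illustrate the claim; it is one plausible way the allocation plays out, not a calculation from the talk): with truthful demands, splitting the contended CPU equally gives each user 50 CPUs, so user 1 runs about $50/6 \approx 8.3$ tasks. If user 1 instead claims $(6\ \text{CPU},\ 5\ \text{GB})$ per task, the claimed memory demands can no longer all be satisfied, so memory becomes the most contended resource and is split equally:

$$ \text{user 1: } \frac{50\ \text{GB}}{5\ \text{GB/task}} = 10 \text{ tasks} \Rightarrow 60\ \text{CPUs}, \qquad \text{user 2: } \frac{50\ \text{GB}}{4\ \text{GB/task}} = 12.5 \text{ tasks} \Rightarrow 37.5\ \text{CPUs}, $$

so by lying, user 1's CPU allocation goes from 50 up to 60 while the extra memory just sits idle.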
So these are problems that just don't happen with one resource. So our
approach was basically to characterize the properties of single-resource fair
sharing that make it nice, and try to come up with a multi-resource policy that
has the same properties.
In a nutshell, the one we came up with is called dominant resource fairness and
the idea is to equalize each user's share of the resource they use the most. So this
user's share of CPU and that user's share of memory are going to be equal. And
we showed that this always has the sharing incentive property above. It's
strategy-proof. There's no benefit to lying about your consumption. And it
has a bunch of other properties as well.
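For readers who want something concrete, here is a minimal sketch of a DRF-style allocator for the fixed per-task-demand setting described above. It follows the progressive-filling idea (repeatedly give a task to the user with the lowest dominant share), but the names, data structures, and example numbers are illustrative rather than code from Mesos or Hadoop.

```scala
// Minimal DRF-style allocator sketch (illustrative only).
// Assumes every task consumes a positive amount of some resource.
case class User(name: String, demand: Map[String, Double]) // per-task demand, e.g. Map("cpu" -> 6, "mem" -> 1)

def drfAllocate(capacity: Map[String, Double], users: Seq[User]): Map[String, Int] = {
  var free = capacity
  val tasks = scala.collection.mutable.Map(users.map(_.name -> 0): _*)

  // A user's dominant share is their share of the resource they use the most.
  def dominantShare(u: User): Double =
    u.demand.map { case (r, d) => tasks(u.name) * d / capacity(r) }.max

  var progress = true
  while (progress) {
    // Among users whose next task still fits, pick the one with the lowest dominant share.
    val runnable = users.filter(u => u.demand.forall { case (r, d) => free.getOrElse(r, 0.0) >= d })
    if (runnable.isEmpty) progress = false
    else {
      val u = runnable.minBy(dominantShare)
      free = free.map { case (r, c) => r -> (c - u.demand.getOrElse(r, 0.0)) }
      tasks(u.name) += 1
    }
  }
  tasks.toMap
}

// With the example from the talk: 100 CPUs, 100 GB; demands (6 CPU, 1 GB) and (3 CPU, 4 GB).
val result = drfAllocate(
  Map("cpu" -> 100.0, "mem" -> 100.0),
  Seq(User("A", Map("cpu" -> 6.0, "mem" -> 1.0)),   // CPU-heavy user
      User("B", Map("cpu" -> 3.0, "mem" -> 4.0))))  // memory-heavy user
// Ends up around 9 tasks for A and 15 for B, so A's CPU share (54%) and B's memory
// share (60%) come out about as equal as whole tasks allow.
```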
And we compared it with a few other policies. One of the things we compared with
is the preferred sharing policy in economics, competitive equilibrium, which
is basically a perfect market. And we found that that actually lacks some of
the properties that DRF has. Yeah?
>>:
Do these properties [indiscernible] for more than two users?
>> Matei Zaharia: Yes, this is for any number of users and, like, n resources. Yeah.
>>: So is the assumption that the user has to specify the resources to one another? I mean, so you say --
>> Matei Zaharia: Yeah.
>>: I need six CPUs and a gigabyte of memory.
>> Matei Zaharia: It's assuming the user's demand is in a fixed ratio. So it's not
like -- it's definitely not a general thing. If a user has two types of
things, it's not going to cover it. But we started with this case where they
have a fixed ratio, and we wanted to come up with something for that. Yeah.
>>:
How does this work compared to [indiscernible].
>> Matei Zaharia: Good question. Actually, I'm not sure exactly what she's
done lately. I saw what she was doing in the past, but I think one of the
differences in her work was that she did mostly like cache resources on, like,
buffer cache within applications and looking at like hardware mechanisms to
even enable that to happen. I think the policy is something you might use
there. I don't think they looked at this problem of just what [indiscernible]
policy is. They looked at what's the hardware mechanism. I might be wrong,
though, because I haven't looked at it in a while.
>>: What about the resilience of [indiscernible].
>> Matei Zaharia: Yeah, so actually, yeah. Right. So this is actually
resistant to collusion as long as each user uses some of each resource. I
think if some users have zero demand for a resource, then you can cheat.
So we didn't prove that, but one of the interesting things that happened is that
a bunch of economists actually looked at this afterward and tried to look at
other properties, and that's one of the ones they pointed out.
>>: So this seems to assume that there is a dominant resource. That --
>> Matei Zaharia: Oh, yeah. Per user.
>>: Which if you've got a large enough memory -- if you can give me the
full memory on a machine, then I may not care about network bandwidth, and so
memory is my dominant resource. But assuming that's never going to happen, then network
bandwidth may be my dominant resource.
>> Matei Zaharia: Oh, so you're saying your application might -- yes, your
application has different ways of running based on the resources.
>>:
It has different efficiencies based on different --
>> Matei Zaharia: Yeah, we didn't deal with that yet. So there's lots of ways
to try to generalize this, and actually I'm interested in doing some of it.
We've done a bit already. But yeah, we didn't deal with that.
>>: Where do you say this thing applies? I mean, is this at the level of a single machine?
>> Matei Zaharia: Yeah, so we applied it in a cluster scheduler, Mesos, which
I didn't talk about. One of the cool things is actually the Hadoop team is now
applying this in Hadoop, independently implementing this. And we had this
paper at Sigcomm last year where we applied it in software defined
[indiscernible] and middleboxes also. So where you have flows that go through
different modules, like intrusion detection, they might stress different
resources.
>>: So this is after you've already [indiscernible] resources and now you want
to share --
>> Matei Zaharia: Exactly, the model is you have users within an organization.
It's not an economic thing where you're paying for resources.
>>: Then perhaps some of these issues about cheating and lying, some of those things probably --
>> Matei Zaharia: Yeah, so actually the cheating thing came because we saw
people doing this in real clusters. So for example, one interesting story
there from Google, they used to have this policy that if you have utilization
above a certain level, they'll give you dedicated machines. They found users
would actually add like spin loops and things.
So there was a similar thing with Hadoop users. So users get very creative.
One of the Hadoop things, like, Yahoo built these 3,000-node Hadoop
clusters, and users wanted to run MPI. So people would, like, write a map function
that runs MPI in it, and you can imagine that really messed up the networking
and the data locality for the other jobs.
>>: How does this compare to a natural market, since you assign a price per
unit to each resource.
>> Matei Zaharia: That's a good question. So this is kind of what the
competitive equilibrium does. So competitive equilibrium is if we had a
perfectly competitive market, what would it allocate? Now, it's not the same
as users actually bidding, but the problem is with users bidding, it becomes
very complex for the users to do stuff.
So, but competitive equilibrium is kind of the outcome if they did bid in a
perfectly competitive market. The problem is the assumption that the market
is perfectly competitive, which begs some questions.
Okay. Cool. So I'd like to talk more with people about this after. So let me
wrap up. So just one other thing I wanted to say. So I do like to build sort
of real applications in real systems. And working in this space, especially
because it's such a new space, I've tried to open source things and I've been
lucky to have people actually try to use some of these.
So I talked about Spark and Shark, but some of the other systems that I've
worked on have also been used outside. So Mesos cluster manager is actually
used at Twitter to manage their nodes. They have over 3,000 nodes now that
they're managing. DRF is being independently implemented in the Hadoop 2.0
design. LATE, an algorithm for straggler handling, is also in Hadoop. Delay
scheduling, which is a thing I did for data locality -- I actually wrote one of
the most popular schedulers for Hadoop, called the Hadoop Fair Scheduler, as
part of that work. And it's still being used today at places like Facebook and
eBay and so on.
And finally, the thing I've been working on with folks here, the SNAP sequence aligner --
which is a really fun but totally different thing, looking at gene sequencing --
has actually started to be used. There's a group at UCSF that's been using it to
try to build a pipeline to find viruses faster.
So basically, one of the things I really like about this space is that
there is room to actually build things that people will use. Just to
summarize, you guys already know the big data systems problems, but I
hope I've shown you some of the problems that can happen and some of the
research challenges.
I've talked about these two things. I've talked about this way of dealing with
faults when you have coarse-grained operations, which is very common in data-parallel
algorithms, and we've applied this to both batch and streaming. And
I've also talked about the multi-resource sharing problem, which is one of the
things you can look at in these clusters. So that's it. I'd be glad to take any
more questions.
>>:
[indiscernible].
How did you make sure that there was [indiscernible].
>> Matei Zaharia: So the user -- basically, so we don't enforce the
determinism. If you happen to call a library that's not deterministic, like
you call [indiscernible], but we do provide ways -- so the operations we
provide, as long as you pass a deterministic function, it will produce the same
result.
And, for example, for things like random number generation or sampling, we have
a sampling operation, we just seed that with this sort of task ID so that it's
always the same ID --
>>:
So you [indiscernible].
>> Matei Zaharia: We just -- it's a bit simpler. We just seed the random number
generators, things like that. But if users are doing stuff that's
nondeterministic, we don't catch it. It would be interesting to try to enforce
that or even to detect it. We do have some work now that will detect it by
just checksumming the output and seeing whether it's the same. But it won't
tell you until you've already messed up.
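As a rough illustration of the seeding idea (my own example; Spark's built-in sampling operator handles its seeding internally, and the app setup, dataset, seed, and fraction here are made up): deriving each task's random seed from the partition index makes the output reproducible if a partition is lost and recomputed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SeededSampling {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("seeded-sampling").setMaster("local[*]"))
    val data = sc.parallelize(1 to 1000000, 8)
    val baseSeed = 42L

    val sample = data.mapPartitionsWithIndex { (partitionId, records) =>
      // Same partition index => same seed => same "random" choices on recomputation.
      val rng = new Random(baseSeed + partitionId)
      records.filter(_ => rng.nextDouble() < 0.01)   // keep roughly 1% of records, deterministically
    }

    println(sample.count())   // same answer even if partitions are recomputed after a failure
    sc.stop()
  }
}
```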
>>: I didn't catch, when does an RDD die? When do you not use it anymore? Do you just get rid of it?
>> Matei Zaharia: Good question. We have -- you can actually set what happens.
There's different storage levels. So one is just drop it and if I need it,
I'll recompute it again. The other one is [indiscernible]. And we actually
have like an LRU cache in there. So each node keeps accumulating data while it
can, and then things drop out at the end.
>>: So if you are keeping it stored, then you have to assume that writing to memory
is effectively at the cost of writing to disk, because in the long term, you're
going to have to build something up.
>> Matei Zaharia: Well, that's only true if you use the storage level where
they spill out to disk. So by default, actually, we just drop it. And if we
ever come back to it, recompute it.
>>:
When you're done with it, you have the persistent number that you know.
>> Matei Zaharia: Yes, but all that means is that normally, like in my demo, I
made all these intermediate data sets. Each time I do a map or filter, it's
another RDD. But by default, they're not even saved to memory. They're just,
you compute it in kind of a streaming fashion and then you drop it out once
you've got the result.
>>:
[indiscernible].
>> Matei Zaharia: So on the ones you call persistent. So basically, RDD is
just a recipe for computing a dataset. And if you mark it as a thing you want
to keep around, it stays. It's kind of a weird thing. But it's -- that's just
the programming model that made sense, because people build up a dataset out of
many transformations, and they don't want to save the intermediate result.
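A small sketch of that model as it looks to a user in the Spark shell (the persist/unpersist calls and storage levels are the standard API; the dataset path and the pipeline itself are made up for illustration):

```scala
import org.apache.spark.storage.StorageLevel

// In the Spark shell, `sc` is the already-created SparkContext.
val errors = sc.textFile("hdfs://example/logs").filter(_.contains("ERROR"))   // just a recipe so far
val fields = errors.map(_.split("\t")).persist(StorageLevel.MEMORY_ONLY)      // mark it to keep in RAM
// StorageLevel.MEMORY_AND_DISK would spill to disk under memory pressure instead of dropping.

fields.count()      // first action materializes the RDD and caches its partitions
fields.count()      // reuses whatever cached partitions are still in memory; the rest get recomputed
fields.unpersist()  // done with it, so release the space explicitly
```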
>>:
[indiscernible].
>> Matei Zaharia: To some extent, but it's still expensive to just allocate
space and write it. So, for example, if you do a map and then another map and
neither of those is being persisted, we do it one record at a time, so we do it in
a pipelined fashion, and it's just better for cache usage and memory bandwidth
and stuff like that.
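The "one record at a time" point can be seen with plain Scala iterators, which is roughly the mechanism inside a task (a toy analogy, not Spark's actual internals):

```scala
// Two chained maps over an iterator are evaluated lazily, one record at a time,
// with no intermediate collection allocated in between.
val records   = Iterator.range(0, 5)
val pipelined = records.map(_ * 2).map(_ + 1)   // nothing is computed yet
pipelined.foreach(println)                      // each record flows through both maps in turn
```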
>>:
Is this meant to replace intermediate data between the mapper and reducer?
>> Matei Zaharia: It doesn't actually change the way that works. It's really
more between MapReduce jobs. So between a mapper and reducer, we still -- in
our system, we still have the maps actually write out a block. And again, it's
in memory first and then it goes to disk. And we don't actually push -- yeah.
>>: [indiscernible].
>> Matei Zaharia: It actually wasn't. Mostly, it was -- so one of the things that
can be a problem is network bandwidth. So, for example, when we were
receiving, I said we received like I don't know how many gigabytes per second.
Part of the problem is you have to replicate the input data also. And that can
actually really be a bottleneck. So that was one thing.
The other thing that will happen in large versions of this now is eventually,
the scheduling will become a bottleneck. So I think a cool, like, future topic
would be how can we make this scheduling faster, and can we even do things like
decentralized scheduling or work stealing, where it doesn't need to be done by
one node.
>> HOST: Any other questions? Okay. Thanks.