>> Sharad Agarwal: All right. Thank you for coming. We will get started. It's my pleasure to
have George Porter here. George is from UC San Diego. He got his PhD back in 2005-
>> George Porter: 2008.
>> Sharad Agarwal: 2008, sorry, from Berkeley; and he's spent the majority of his time since
then at UCSD. You know George from his data center work. He's been doing a bunch of work
with a bunch of fabulous students at UC San Diego. He's going to tell us about that. George
also took over pretty much all of the data center research from Amin Vahdat when he
left and went to Google. And George is also helping manage and run the Center for Networked
Systems at UCSD. Awesome. Thank you.
>> George Porter: All right. Thanks Sharad. And thank you all for the opportunity to come and
give you a talk today. It's a real pleasure to be here. I don't think I need to tell this audience
this, but please ask questions during my talk if you have any questions or comments.
All right. So my name is George Porter. I'm going to be talking today about building
resource-efficient data-intensive applications. So it's incredible how much of our lives has
moved online from information gathering, entertainment, collaborative tools, healthcare,
medicine, government, effectively everything we do in some way has moved online; and each
of these applications is fundamentally driven by data. The quality of your user experience using
all of these sites depends on, in some way, the quantity of data that each of these applications
can process. And it's not just that these things are driven by data. It's sort of like if you
think of the way Amazon uses data to build, say product recommendations, Spotify uses data to
build custom radio stations, Bing uses data to build personalized search. And it isn't just that
they're data driven but they’re data driven on a per user basis. So, for example, if we look at
the way Amazon works, when you visit the main sort of landing page on Amazon.com, the page
that gets generated is customized to you. And, in fact, each time you access this page there's
over 100 underlying applications that are doing things like collaborative filtering, consulting
with ad networks, your previous purchase histories, preferences, etc. And each of these
applications is driven by data. And so there's an enormous amount of data processing and IO
that goes on ahead of time, behind the scenes, before you arrive at these pages in order to
generate all the data that's needed to consult during that request so that this content can be
customized to you.
And so all of these different applications and data processing requirements have driven the
need for very large data center deployments. So in order to scale up to meet the needs of all of
these different users companies like Microsoft, Google, Facebook, and others have developed
these large data centers which are warehouse scale buildings housing sort of tens to hundreds
of thousands of servers, storage, networking gear, cooling power, etc.
And the result of these data centers has been an incredible amount of scalability. And so, for
example, Google served out 100 billion searches per month, Facebook has 1 billion active users,
Amazon has, actually is closer to about 200 million users today. And this incredible scalability
result has had to, has been arrived at in an incredibly short amount of time. So if you, actually
last month was the 25th anniversary of the web; and about 20 years ago the first sort of
mainstream web browser was released. And back then really everything was basically small-scale. So Google's first data center fit on a folding card table, and supposedly the first Facebook
server was a sort of run out of a dorm room type environment. And so those numbers that I
just quoted to you, 100 billion searches a month and 1 billion active users, have had to be
developed in about 15 years and 10 years respectively. And I think even at established
organizations and established companies you're going to see similar scaling results for anything
that’s kind of user-facing.
And so these organizations have really had to be driven by a relentless focus on scalability. And
so in order to close the gap between no users and effectively the Internet connected world's
population in on the order of 10 years, everything that's been developed has had to focus on scalability.
So the data centers I described, the applications that run in those data centers, the
infrastructure that underlie all of those applications, the storage infrastructure, all of that is
really driven to be able to grow as fast as possible and effectively grow at any cost. And the
costs are incredible. I don't have to tell you that there's enormous capital expenses in terms of
building each of these data centers. And so the way you can think of that is of course that
every time you want to roll out a new application or every time you want to grow to a new set
of users you have to stamp out effectively one of these billion-dollar buildings.
But they’re also incredibly expensive to operate as well with sort of industry estimates at tens
of billions of kilowatt-hours, kind of industry-wide. And yet, underlying all of these impressive
scalability results lies an enormous amount of inefficiency. So again, industry estimates about 6
to 12 percent of power actually gets translated into productive work. And the question
becomes a sort of why does that gap exist? Why is there all of this inefficiency in terms of
these data intensive applications? And one of the main sources of inefficiency really comes
down to IO. It’s really about input, output.
You can think of this in terms of IO bottlenecks between distributed applications and the
underlying data that lives on the storage layer underneath them, and there's also enormous
amounts of bottlenecks in between nodes sort of in a distributed cluster shuffling between
each other. And these bottlenecks result in servers that end up waiting for data. So this is
referred to as like a pipeline bubble or a pipeline stall wherein one node is waiting for another
node to complete or before it can make progress it has to wait till that data arrives from that
other node. And so this can cause these cascading performance failures wherein large-scale
systems end up spending a lot of time waiting on data.
This can also kind of manifest in terms of requiring a much larger compute and storage
footprint than you would otherwise need if you just looked at the amount of processing needed
to make these applications work. And so what we really need to do is to focus on recapturing
IO efficiency, and this boils down to kind of a very simple application of Amdahl’s Law which is
if we look at data-intensive applications, IO is really the bottleneck, so we need to
eliminate any unnecessary IOs that we can, and for those IOs that are necessary we want to
make sure they are as efficient as possible. So the rest of the talk I'm going to talk about this in
two different domains which I'll get to in a second.
So stepping back for a second, if we kind of look at the last 25 years we've really been focusing
on this goal of scale and achieving systems that are able to scale, and as we kind of pivot and
look towards the future it's important that we develop systems that are able to scale efficiently
in order to deal with growing user populations and growing data set sizes. And so the work that
I'm going to describe falls into these two domains. The first is on IO efficient data processing;
and I'm going to talk about some work that my students, myself, and my colleagues have
worked on in terms of building very large scale efficient sorting systems and using those to
build large scale data processing systems. And the second domain is on the node to node,
getting rid of node to node bottlenecks in the system, so focusing on IO efficient data-intensive
networking. And we've been looking at data center interconnect designs that rely on circuit
switching in addition to sort of more traditional packet switching models and combining those;
and we've been able to show that this approach will allow you to sort of cut the cost of your
network infrastructure by 2.8X and the power by 6X. And I'll go into these in a little bit in a
second. Yes.
>>: Are there some numbers to show IO is kind of a big problem in [inaudible]?
>> George Porter: So are there numbers, is there a quantitative way to show that IO is
important?
>>: The IO is, let’s say the big problem, one of the big problems in the [inaudible]. Because you
can imagine, for example, [inaudible] rapidly-
>> George Porter: Yeah. So that's a great question. There are. So when we sort of, I guess you
could look at this in two different ways, and one of them, like you said, is to make systems
somehow power proportional so the amount of power they draw matches the kind of resource
utilization that they are in. And there’s definitely work in that, and that's an important line of
work to do; but it seems like, at least in today's systems even if you, either things aren't very
power proportional, so there’s a lot of overhead, a lot of costs associated with keeping all the
machines running, and so you're really better off trying to drive as much throughput through
your system as you can and sort of max out all of your hardware. But that's a sort of a more
efficient I guess point in this space. And I hope that in kind of describing this work, I'm going to
go through some quantitative analysis of where some of these bottlenecks are, and I think I'll
answer your question that there is in fact, you can actually see where those bottlenecks are.
Okay. I want to talk first about IO efficient data processing, and I want to start by kind of
defining what I mean when I say data-intensive. I've used that term a couple of times already.
Over the last 30 years or so the definition of what makes something data-intensive has changed
quite a bit. So sort of the mid-80s it might be, let’s say 100 megabytes is a data-intensive job,
and today it's maybe 100 terabytes or even a petabyte. And so over this 30 year time span
there's been effectively a million-fold increase in what we mean by data-intensive. And so over
that time period the types of applications that have been used to solve these jobs has changed
quite a bit. And so today if you talk about data-intensive computing a lot of times what you
mean is, for example, MapReduce. This is a representative example of the sort of a data-intensive framework for doing processing. And MapReduce is actually not a new idea. If
you're a Lisp programmer you've been using it for some time, but if you're not familiar with it I'll
really briefly describe it right now.
So in the MapReduce program you're given this input set of key value pairs and you start by
applying a user-supplied Map function to each of these pairs. You then group the results of that
function application by key, you sort each group, and finally you apply a user-supplied Reduce
function to each of these groups. So if we kind of zoom into what's going on in the
implementation of this programming model we see that the application of the Map function
and the application of the Reduce function, which I'm going to call Map tasks and Reduce tasks,
this is what's called embarrassingly parallel meaning that we can execute these functions
entirely node local without any network communication. So what we are really left with in
terms of building a MapReduce framework is exactly this group-by and this sort operation. And
in a lot of ways this is really the hard part about building these large scale systems because it's
almost exactly opposite of embarrassingly parallel. Generally speaking, data from each of these
nodes has to be shuffled, conveyed, and delivered to each of these destinations, and the
application of these functions really is bottlenecked based on all of these IO’s completing. And
so managing and dealing with all of this IO is an enormous challenge.
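To make the programming model concrete, here is a minimal, sequential sketch of the dataflow just described: apply the Map function, group by key, sort each group, then apply the Reduce function. The word-count Map and Reduce functions are only illustrative placeholders; a real framework runs the map and reduce steps in parallel across nodes and the group-by step becomes the network shuffle discussed above.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Apply the user-supplied Map function to every input key/value pair
    # (embarrassingly parallel in a real framework).
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)

    # Group by key and sort each group -- the step that forces the
    # all-to-all shuffle in a distributed implementation.
    results = []
    for out_key in sorted(intermediate):
        results.append(reduce_fn(out_key, sorted(intermediate[out_key])))
    return results

# Illustrative word count: map emits (word, 1), reduce sums the counts.
def wc_map(_, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return (word, sum(counts))

print(run_mapreduce([(0, "a b a"), (1, "b c")], wc_map, wc_reduce))
# [('a', 2), ('b', 2), ('c', 1)]
```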
And we're not the first people to really identify that it's a huge challenge. Actually, in the mid-80s, the now late Jim Gray wanted to focus people's attention on the importance of the IO
subsystem when building data processing systems. So looking beyond sort of how many
floating point operations per second and starting to look at a holistic view of the system where
you see kind of IO's as a part of this. And so the way that he did this was actually really cool. So
he proposed a contest, a sorting contest, in which the idea was to see who could sort 100
megabytes of data the fastest. And this was great for two reasons. One of them is that we all
learn about sorting sort of freshman year of college, and so everyone involved in these industry
and academic efforts kind of had a good sense of what it meant to sort data. The second
reason this was really cool is that across a variety of different data processing applications we
now have some representative benchmark or stand-in for what the IO performance, what the
resource efficiency of these applications is. So we can kind of do an apples to apples
comparison.
So obviously things have changed quite a bit since the 80s. So by the late 90s this had grown to
a terabyte, a TeraSort record or contest which was won by the Berkeley NOW Project. And then
when we started our project in 2009 the idea was to sort 100 terabytes of data. And so
this was held by the Hadoop open source MapReduce Project that was being hosted at Yahoo.
So now we've got this really great benchmark of IO efficiency and I was talking about how
systems in practice are not efficient. And so now that we’ve got this benchmark let's see how
deployed systems do in practice. Yeah.
>>: So the thing about the [inaudible] is true with the sort, but aren't there many other jobs
where [inaudible] be part of the data since that is the bottleneck?
>> George Porter: Yeah. So that's a great question too. The kind of selectivity of that Map
operation, it can vary quite a bit. Obviously if you’re searching for data in a very large data set
and you're looking for kind of a needle in a haystack the output of that Map will be a small
data set. I guess the thing about sort is that the selectivity of that is one to one. Every input
record becomes an output record, so it's kind of a worst case from the point of view of IO performance. And so for
a lot of jobs that are low CPU to data item ratio it looks a lot like sort. And so that's why we've
been kind of focusing on it is because often times what you're doing is comparing items,
ranking items, things like that that don't require a lot of CPU per item but you do end up having
to convey all this information somewhere else.
>>: [inaudible] some modification for which how much is 1 to 1 [inaudible]?
>> George Porter: So that's a good point too. I'm trying to think, so there are, you can find
evidence of this in various published pieces of work and it varies depending on the
organization. I think the audience in this room would have a much better sense of what exactly
that CDF looks like than I would. So I would love to chat with you about it to the extent you can
talk about it.
Okay. So, great. Now we've got this benchmark of what we mean by resource efficiency and
let's see how well the deployed systems do in practice. So in 2010 these two researchers from
HP labs looked at the results of that GraySort contest and they looked at all the winners, and the
way they analyze that data was as follows, what they did was they took the delivered
performance of each of those sorting systems and they compared it to the inherent capabilities
of the underlying hardware platform and they looked at that difference. And the results were
surprising and incredibly discouraging.
So on average, 94 percent of disc IO was idle, and about a third of CPU capacity was idle. And
this is among the winners. So this is definitely not good. And if we kind of look more
specifically at this 2009 Yahoo result they were able to sort 100 terabytes of data with 3452 nodes
in about three hours, which is quite impressive, but if you actually kind of look at what each
node is contributing to that overall result what you see is that each of the discs in the system is
in some sense running at approximately 1 percent efficiency. So we are in this situation where
we are able to achieve these very impressive results via scalability but we are running at sort of
low efficiency. And remember, these data centers are incredibly expensive to build and
operate, and so the goal that we kind of like to set out is to be able to achieve the same data
set size result in the same amount of time but with effectively [inaudible] magnitude, fewer
resources.
>>: This measurement seems strange to me in the sense like I'm not sure [inaudible] your
hardware, especially in the workload, it seems like you always have some resource that is not a
bottleneck and [inaudible] utilized. So in that sense what is this measure really showing us?
>> George Porter: So what we'd like to get is a fully, perfectly balanced system, right, where
all our resources are balanced with each other so that no one resource is, or I should say if we
were to reduce the amount of any one resource the entire system should slow down. That's
some goal that's implicit in the work that I'm describing. Now, as you point out, there's a lot of
heterogeneous jobs. And so in one set of workloads or one set of jobs you may end up with
one resource as the bottleneck and then some other particular type of job you might end up
with a different resource that’s the bottleneck. I do think though that focusing on storage as
the bottleneck is the tack that we've taken because a lot of systems really are storage IO
limited, and so using this as a way to start solving some of the system’s problems of getting the
storage IO up is one of the goals that we've had. I think that can be good.
>>: When you use storage IO limited you mean that they are a major source of inefficiency in
the sense that it is idling? Is that what you mean by that?
>> George Porter: So, yeah. Either they’re idling or they’re fully being utilized and there's not
enough discs to actually keep the workload, to keep all the CPU’s busy, or third, they're fully
being utilized but there are extra IO's being issued that aren't necessary. That’s another way
you can look at it. So I hope that in some way addresses your question.
>>: Another metric could be [inaudible] sort of [inaudible] storage or CPUs [inaudible] standard
servers scheme for your 100 terabytes for X [inaudible]. Would you still have similar problems?
Because that's [inaudible] hundred percent as long as I optimize for the metric under some
standard to keep them [inaudible].
>> George Porter: So I think what you're saying is sort of that you’ve sort of settled in a certain
sense on a binding of compute, memory, network, and storage, and then you're sort of
replicating this unit to the data set size that you need, and I think that's the way people build
real systems, right? You sort of provision a server model, you kind of scale that out and that's a
cluster that you build, and what I would say is that inherent in making that binding of compute,
storage, networking, and memory you already have an idea of what that balance is between
CPU and IO. Either IO to the network or IO to the storage. So to a certain extent you've already
kind of had this sense that there's some ratio and that's why you build a platform in a certain
way. And so when we started this work, and I didn't describe it in the slides here, but we knew
that the types of jobs we were going to be working on were low CPU to IO and so we wanted
to use servers that had as many discs as possible in them. So at the time we were able to get
machines that had 16 discs, but if we could have had 25 discs that would have been even better. Absolutely.
>>: [inaudible] you said that your goal is to build a systems [inaudible] balance [inaudible]. Any
one of these [inaudible] impacts the same performance. I don’t feel that's a good [inaudible]
because resources don’t cost equal. So suppose this IO was much cheaper than CPU, I would
be happy [inaudible] as long as CPU [inaudible]. So, for example, PennySort was essentially
geared towards students [inaudible]. So why don’t use that as a goal frame [inaudible] data at
the least possible price?
>> George Porter: We actually get to that. I think I'm going to revisit this because one of the
things we were focused on was per node efficiency and this sort contest, you mentioned
PennySort, there's also like a JouleSort contest that we entered as well, and I guess what we
found is that the reason that we've been focused so much on disc IO is that otherwise you end
up with all of this CPU in memory that's sort of waiting on basically on discs. And you need a lot
of discs whenever you have data-intensive applications just because of the capacity issue, and
so what we found was that by kind of driving up the efficiency of that resource we end up
getting energy efficiency. And I'll describe that in just a minute.
>>: [inaudible] define efficiency[inaudible].
>> George Porter: Yeah. So, again, when I talk about the evaluation I’ll mention this is a little
bit, the sorting contest is not just an absolute performance contest. There are these different
categories. And I get to this issue so there is work done per watt and there’s work done per
dollar or for penny and there's simply who can do the work the fastest. And those aren't
always, they don't always lead you to the same system design. A case in point of this that’s kind
of interesting is that for this eco-sort, for this JouleSort contest there’s kind of like two solutions
to this equation, and we've got one of the solutions and Dave Anderson’s group at CMU has the
other solution. And so we've been focusing on if we can build a system that could just handle
raw throughput what we end up with is even though our servers are 300 watts each and we've
got 10 gig networking and stuff you end up with a very highly efficient system. On the other
hand, you can focus on atoms and things like that and get a different solution. It just depends
on your assumptions. Absolutely. So I'll expose that in just a second.
And we've been talking a little bit about this balance issue, and I just want to mention right
now that to a certain extent this project is in the context of a larger set of work on
looking at trying to come up with the right balance of these different resources in terms of
addressing different types of jobs. And one aspect of balance has to do with the data itself, so
in an ideal world we'd like to divide our cluster up into these groups and we would like to
distribute our data in a very uniform way across each of these nodes, and then we've got a
variety of processing elements on each node, which I'm going to represent with these funnels,
which represent in some sense a CPU or disc type resource. And if we are able to kind of very
uniformly distribute all this data on all these resources then all of the data can be processed
uniformly and we don't end up with a lot of pipeline stalls because data gets generated as it’s
needed. And that's a very efficient way to design systems. Of course the real world is not as
kind to us as we would like it to be and data can be incredibly non-uniform.
So if you look at, for example census data, Seattle is going to have a lot more entries in it than
Driftwood, Texas or something like that. And so this imbalance can end up causing some of the
nodes to become bottlenecks which causes this cascading ripple effect. But resources are also
highly heterogeneous as well. So some discs are faster than others, but even if you bought all
exactly identical discs all with the same part numbers, you’re going to end up seeing this very
wide variance in delivered performance based on just the fact that you have so many of these
resources put into a single cluster. And one of the things is that this imbalance, one of the
effects of this imbalance is you end up with not just inefficient IO but actually wasted IOs.
So the thing that's interesting about sorting is that any external, there's this well-known lower
bound which is that any external sorting algorithm requires that you read and write each data
item at least twice in the worst case, and what we say is that any system that actually meets
that lower bound has this two IO property; and that's one of the goals that we set for ourselves
when we started this work.
Now, that imbalance that I just showed on the previous slide can result in extra reads and
writes that aren’t necessary, and this is due to what’s called intermediate data materialization,
meaning that you don't have enough memory to, for example, keep your entire working set in
DRAM and so you end up issuing reads and writes to process that data iteratively. And that's
what we mentioned before about you can have your discs running at 100 percent sort of load
even though you end up with extra IO’s that are cutting that effective load down to something
like one percent. So it's not that in that Yahoo cluster the discs were only being operated at
one percent of the time, it’s just that one percent of their performance got delivered into the
aggregate performance of the system. And just like the data can cause imbalance the
imbalanced discs can lead to this exact same problem for the reason I just mentioned.
So we’d like to restore balance; and we do that in two ways, statically, before the job begins,
and then at runtime. So we are going to borrow techniques from the database community to
sample our data to get a sense of where these partition boundaries are going to be. So this is
research, these are things that databases do all the time, and that's how we figure out what
these partition boundaries are. The key thing is that at runtime we still need to impose bounds
because even if our data has been statically allocated correctly into these partitions, because
the on disc layout of the data can have non-uniform data in it we have to handle that at
runtime, and that's what I'm going to describe in this part of the talk right now.
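As a rough sketch of that static step, something like the following picks partition boundaries by sampling keys and taking quantiles. The sample size here is an arbitrary placeholder rather than anything TritonSort specifically uses, and the real system samples records from discs across the cluster rather than from an in-memory list.

```python
import bisect
import random

def choose_partition_boundaries(keys, num_partitions, sample_size=10_000):
    # Sample the keys and take quantiles of the sorted sample, so that each
    # partition should receive roughly the same number of records, assuming
    # the sample is representative of the full data set.
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]

def partition_of(key, boundaries):
    # Binary search over the boundaries to find which partition a key falls in.
    return bisect.bisect_right(boundaries, key)

keys = [random.randint(0, 1_000_000) for _ in range(100_000)]
boundaries = choose_partition_boundaries(keys, num_partitions=2500)
print(partition_of(42, boundaries), partition_of(999_999, boundaries))
```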
Okay. So we built a 2-IO sorting system called TritonSort that we presented at NSDI 2011, and
it is structured as follows. So instead of many fine-grained tasks processing the data in a divide
and conquer approach we have two phases of operation. So in the first phase, the distribution
phase, we divide our data up into these partitions based on those samples, and then we read all
of our data in parallel and we assign each data item to one of these partitions, and in phase 1
we send it over the network to the node it belongs to and we store it in one of these on-disc
partitions. So at the end of phase one all the data is on the right node and it’s in the right
partition, but each of these partitions isn't sorted. And so in phase 2, in parallel across the
cluster, we read in each of these partitions, sort it in memory, and write it back out again.
We've also sized our partitions so that we can ensure that each of these are going to fit into
memory.
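Purely as a single-node illustration of those two phases (not TritonSort's actual implementation, which is a pipelined, multi-node C++ system), the structure looks roughly like this, reusing the sampled boundaries from the sketch above:

```python
import bisect
import os
import pickle

def phase1_distribute(records, boundaries, out_dir):
    # Phase 1: append each record to the unsorted on-disc partition it maps
    # to. In the real system this step also sends the record over the
    # network to the node that owns that partition.
    os.makedirs(out_dir, exist_ok=True)
    files = {}
    for key, value in records:
        part = bisect.bisect_right(boundaries, key)
        if part not in files:
            files[part] = open(os.path.join(out_dir, f"part-{part:04d}"), "ab")
        pickle.dump((key, value), files[part])
    for f in files.values():
        f.close()

def phase2_sort(out_dir):
    # Phase 2: read each partition (sized to fit in memory), sort it in
    # memory, and write it back -- one read and one write per record.
    for name in sorted(os.listdir(out_dir)):
        path = os.path.join(out_dir, name)
        items = []
        with open(path, "rb") as f:
            while True:
                try:
                    items.append(pickle.load(f))
                except EOFError:
                    break
        items.sort()
        with open(path, "wb") as f:
            for item in items:
                pickle.dump(item, f)

# Usage, with hypothetical records and the boundaries computed earlier:
# phase1_distribute(my_records, boundaries, "/tmp/two-phase-sketch")
# phase2_sort("/tmp/two-phase-sketch")
```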
So to see that in action we start by reading a buffer of data off an input disc. We have a process
that's assigning it to these different partitions and copying it into in-memory buffers designed
for the different destination nodes, and then when these buffers get full we have some code
that sends them over the network to the node they belong to. Then on the receiving side as
data arrives we append it to a variety of these on-disc partitions. And just to give you a sense
of the numbers involved here, we've got eight output discs on our machine and each disc has
about 300-400 of these partitions on it. So you can think of it as 300-400 on-disc files that store
the data. In phase 2 we're going to read one of these unsorted partitions into memory, sort it,
and write it out.
So for the bulk of that pipeline, the details are in our paper, and it's actually pretty
straightforward to implement; the real complexity of the system is exactly this partition
appending module, because we have to ensure that we're writing out data to these discs in
large enough batches that the discs deliver good performance. And so I'll just describe very
briefly how we do that now.
So this module is given as input, on the left, a buffer of these key value pairs, and on the right
we've got a set of discs, each of them is holding a couple hundred of these partitions and
there's a thread for each one of these that's ready to write out the data to the disc.
So the first thing we did was we implemented the kind of most straightforward way of doing
this which is to sort of scan through this buffer of key value pairs and rely on the operating
system to deliver and manage the IO for us. So we just issued writes or scattered writes or sort
of the fancier writes. The result was that the system had low performance. And
the reason for that was really just due to the fact that there wasn't enough buffering handled
automatically by the OS to ensure that the writes getting delivered to these discs were
sufficiently large to run near their sequential speed.
So what we did was we sort of scrounged up as much of the memory as we possibly could,
about 80 percent of the memory on each node which was 20 gigabytes, and we managed all
the buffering ourselves. So we divided this memory up across all our different partitions, we
copied data into these partitions, and when they get full we write them out to disc. But I
mentioned that there's this non-uniformity of the input data. And so what ends up happening
is these partitions are either really hot or very cold. And so taken as a whole our memory was
not particularly well-utilized. So the result was that our writes were not particularly
large.
So what we ended up doing was building a load balancer that ran at runtime in front of our
discs, and it works as follows: we took that same 20 gigabytes of memory and now we divided it
up into 2 million little ten kilobyte buffers that we stick in a memory pool, and so as data starts
arriving from the network we basically are going to copy it into these little buffers and stick it
into a data structure here. And the way this data structure is organized is that we have a row
for each of our 2500 or so partitions, and each of these rows which corresponds to one
partition, what we're going to do is we grab a buffer, put our data in there, then we add it to a
list or a chain of these buffers per partition. And the nice thing about this data structure is that
as partition’s popularity varies during a run even on short timescales, we can extend some of
these chains to become longer and then some of the less popular partitions have shorter chains
but none of our memory is dedicated to data that isn't actively being used.
In parallel with this there is a process that's constantly scanning this data structure and it's
looking for the longest length chain which represents at that instant in time the largest write
that we can issue to the disc at a given time. So what we do is once we find this we pull it out
of this table, we send it off to this thread which is going to write it off to the appropriate on-disc
partition, and then it’s going to take all of these buffers, add them back into the pool, and this is
going to be the mechanism that we use to push back pressure to the
producing side of this pipeline.
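A simplified, single-threaded sketch of that chained-buffer structure follows. The buffer count and size here are placeholders (the talk mentions roughly 2 million 10 KB buffers), and the real system runs the scanning process and the per-disc writer threads concurrently rather than calling them inline like this.

```python
from collections import defaultdict, deque

class ChainedBufferWriter:
    def __init__(self, num_buffers, buffer_size):
        # A shared pool of small, fixed-size buffers.
        self.buffer_size = buffer_size
        self.pool = deque(bytearray(buffer_size) for _ in range(num_buffers))
        self.chains = defaultdict(list)  # partition id -> chain of filled buffers

    def append(self, partition, data):
        # Copy incoming data for a partition into pool buffers and link them
        # onto that partition's chain. An exhausted pool is the back-pressure
        # signal to the producing side of the pipeline.
        for off in range(0, len(data), self.buffer_size):
            if not self.pool:
                raise BufferError("pool exhausted: apply back pressure upstream")
            buf = self.pool.popleft()
            chunk = data[off:off + self.buffer_size]
            buf[:len(chunk)] = chunk
            self.chains[partition].append((buf, len(chunk)))

    def write_longest_chain(self, write_fn):
        # Pick the partition with the longest chain -- the largest write we
        # can issue at this instant -- write it out in one go, and recycle
        # the buffers back into the pool.
        if not self.chains:
            return None
        partition = max(self.chains, key=lambda p: len(self.chains[p]))
        chain = self.chains.pop(partition)
        write_fn(partition, b"".join(bytes(buf[:n]) for buf, n in chain))
        for buf, _ in chain:
            self.pool.append(buf)
        return partition

# Example: accumulate data for two partitions and flush the hotter one first.
w = ChainedBufferWriter(num_buffers=1024, buffer_size=10 * 1024)
w.append(7, b"x" * 100_000)
w.append(3, b"y" * 20_000)
w.append(7, b"x" * 50_000)
print(w.write_longest_chain(lambda part, blob: None))  # -> 7
```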
Now this handles non-uniformity in the actual input data, but I mentioned resources can be
non-uniform as well. And so imagine that we have a couple discs that are slower than other
discs. This same data structure actually handles the problem without requiring any
modifications. The process that is constantly scanning for these chains is only actually looking for
the subset of chains that could be issued at that given time. So what that means is if you have a
slower disc, the chains behind it are going to build up and you're going to end up issuing larger
writes to it, which is going to help mitigate this non-uniformity a little bit.
Okay. So we’ve looked a little bit of how we handle IO in TritonSort, and I want to talk about
our evaluation. So when we began the project, in terms of the hundred terabyte GraySort, the
Hadoop MapReduce project had been able to sort data at 0.578 terabytes per minute, and then
with TritonSort in 2010, 11, and 12 we were able to sort at 0.725 terabytes per minute using a
cluster of 52 nodes and it was just based on these issues that I just talked about. Now, as is the
case in any particular type of contest, eventually your record is taken back again. And at least
in terms of 100 terabyte GraySort, Hadoop was able to run at 1.42 terabytes per minute on
2200 nodes in 2013 with a much more recent version of Hadoop. There's some other
categories I didn't describe here like the Indy benchmark which you guys took back
from us. And so we're working I guess to see if we can retake that.
But I mentioned that it's not just raw performance that we were really interested in. The point
of this project was to focus on resource efficiency. And the community identified that as a
really important metric as well, and so in 2010 they added a JouleSort category which exactly
captures this eco-efficiency and we were able to capture that in 2011, 12, but also maintain it in
2013. And the reason for that is because even though we were beat out in terms of absolute
performance, if you just look at the quotient of the amount of work we did divided by the number of
nodes, we are able to push about two orders of magnitude more throughput through each of
our servers. So that's why we were able to keep that kind of performance.
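To see where that roughly-two-orders-of-magnitude figure comes from, here is the per-node arithmetic using only the throughput and node counts quoted above; this is just a sketch of the calculation, not numbers beyond what the talk states.

```python
# Per-node sort throughput implied by the figures quoted above
# (terabytes per minute, number of nodes).
results = {
    "Hadoop (Yahoo, 2009)": (0.578, 3452),
    "TritonSort (52 nodes)": (0.725, 52),
    "Hadoop (2013)": (1.42, 2200),
}
for name, (tb_per_min, nodes) in results.items():
    per_node_gb_per_min = tb_per_min * 1000 / nodes
    print(f"{name}: {per_node_gb_per_min:.2f} GB/min per node")
# TritonSort works out to roughly 14 GB/min per node versus about 0.17 for
# the 2009 result and 0.65 for the 2013 result, i.e. roughly 80x and 20x
# more throughput per server.
```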
Now I want to briefly mention fault tolerance at this point because any system you build has to
be fault-tolerant. I don't have a lot of time to go into the details here, but what I would say is
that the fault tolerance approach that you adopt really depends on the failure assumptions that
you have in the system. So if failures are really common you want a pessimistic approach to
fault tolerance, and if failures are rare you want an optimistic approach.
The [inaudible] pipeline I described relies on very aggressive pipelining. So we are not
materializing data at all whereas something like Hadoop requires materializing intermediate
data in the common case. And so what I would say is that we looked at some published results,
for example from Google in 2010, and kind of what you see is that at sort of a 10,000 node
cluster size you're seeing failures like every two minutes or something like that and so it's really
important that you actually want tasks, you want jobs, to be able to
survive individual faults. So materializing job state is probably a really good idea. We actually
talked to the people at Cloudera and the average Hadoop cluster sizes are order of something
like 30-200 nodes. And even adding an order of magnitude to this number of nodes you end up
with failure rates in this sort of double-digit hours to hundreds of hours timeframe.
And so here what we are going to argue is that it's actually okay to do job level fault tolerance
where you re-execute jobs on failure as long as the performance improvement you get by
running without fault tolerance is high enough to overcome the occasional job re-execution.
There's no hard and fast rule here, and I think there's sort of a dividing line depending on
exactly what your failure rate is, but this is something that we've been looking at. In the future
work that we've been, actually Alex Rasmussen, who is the lead student on the TritonSort
project, is we've been looking at taking trace data and selectively re-executing parts of
the pipeline that have failed or that parts of the pipeline that depend on a failure to mitigate
the cost of re-executing the sort. Maybe we can talk about that off-line. So, yeah.
>>: How much of your gains in terms of I think efficiency come from resource balance versus
fault tolerance? Like compared to your competitors?
>> George Porter: Well, if we look at Hadoop, for example, there is one extra materialization
that happens after the Map task which we get rid of. But actually there are three different
places in the pipeline where materialization happens, and two of those places are actually just
due to data skew. So what I would say is effectively a third of the IO we get rid of is due to fault
tolerance and two thirds is due to resource imbalance and data set imbalance.
>>: So why doesn’t Hadoop just do this? I mean are there practical reasons why they don't do it
because their developers are jerks?
>> George Porter: No, no, no, no, no.
>>: I guess what I’m wondering is, so you show like all these very nice results and so you would
think the Hadoop guys would say we're just going to nom, nom that, get in there and-
>> George Porter: The Hadoop people are great. I've worked, so I have a patch that's been in
there since 2008 for this and part of the problem is that, it’s actually concerning, in a sense they
are doing this. So there has been a huge move to these in-memory, completely in-memory
data processing sort of applications. And in these cases you're getting rid of effectively all the
data materialization at the expense of more cost. But I think that, so that's kind of one
extreme, sort of getting rid of all the data materialization. I think though that if you look at
actually where Hadoop’s going with projects like Tez, which is like this data flow thing, you're
seeing that they're giving users much more control over what materializations they do. So right
now you kind of fit into the MapReduce model, but if you're doing something like an iterative
job there is now a lot of support for being able to control exactly when those materializations
happen. So I think that's happening.
>>: But just to follow up on this, you take your own skew that you build, you run one job at a
time, so you're really optimized to [inaudible] disc. But if you now have a different cluster with
many jobs of different sizes, maybe not a large sort, how, could this directly be applied, would
you lose some efficiency, would we have to tweak things to make it really work?
>> George Porter: Yeah. So this is great because at a very high level this first part of the talk we
are giving up statistical multiplexing, and what we're doing is we are focusing on individual task
efficiency. So rather than sort of taking lots of different tasks that have heterogeneous
requirements, putting them on a system at a time, and then co-scheduling them, we are
dedicating resources without using [inaudible]. And one thing that I would say is that if you've
got a petabyte cluster and you have a bunch of 10 terabyte jobs you have a lot of opportunities
through doing [inaudible]. But if you're resource constrained, let's say you are a research
group, let's say you're a startup, you are working in biology or something like that, you may
want to solve petabyte scale jobs but you only have the resources to run terabyte scale
clusters. And so the question of is [inaudible] than not [inaudible] is an interesting question
when you’re not resource-constrained. But if you are resource-constrained you can't even ask
that question. So I think that it's good to focus on things like job and cluster scheduling,
obviously if you're doing [inaudible], but it doesn't hurt to also look at can you actually kind of
pull in as much efficiency as you can out of individual tasks when you don't have the
opportunity to do [inaudible].
>>: But it’s not clear how you're actually [inaudible].
>> George Porter: Yeah. I think-
>>: If you can get better-
>> George Porter: Yeah. So just to answer this real quick, if you assume that the compute and
the storage are co-located with each other you don't have a ton of choice in that matter. But if
you separate them, for example like with the Blizzard work from NSDI last week where you
actually get to sort of logically separate storage and compute, you could imagine dedicating a
very tightly connected set of machines to storage, getting full bandwidth of that storage, and
then when that job’s done now maybe an order of magnitude more computers can access that
same amount of storage. So you get late binding on that. I guess it depends.
I don't want to run too long, so I'm going to move a little bit forward. Sorting isn’t all we care
about, we also care about the data processing. So we built this system called Themis, I'm sorry,
so we implemented all of these different applications here and what I want to show, I don't
have time to go into the details, what you see on this graph is the performance of our
MapReduce system where the Y axis is throughput in terms of megabytes per second per disc in
each of our phases, the X axis is all the different jobs that we've done and different levels of
skew. What you see is for the vast majority of these jobs we've pushed our storage
performance to something similar to our record-setting sort performance. Now I said almost all, there is
this Cloudburst example where in the first phase of Cloudburst it is IO bound and so we see that
performance improvement, but the second phase isn't IO bound which kind of exposes this
point we talked about at the beginning of the talk which is that when you get rid of one
bottleneck you can oftentimes push it somewhere else. And a place that you typically push it is
the network.
Now for us this wasn’t a huge problem because Cisco donated one of these big data center
switches to our group and we only had 52 nodes and we had enough ports to give full bisection
bandwidth, but if you've got 1500, 2000 nodes that's not such an easy problem. And that kind of
leads to the second part of my talk which is focusing on the data center interconnect.
So just like applications have changed, the network has changed quite a bit as well, and we've
seen this enormous growth in terms of data rates. So the types of networks that people have
built to address this growth and performance has changed quite a bit. My first exposure to
these kind of networks was in 1994 when I worked at an ISP in Houston and the way
that a lot of networks were built then and even today is as these tree type structures. And
you've got nodes along the bottom and then you have the layers of switching getting
increasingly powerful as you move towards the root. And if you imagine let's say a one-hundred-
thousand-node data center at 10 gigabits a second, that's a petabit per second of aggregate bandwidth
demand.
Now the real problem is that you simply can't, from a technology point of view, buy core
switches that are fast enough to actually handle all of this bandwidth, and so researchers have
actually looked back to the 1940s and taken ideas from Bell Labs kind of in the 40s and 50s and
adapted them for data center designs. And so this is what is called a folded-Clos multi-rooted
tree, and some version of this is deployed in many types of data centers, and this was proposed
in [inaudible] 2008. The key thing here is that we don't have these really powerful switches in
the middle of the network. Instead, if we have is 10 gigabyte a second servers, all of the
switches on our network are 10 gigabits a second. And we get all of that bandwidth by relying
on multi-pathing to deliver an aggregate amount of bandwidth.
So if you have enough links in the network you can load balance and distribute traffic
appropriately to get a higher amount of aggregate bandwidth, and what you've done is traded
off impossible-to-buy switches with a very challenging but solvable-with-money problem of
adding lots of links into this network. So a 64,000 node data center has about 200,000 links in
it. And these links are incredibly expensive to kind of deal with installing them and managing
them. They’re also very expensive in terms of cost. And as we move from 10 to 40 to 100
gigabits a second of Ethernet they're going to get disproportionately more expensive. And the
real reason for that is that we can't rely on the copper cables that we know and love and we
have to move to fiber optics. The reason for that is because of a property of copper cables
called the copper skin effect, which roughly speaking, says that the faster the data rate of the
cable the shorter it has to be.
So at a gigabit you can buy spools of hundred meters worth of Ethernet. The second you go to
10 gigabits you're down to order 10 meters, and at 100 gigabits you’re talking about a couple of
meters in length. And remember, these are warehouse scale buildings and so we have to
overcome this length limitation in some way. And so the way people do that is to rely on optics
which don’t have this copper skin effect. So you can send very high-bandwidth, you can create
very high bandwidth links at very long lengths this way.
Now the problem with optics is that you have to have some way to convert between the
electrical signals that the switch understands and the optical signals inside the fiber and so you
need a transceiver at either end of this cable that has a laser, a photo receiver in it, which is
used to make this conversion. And these transceivers are sort of ballpark 100 dollars, maybe 10
watts at 100 gigabits a second. And I know that several of you would have much better precise
information about the pricing. This is sort of based on external information and papers and
stuff like that. But the point is that they are not trivial in terms of cost, and you need two of
these for each of these say 200,000 cables. So it adds up to a lot of money and a lot of power.
To look at the implications of that, if we imagine 100 gigabit a second multi-rooted tree here
and we look at the path from a given source to a given destination, what we see is that the
packets transiting this path are constantly being converted to and from optics at each of these
switch hops, at each layer of switching from the leaf up to the core and then from the core back
to the leaf. So the implication of this is that for every device attached to the network there’s
roughly speaking 4-8 of these transceivers in the network kind of conveying the traffic for that
device. And so at 100,000 nodes that's like a megawatt of power and tens of millions of dollars
or more.
And if we step back from that for a second I think it's worthwhile asking why are we doing all of
this packet switching? Why are we doing that? And what I would say is that these folded-Clos
networks, the service model they actually provide is that they allow you to make a different and
a unique forwarding decision for each packet that you send in the network. But that service model
is, I'm going to argue, too strong for many data centers; and as a result there’s a gap between
the service model we are providing and the service model we could potentially provide, and
this gap is how we are going to get resource efficiency.
And to say what I mean in more specificity, there's a lot of locality in data centers. And actually
Microsoft has been great about publishing actual results from your networks. And this is a
picture that's reproduced from one of these papers. And it’s a little bit dated at the moment,
but it does show kind of the rack to rack traffic at an instant in time. And although the details
change over time, what you can see is that a bulk of the traffic is going to a relatively small
number of output ports. So there's a certain amount of spatial locality in these systems.
But if we kind of look bottom up as well, we see that there's a lot of temporal locality as well.
And so my student, Rishi Kapoor, published a paper at CoNEXT last year where he
looked at a 10 gigabit server, and he deployed a variety of kind of representative applications
on top of it, and he measured the packets leaving that server at micro-second timescales. What
you saw was that because of all the batching that happens in applications, in system calls, in the
operating system kernel, in the NIC hardware, all of that sort of buffering and batching ends up
translating into tens to hundreds of packets that are correlated in nature. So you tend to see
when servers send large amounts of data from one place to another they tend to do it in these
kind of correlated bursts. And so the key idea behind this second part of the work is to use this
temporal and spatial locality to build cost-effective networks by adopting circuit switching in
addition to packet switching.
So if you're not familiar with circuit switching I'll give you a very brief example. This shows a
one input port, two output port circuit switch; and you can think of this as just an empty box
that has some mirrors inside of it. And light enters the input port, it reflects off these mirrors,
and it leaves an output port. If you want to make a circuit switching decision there are tiny
motors underneath these mirrors that can move them and this changes the angle of reflection
and causes the light to leave out of a different port.
Now this is great. You don't need any transceivers. We're not doing this conversion. And it
supports effectively unlimited bandwidth, meaning as we go from 10 to 40 to 100 gigabits a
second this technology doesn't have to be changed. But circuit switching is an incredibly
different service model than packet switching. And so this isn't just a drop in replacement for
your packet switches, you really have to kind of rethink the entire network stack. And to give
you kind of an example of that I'm going to talk about one aspect of the service model that has
changed which is called the reconfiguration delay.
So the reconfiguration delay Delta is the amount of time it takes to change the input to output
mapping of that circuit switch. And it's, roughly speaking, the time to move those little mirrors.
And this determines how much locality you need for these circuit switches to be applicable in
your network. If Delta is really large it means that it's incredibly expensive to change the circuit
mapping, there's a very high overhead, and so you only ever really want to support very large,
highly stable, long-lived connections that in the networking world we call elephant flows.
Now on the other hand, if Delta is really small, you could very rapidly reassign circuits on short
timescales and so you can support highly bursty, unpredictable traffic called mice flows.
Now I want to point out Delta is not fundamental, it’s a technology-dependent parameter. It
depends on how you build these mirrors and some other aspects of the technology. But it is a
very important parameter that determines this mixture of circuits to packets. Yeah.
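One way to see why Delta matters: if a circuit is held for time T after a reconfiguration that takes Delta, the fraction of time the link actually carries data is T / (T + Delta), so the hold time has to be a healthy multiple of Delta. A quick back-of-the-envelope sketch, using the two Delta values that come up later in the talk; the 90 percent duty-cycle target is an arbitrary illustrative threshold, not a number from the talk.

```python
def duty_cycle(hold_time_s, delta_s):
    # Fraction of time a circuit actually carries data when it is held for
    # hold_time_s between reconfigurations that each take delta_s.
    return hold_time_s / (hold_time_s + delta_s)

# ~30 ms for the 3-D MEMS switch, ~2 us for the binary MEMS switch.
for delta in (30e-3, 2e-6):
    needed = 9 * delta  # duty_cycle(9 * delta, delta) == 0.9
    print(f"delta = {delta:.0e} s -> hold circuits for >= {needed:.1e} s "
          f"({duty_cycle(needed, delta):.0%} duty cycle)")
```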
>>: Did you also account for the delay for measuring the traffic? You have to measure something
and make the decision to switch [inaudible]?
>> George Porter: Yes.
>>: The measuring time can be much longer than this.
>> George Porter: Yes, yes, yes. This is great. So this is talking about the actual data plane.
Now you're talking about a control plane issue about how do you figure out what signals to
send. I'll describe that in a minute, but we started with an observe, analyze, act approach. That
became too slow, and so we ended up with a proactive approach that I'll describe in the talk.
This is an exact problem we dealt with.
Okay. So we have the sense that the majority of the traffic has locality, but not all of it. So the
way you can think about that pictorially is as follows. Imagine that in a network
with N connected devices there are N squared possible connections. So we could rank order all of
those N squared connections by the amount of traffic per connection; and because there's
locality, the picture looks, roughly speaking, like this where a bulk of the traffic is in a relatively
small number of these connections. And so this leads to what we’re saying is a hybrid design
where we are going to rely on both circuit switching and packet switching. So what we like to
do is take the head of this distribution and send it over these circuit switches, and then this
relatively long tail that has a lot of different connections but not a lot of bandwidth we’re going
to send over a less expensive, lower speed packet switch network. And it’s exactly this Delta
value that determines this mixture of packets and circuits.
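As a toy sketch of that head/tail split (the demand numbers and port count below are made up purely for illustration, and this ignores the constraints a real circuit switch imposes):

```python
def split_traffic(demand, num_circuit_ports):
    # Rank the possible (src, dst) connections by offered demand: the head
    # of the distribution goes over circuits, the long tail stays on the
    # cheaper, lower-speed packet-switched network.
    ranked = sorted(demand.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:num_circuit_ports]), dict(ranked[num_circuit_ports:])

# Made-up demand (bytes queued per pod pair).
demand = {
    ("pod0", "pod3"): 9_000_000,
    ("pod1", "pod2"): 7_500_000,
    ("pod0", "pod1"): 120_000,
    ("pod2", "pod3"): 80_000,
    ("pod1", "pod3"): 5_000,
}
circuits, packets = split_traffic(demand, num_circuit_ports=2)
print("circuit-switched:", circuits)
print("packet-switched: ", packets)
```

In a real hybrid network the circuit assignment is also a matching problem, since each circuit port can only point at one destination at a time, so the scheduler computes something closer to a max-weight matching than this simple top-k cut; but the head-versus-tail split is the same idea.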
Okay. So I told you Delta is technology-dependent, and when we started the project we had to
get some sense of what that value was, so we obtained an optical circuit switch that was
developed in the late 90s for the Telecom industry and we characterized it in our lab. And what
we found was that the Delta value is about 30 milliseconds, and what this means is that you
need to keep circuits up for hundreds of milliseconds to seconds or longer to amortize that
overhead. And so it's really only appropriate for very highly stable, long-lived traffic, and the
place you see long-lived traffic in the network, generally speaking, is at the core, where we have a lot of
aggregation. And so this led to the development of the Helios Project which was presented in
[inaudible] 2010, and I showed you that multi-rooted tree before. Imagine we're just focusing
on the core switching layer only and we're going to get rid of most of the packet switches in
that switching layer and we're going to replace them with a smaller number of these circuit
switches; and let’s abstract the rest of the network away into these things we call pods, which
are roughly speaking, about 1000 servers or so. So the servers, the links, the switches, these
are in these pods.
The idea behind the Helios Project was to support this type of an environment. And so what we
do is we start by sending traffic over our packet switches and then there's a process that's
looking for these elephant flows, and whenever it finds one of these elephant flows it’s going to
add updated flow rules down here to move it over to the circuit switch. So this is what you
were talking about, about the time it takes to do that. Because we've got 30 milliseconds it’s
like all the time in the world. So it's not a particularly big deal. So we end up, there's details in
the paper about exactly how we do that, but finding those elephant flows, moving them over to
here, we can achieve all of that in this kind of tens of milliseconds time bound. Yeah.
>>: I feel like there's something of a tension between the two halves of your talk which is that
the first part of the talk is about efficiency, which by definition will try to use all things equally,
and it seems like it's exactly the opposite of what you want [inaudible].
>> George Porter: Yeah, yeah. So there's a commonality which is that this is also getting rid of
[inaudible], if you want to think of it that way; but to get to your specific point, I think that one
of the main things here is that what we're doing here is we are in a sense trying to build a
network that matches the average case utilization even though that average case is rapidly
changing and the set of nodes that need high bandwidth is also rapidly changing. Today, your
only real option is to effectively provision for the worst case. And so in this way we are able
to, as applications, as their communication patterns change we are able to migrate things. But
even in that TritonSort case, if you think about the two phases, in phase 1 we were fully utilizing
the network. But in phase two we actually had no network at all. This model would allow us to
take that resource away and move it to another instance that does need the network. So if we
were over to overlap phase ones and phase twos in two different clusters we could actually
share the network between the two.
All right. So the result of the Helios Project was that we were able to get rid of one of these
transceivers in the core which doesn't seem like a big deal, but it actually represents a very
large cost complexity and power savings because when we looked at this original network we
were looking at 10 gigabit networks and so all of these pods we could entirely interconnect
them internally with electrical cabling. You only really needed optics for these core switch
layers and that's where all the transceivers were. But as we want to start moving to 100
gigabits a second we’re not going to be able to make that assumption anymore because we’re
going to have to start putting optics inside of these pods just because of the length limitation.
So we need to start pushing circuit switching closer to the host into these pods, and that led us
to our second project, which is called Mordia. And the thing is, if we were able to aggregate over 1000 servers with this 3-D MEMS technology, which was relatively slow, then to put circuit switching down at the host we need a technology that's, roughly speaking, 1000 times faster. And so we identified such a technology, which is a different kind of circuit switch device called binary MEMS. It's a little bit different, and what I will say is that the advantage of this binary MEMS technology is that it's very fast. It's about two microseconds, three orders of magnitude faster, but the downside is that it's not scalable. You
can only buy switches that are maybe four, eight ports in size. Yeah.
>>: [inaudible]. Are you making a strong assumption where predictability [inaudible]. There is
[inaudible] keep informing [inaudible] mislabeled mice and elephant you're getting a
[inaudible]?
>> George Porter: Yeah. Implicit in this particular design is the idea that, because Delta is so high, we are only going to consider traffic that's stable for over a second. These little bursts are too fast, because it's 30 milliseconds just to assign a circuit. And so for this design the only kind of traffic that we can actually support is traffic that's going to be stable for order a second or longer. Now with this technology, because we can reconfigure it in two microseconds, we actually only need traffic that's stable for about 100 microseconds to assign a circuit to. And so one of the major points here is that what we mean by circuit traffic or locality traffic depends on the Delta value. And so now we can actually support the bursts of a given server using circuits. But, in getting to this point, we can't do things like measure demand and install flow rules, etc., because we only have a couple of microseconds to do that. So we have to be proactive, and that's what this project deals with. Yeah.
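The arithmetic behind those two operating points is simple; here is a minimal sketch, where the efficiency figures are just the ratios implied by the numbers in the talk:

def circuit_efficiency(stable_duration_s, reconfig_delay_s):
    """Fraction of time the circuit carries data rather than reconfiguring."""
    return stable_duration_s / (stable_duration_s + reconfig_delay_s)

# 3-D MEMS: 30 ms reconfiguration needs ~1 s of stable traffic
print(circuit_efficiency(1.0, 30e-3))      # ~0.97
# Binary MEMS: 2 us reconfiguration needs only ~100 us of stability
print(circuit_efficiency(100e-6, 2e-6))    # ~0.98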
>>: [inaudible] of the flow is very, like a few microseconds. That's going to [inaudible] overhead
on the switches.
>> George Porter: So we don't, so instead of estimating traffic based on looking at packet counters in switches, what we actually do is measure demand by looking at hosts. So we actually look at the hosts, at what their send buffers hold, to see what their demand is going to be in the future. And we can-
>>: And you can collect the statistics from the host and that could also take longer than
[inaudible]?
>> George Porter: So it takes order microseconds, tens of microseconds let's say, to have the
host send this data out. And I don't have time to get to it in the talk, and I actually don't have
slides to this, but it turns out that in this Mordia design you can think of a pair of ToRs, two
ToRs connected to each other, so they only actually have to exchange information on a pairwise
basis. So you don't have to collect this globally, do a decision, and send it back out again.
Maybe we can talk about that off-line.
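As a rough illustration of host-side demand estimation, the sketch below reads the send-buffer backlog of open sockets; the Linux SIOCOUTQ ioctl is one way to do that, and the per-ToR reporting structure is an assumption, not the Mordia implementation.

# Sketch: estimate future demand from how much data is queued in send buffers.
import struct, fcntl

SIOCOUTQ = 0x5411  # Linux ioctl: bytes queued in a socket's send buffer

def send_buffer_backlog(sock):
    """Bytes sitting in this TCP socket's send buffer, i.e. near-future demand."""
    buf = fcntl.ioctl(sock.fileno(), SIOCOUTQ, struct.pack("i", 0))
    return struct.unpack("i", buf)[0]

def demand_report(open_socks_by_dst_tor):
    """Aggregate backlog per destination ToR; in the real system this report
    is sent out on the order of tens of microseconds."""
    return {tor: sum(send_buffer_backlog(s) for s in socks)
            for tor, socks in open_socks_by_dst_tor.items()}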
>>: Sure. So [inaudible]-
>> George Porter: I have a couple more minutes to go and then we can get to all these
questions. And in fact, there's details in the paper about how we built Mordia, and I'm actually
going to skip over how we did it, but the key idea is that these switches are able to support
multiple wavelengths of light. And by making a copy of the light by tapping some of the light
out of a fiber we can actually replicate the signals across multiple of these stations and each of
these switches can make orthogonal switching decisions. And this is the key idea that we used
to scale up our design. And by adding a variety of these switches into one or more of these ring
networks you're able to support order 600 ToRs or so with this design.
Okay. Now we built Mordia over at UCSD using these switches that we got from this startup, and we connected them to our servers, and we measured the switching time of the composed system, and it is in fact this two-microsecond result. We were able to keep that fast switching time even though we scaled up to, in this case, 24 ports. The key idea here is that with a two-microsecond switch time we only need 100 microseconds of stability for something to be circuit-friendly, and that means we can actually support the traffic of a single server. And this led to our most recent project, which we just presented last Wednesday at NSDI over across the water there, which is exactly building a top-of-rack switch, a hybrid switch that speaks both circuits and packets at the ToR layer.
And this is the premise of that project. So it's very simple. If we've got a 10 gigabit packet
switch network and we overlay a 100 gigabit circuit-switch network into our data center, these
effectively can be put together to build a 100 gigabit packet switch for data center workloads,
meaning that if there is sufficient temporal and spatial locality defined by 100 microseconds
worth of bursts, then you can deliver a service model akin to that of this extremely expensive network using two much less expensive network technologies. And we built REACToR using this as our premise; we built an eight-port REACToR prototype and we hooked it up to our Mordia network, and I just want to show you one of the graphs from that paper and then I'll sort of
conclude.
So the idea behind REACToR is to give the performance akin to a 100 gigabit packet switch but
using circuit switching. So what we did is we deployed eight nodes and we have seven of the
nodes sending data to the eighth node, and this is the view from the eighth node. So what it's
seeing is that the x-axis is time in seconds and the y-axis is throughput, and what you're seeing is
each of the incoming flows is relatively stable, very nicely fair, very uniform, looking like kind of
a very smooth packet switch network, but in reality if we zoom into this at microsecond
timescales what we see is we are actually rapidly multiplexing small bursts of data from all of
these different hosts and delivering them to the end host. The key idea is that we are able to
rapidly multiplex that link fast enough that the transport protocol and the OS don't realize anything's going on. And the analogy here is to process scheduling, where if you just schedule things fast enough nobody notices that they don't have the resource all the time.
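As an analogy-level sketch, with an assumed round-robin policy and slot length rather than the actual REACToR scheduler, time-slicing the receiver's link looks like this:

# Toy round-robin "circuit" schedule: seven senders share one 100 Gb/s
# receiver link in ~100 microsecond slots (illustrative numbers).
LINK_GBPS = 100
SLOT_US = 100                      # burst granularity, per the talk
SENDERS = [f"host{i}" for i in range(1, 8)]

def average_rate_over(window_us):
    """Per-sender throughput averaged over a window much longer than a slot."""
    slots = window_us // SLOT_US
    per_sender_slots = slots / len(SENDERS)        # equal round-robin share
    return LINK_GBPS * per_sender_slots * SLOT_US / window_us

print(average_rate_over(1_000_000))  # ~14.3 Gb/s per sender: looks smooth at
                                     # coarse timescales even though the link
                                     # bursts to one sender at a time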
So the key idea behind this line of work has been to focus on the predominant sources of cost and power in these networks, which are, very surprisingly and sort of counter-intuitively, cabling costs and transceiver costs. And so we built a variety of projects that have dropped that number down to close to one transceiver per host at 100 gigabits a second, and what's nice is that as we go even beyond 100 gigabits a second this same approach should apply as well. And I just briefly want to mention that what I've described thus far has been taking existing building blocks and prototyping with them to build new types of networks, but we are
also wanting to complete that loop to build new building blocks as well. And so we started with
commercial technology, we built some prototype technology, and now we are interested in
building novel devices designed for data center environments. So we are doing that inside of an NSF engineering research center called CIAN, which is about 12 institutions and about 30 PIs, most of whom are photonics people and physicists. And the idea is that, across all of these organizations, we are going back to the fact that a lot of the building blocks we used were designed for the Telecom industry. So what we're doing is kind of unwinding that decision tree back to the assumptions that were made in building optical devices, and instead of targeting them towards the Telecom world, what we are now doing is targeting them towards the data center world, which has a very different set of assumptions. So
there are people in the center that are building new devices, and this is one example that sort
of has come together in the last month and then I'll conclude.
So I mentioned in the Mordia design that we have these binary MEMS switches in the network
that are making switching decisions, and they're basing those decisions on the wavelength of
light that is entering them. Now that's kind of an expensive design actually because these
switches are very expensive. Now what you can do instead is this: there are researchers in the center who have been working on silicon photonic tunable lasers, which sounds very sci-fi but is actually not much more expensive than building regular transceivers today, and the cool thing about this is that by changing the frequency at the source you actually don't need those switches. You can build an entirely passive interconnect network and simply have the sources change the frequencies that they transmit on. And so, through the center, we were able to take that research, send it to a fab and build it onto a chip, and then partner with this company in Berkeley to package that in an SFP module that we could then reinsert into the Mordia switch.
And so that happened about three weeks ago, and so that's kind of what the future work is for
this project which is to lower the cost of that network.
Okay. So in summary, it's important to sort of pivot away from the question of scaling to scaling in a resource-efficient way, and I've talked a little bit about IO-efficient data processing and data-intensive networking. And before I conclude I want to acknowledge the incredible
students that I’ve had the opportunity to work with that have been driving much of this
research. And with that I’d like to thank you for your time and open the floor to questions.
Yeah.
>>: It's good that you got the switching [inaudible] so small, but we’ve also found that if you're
willing to deal with extra traffic hops you could kind of get away without the need for a lot of
switching by [inaudible] a lot of effort.
>> George Porter: Yeah.
>>: Have you used any of that for-
>> George Porter: Yeah, yeah, yeah. So this is things like OSA and other projects that rely on
kind of overlay or multi-hop. It's another degree of freedom. So all of the designs that I’ve
talked about have been effectively either zero hop or one hop depending on how you look at it.
And the second that you can start forwarding traffic through intermediaries, maybe using
things like RDMA or something, it gives you this additional degree of freedom where now you
have a scheduling decision which is, do I wait until I get a circuit assigned to me, or do I sort of
send data to some intermediate point which can then send it on my behalf? And it's actually
quite interesting because nothing that we’ve talked about precludes that, it's just that that
hasn’t been something that we've been focusing on.
>>: You mentioned a few microseconds. That doesn't make sense to really use it for them?
>> George Porter: So this is an interesting point, because what you're touching on is in some sense fundamental here. I think that there's an interesting sweet spot at the kind of 0.1 microsecond timescale, because at 10 gigabits a packet is, you know, 1.2 microseconds or so. And in order to use circuit switching, I should say, we haven't looked at optical packet switching at all. We've really been focusing on circuit switching because it's something that's practical and that can be deployed. And so what you really need is a certain burst size. And if you're talking
about 10 or 40 or even 100 gigabits a second, one microsecond gives you reasonable burst sizes that seem to match well with what servers are able to generate. If we were to push that switching speed much lower we would end up hitting the packet boundary and building effectively a packet switch, which isn't really what we want to do; and if we pushed it to a higher value we would end up needing so much burstiness that it would be very difficult to get servers to generate that. So it happens to be a particularly attractive spot at sort of 0.1 microsecond or so. So we haven't looked at trying to make switching faster.
There are technologies that could do that. There are SOA switches and things that operate in nanoseconds, but from our point of view we'd lose our circuit switching benefit by adopting those techniques.
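To put rough numbers behind that sweet-spot argument, with an assumed 1500-byte packet:

def serialization_time_us(num_bytes, gbps):
    """Time to put one packet on the wire, in microseconds."""
    return num_bytes * 8 / (gbps * 1e3)

def burst_bytes(stability_us, gbps):
    """How much data a circuit-friendly burst must contain."""
    return stability_us * gbps * 1e3 / 8

print(serialization_time_us(1500, 10))   # ~1.2 us: one full-size packet at 10G
print(burst_bytes(100, 10) / 1e3)        # 125 KB burst for 100 us at 10G
print(burst_bytes(100, 100) / 1e6)       # 1.25 MB burst for 100 us at 100G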
>>: So why aren't people building just like, almost like RF-inspired optical switches. Like RF
doesn't need any switching. You could just have the transceiver side just try things.
>> George Porter: So you mean like wireless?
>>: Yeah.
>> George Porter: Well, I would say, I mean the work on say 60 gigahertz wireless in the data
center in a sense needs switching because you're either physically moving things or you're
somehow choosing a different target to transmit to I guess.
>>: So that’s [inaudible]. I was thinking like an optical switch that instead of like, I think part of
the thing it seems to me like this scheduling control plane idea is [inaudible]. It has complexity.
>> George Porter: Yes.
>>: And that complexity is coming from the fact that you have this kind of switching schedule
like no matter how fast it still [inaudible].
>> George Porter: Yes. That's right.
>>: That doesn't show up like in cellular domain or whatnot. Like my phone can't [inaudible] at
any time without actually having like a global schedule across all transceivers-
>> George Porter: Well, you need channel mitigation, so that's one thing. Imagine this network, so
with optics it's like a wireless network where every terminal’s a hidden terminal. So think of it
that way. Imagine a wireless network where every node was hidden. The tunable laser idea
that I just talked about, one of the things that's interesting is that you end up in a situation
where if two ToRs, let's say, choose the same frequency to transmit on, that interference will cause data loss. And so what we're going to do to solve that is a very simple, kind of brute-force approach, which is to create a registry service wherein you opportunistically acquire a channel and then you register with the service that you got the channel; and if two devices end up conflicting on that, one of them will win and the other will stop sending, and we can bound the amount of time involved, like the slot time, so the contention time can be small. These are just techniques from the wireless domain, and then we can code over that so we don't lose data. So this is very much like a carrier sense approach, just in the optical domain. My understanding, and I'm not a physicist or an optics person really, I'm a computer scientist, but my understanding from talking to the optics people, because I asked a lot about things like CDMA, for example, could we do OFDMA or whatever, is that it's not a very promising approach. But that's a little bit out of my domain.
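A minimal sketch of that kind of registry service, with an assumed first-writer-wins rule and made-up API names:

# Carrier-sense-like channel registry: senders opportunistically claim a
# wavelength, then register it; on a conflict the first registrant wins and
# the loser backs off and retries, with coding covering the brief overlap.
class LambdaRegistry:
    """First-writer-wins registry of who owns each wavelength."""
    def __init__(self, num_lambdas=100):          # O(100) channels per fiber
        self.all_lambdas = set(range(num_lambdas))
        self.owner = {}                            # wavelength -> ToR id

    def claim(self, tor_id, preferred):
        """preferred: wavelengths in preference order; return the one granted,
        or None if all of them are taken (caller backs off for a slot time)."""
        for lam in preferred:
            if lam in self.all_lambdas and self.owner.get(lam) in (None, tor_id):
                self.owner[lam] = tor_id
                return lam
        return None

    def release(self, tor_id, lam):
        if self.owner.get(lam) == tor_id:
            del self.owner[lam]

registry = LambdaRegistry()
print(registry.claim("tor3", [5, 7, 9]))   # -> 5
print(registry.claim("tor8", [5, 7, 9]))   # -> 7 (5 is already owned by tor3)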
>>: Just to be clear what he's saying, couldn’t you choose a different lambda? That's what he's
talking about.
>> George Porter: Oh yeah, that's what we do. We pick-
>>: [inaudible] do it that way?
>> George Porter: Yeah. So our approach is basically that you give it a list, in preference order, of the lambdas you want, and it tells you back that you can have three and five, or whatever.
>>: But presumably, there are fewer lambdas than N squared.
>> George Porter: Oh yeah, yeah. Absolutely. So the number of lambdas-
>>: [inaudible].
>> George Porter: Current technology, the number of lambdas is O(100). Think of it that way.
>>: [inaudible]?
>>: Not for N squared but just [inaudible].
>> George Porter: If you want to support, say, 600 ToRs, you'd need to add space switching into that as well. That's what we're going to do. Because you can't really fit more lambdas into a fiber, but you can have multiple fibers effectively, and what you can do is choose which fiber you're going to send data on and then which frequency you're going to send on within that fiber. And that's where I was mentioning that point about how we don't need a fully global scheduling decision: we do need a global scheduler that decides which ToRs are connected to which other ToRs, but once those two ToRs are connected to each other, the specific frequencies that are being used can be negotiated on a pairwise basis, which is a much simpler problem to deal with. Yeah.
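To make that two-level split concrete, here is a rough sketch; the greedy matching and the data structures are illustrative assumptions, not the actual scheduler.

# Two-level scheduling sketch: a global scheduler only decides which ToR
# talks to which ToR; the pair then negotiates fiber and wavelength locally.
def global_pairing(demand):
    """demand: {(src_tor, dst_tor): bytes}. Greedily match the heaviest pairs."""
    used, pairs = set(), []
    for (src, dst), _ in sorted(demand.items(), key=lambda kv: -kv[1]):
        if src not in used and dst not in used:
            pairs.append((src, dst))
            used.update((src, dst))
    return pairs

def pairwise_negotiate(src_prefs, dst_free):
    """Pick a (fiber, wavelength) from the sender's preference list that the
    receiver can also tune to; no global frequency plan is needed."""
    for fiber, lam in src_prefs:
        if (fiber, lam) in dst_free:
            return fiber, lam
    return None

pairs = global_pairing({("torA", "torB"): 9e9, ("torC", "torB"): 4e9})
print(pairs)                                                   # [('torA', 'torB')]
print(pairwise_negotiate([(0, 5), (1, 7)], {(1, 7), (1, 9)}))  # (1, 7)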
>>: [inaudible] hybrid [inaudible]? [inaudible]
>> George Porter: Yeah. So this is actually a great point. There are two things I can say about that. One is that this stuff is super reliable because it's built by the Telecom industry; it's also very expensive. They had these goals of like 10 to the minus 18th bit error rates. That's the reason this stuff is expensive, and I mentioned before that we're winding back that decision tree to build devices that are [inaudible] to data centers; we just don't need a 10 to the negative 18th bit error rate. We have other ways of dealing with errors, so if relaxing that made the network substrate significantly cheaper and more integrated it might be a good trade-off.
So what I would say is the Telecom stuff we are using has been incredibly reliable. And then there was another point that I was going to make, so, yeah. The technology we've used has been fine so far. But in terms of handling failures in this model, because of this 100 channels per fiber, if we moved to a multi-fiber or multi-ring type network, the way you can think of that is that I'm now spreading traffic over multiple rings; and so if one of these N rings were to fail in some way, that proportionally reduces my bandwidth to N minus one Nths of what it was. And so there's a kind of nice failure recovery model there as well. So it's not all or nothing. You can imagine degrading the service based on the number of failures that you get. So if a laser fails or something like that you might lose a fifth of your bandwidth or something like that, but you don't lose 100 percent of your bandwidth.
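That graceful degradation is just proportional striping across rings; a tiny sketch with assumed numbers:

def bandwidth_after_failures(total_gbps, num_rings, failed_rings):
    """Traffic is striped across rings, so losing a ring degrades gracefully."""
    return total_gbps * (num_rings - failed_rings) / num_rings

print(bandwidth_after_failures(100, 5, 1))   # 80.0: lose a fifth, not everything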
>>: Have you also seen [inaudible] same thing. [inaudible]?
>> George Porter: I don't have any concrete quantitative data on failure rates between the two
of them but what I would say is that the optical stuff, especially the stuff in the Telecom
industry, is really reliable. So there are failures that occur, of course, but it hasn't been a big
problem for us. Now one of the things that I will say is that these tunable lasers, the way they
actually tune is that they have very small heaters next to the laser that change the temperature
because the frequency that they transmit on depends on the temperature. Now the flip side of that is that in the data center, if you have changes in temperature, you have to have some way to stabilize
that temperature. And so one of the areas that's really, that a lot of the optics people are
working on is on devices that don't require active cooling and stabilization of temperature. But,
yeah. We didn’t run into that problem at all. So we have a sort of a data center like the size of
this room at UCSD and it’s got some chillers and stuff in it, and we've not experienced any
failures in three years. But we are very small scale.
>>: For example, like vibration effects [inaudible]?
>> George Porter: No. Nothing like that. We also, just as a side point, for the sorting record we
have 1000 spinning discs. We never saw any vibration effects at all in the time we used them.
Yeah.
>>: You measure the scale effect copper?
>> George Porter: Yes.
>>: At lower frequencies they’re fixed or making error. So replacing a single cable by a grade of
[inaudible] cable [inaudible] or does that get too ridiculous as the frequency?
>> George Porter: My understanding is that it does. And again, you all are the experts here,
but the labor involved in plugging all these cables in is very nontrivial. It's gotten to the
point where organizations like Google are building these robots to build cable assemblies so
they can plug one layer of switching into another. And the second you say I want to take 10 big
fat cables and put 500 of them together into a bundle this big, I think people don't like that.
>>: All right. Let's thank George once again.
>> George Porter: Thank you.