>> Ratul Mahajan: Hi. Good morning everybody. Thanks for coming. It's a great pleasure today to have Andy Curtis who is visiting us from the University of Waterloo and he's going to tell us about a bunch of exciting stuff that he's done at Waterloo and with some HP collaborators around managing and operating data center networks. >> Andy Curtis: Thanks. Good morning everyone. Today I'm going to show you how to reduce the cost of operating a data center network by up to an order of magnitude. Now because the data center network is 10 to 15% of the total cost of operating a datacenter, this can result in pretty significant cost savings. The datacenter has been described as the new computer and the network plays a crucial role in this new computer. It interconnects the compute nodes, so we need a high-performing network. In particular, network performance is critical for doing things like dynamically allocating servers to services. This allows you to dynamically grow and shrink the pool of servers assigned to a service, so you can maximize your server utilization. If you don't have enough bandwidth in the network, then you need to statically assign enough servers to your service to handle peak load, and this results in very underutilized servers, which then costs more because you have to buy more servers. Another thing that a high-performing network is useful for is doing things like quickly migrating virtual machines, and this is also useful for service-level load balancing, and if the network doesn't have enough bandwidth it can be a serious bottleneck in the performance of big data analytics frameworks. For example, in the shuffle phase of a MapReduce job, 2 terabytes of data may be transferred across the network. Now when designing any sort of network we need to take into account the constraints and goals of the target environment. So the datacenter has a few new things to it. First of all, there is its huge scale. The network needs to be able to interconnect hundreds of thousands of servers, and with very high bandwidth as I just mentioned. An additional, lesser-considered requirement of the datacenter is that the network needs to handle the addition of servers to the datacenter. So this is an aerial view of the Microsoft datacenter in Dublin and these white units on the roof are modular datacenter units, and they probably each contain about 1000 to 2000 servers. You can see that the roof is about a third of the way built out, and so as we build it out there are going to be significantly more servers added to this datacenter. >>: The computers are in little boxes on the roof? >> Andy Curtis: They are in modular… >>: What's inside that big building? >> Andy Curtis: That's a good question. So there is also a traditional raised-floor data center within, but this is sort of a, I guess they're trying out this different architecture. If we're designing our network to handle this sort of growth, the network needs to have incremental expandability. If we don't account for the fact that our datacenter will grow and change over time, then our network could end up being a mess after several years. So what we need is a flexible datacenter network. >>: [inaudible] the same Microsoft [inaudible]? [laughter]. >> Andy Curtis: No, I'm not sure. But I don't think that this is a Microsoft data center. [laughter].
So if we consider the traditional datacenter network topologies such as the Clos, flattened butterfly, HyperX, BCube and so on, these topologies are all highly regular structures, so they are incompatible with each other and they are incompatible with legacy data centers. Let me just illustrate this with a simple example. So this is the standard fat tree. Here each switch in the network has four ports, and then these yellow rectangles represent racks of servers. So let's suppose things are going well, so over time we need to add a couple more racks of servers to this datacenter. The question here becomes, well, how do we rewire this network to support these additional servers? If we want to maintain the fat tree topology, we have to replace every single switch in this network. That is not cost effective and it could result in significant downtime, which is unacceptable in a data center environment. So we need flexible data center networks. Additionally, data center networks are hard to manage. Go ahead. >>: The main problem is that it was [inaudible] and if you have [audio begins] to build out the network that has [inaudible] ports and you can still add more ToRs and [inaudible] later and so without actually replacing the core. >> Andy Curtis: Right, so that is certainly true; I mean, I chose this example to be very bad. This is the worst case. However, even if we did sort of plan ahead for the growth, we are spending a lot of money up front that we don't necessarily need to, and Ethernet speeds are increasing faster than Moore's law, so we can sort of ride the cost curve down if we can delay deployment of additional capacity until we need it. So besides being inflexible, data center networks are hard to manage. This is primarily because of their huge scale. They can consist of up to tens of thousands of network elements, and additionally they have a multiplicity of end-to-end paths, so this is useful for providing high bisection bandwidth and high availability; however, it makes traffic management quite challenging because traditional networking protocols were not built to handle this multiplicity of end-to-end paths. So let me just summarize the challenges briefly. I've identified these two challenges: first, designing a new, upgraded, or expanded data center network is a hard problem, and second, managing this network is also challenging, and that's mostly because these networks are very different than enterprise networks. So to resolve these challenges, I have made the following contributions in my dissertation. First, I developed theory to understand heterogeneous high-performance networks. By heterogeneous I mean that the network can support switches with different numbers of ports and different link rates. Second, I've developed two optimization frameworks to design these types of data center networks. The first is a framework to design heterogeneous Clos networks, and I will get into exactly what those are in a minute, and the second designs completely unstructured data center networks, so these are arbitrary mesh networks, and I will get into why we want those. And then the third contribution I made is scalable flow-based networking in the data center, so this allows you to manage the individual flows in your network using software running on a commodity PC, and the application that we're going to use with this is to do traffic management in the data center.
So to cover my first two contributions I am going to describe these two optimization frameworks that I developed. The first I call Legup and, as I mentioned, it designs heterogeneous Clos networks. The second I call Rewire and this designs unstructured data center networks. So both of these are optimization frameworks for datacenter network design and as input they take in a budget, which is the maximum amount of money that you want to spend on the network. If you have an existing topology they can take that in so they can perform an upgrade or an expansion of it. Additionally, they take in a list of switches, which needs to have the specifications and prices of the switches available on the market. If you include modular switches you also have to include the details of the line cards, and then optionally they can take in a data center model. This is a physical model of your data center, so you can describe the rack-by-rack configuration of your data center and these frameworks will take this into account to do things like estimate the cost of links, so a link that attaches two adjacent racks should be cheaper than a link that crosses the length of the datacenter. So my frameworks take this input and they perform an optimization algorithm to find some output topology. Now when we started thinking about this problem we started with this hypothesis, and that is that by allowing switch heterogeneity we would be able to reduce cost. And the reason we made this hypothesis is because the regularity and the rigidity of existing constructions don't allow any heterogeneity in your switches, so by allowing this we believe that we can come up with more flexible networks that we can then expand and upgrade more cost-effectively. So when deciding the output topology for an optimization framework, for the first pass we decided to constrain this output to sort of a Clos-like network, and I call this the heterogeneous Clos network. Again, by heterogeneity I mean that you can use switches with different numbers of ports, that is, their radices can be different, and we can have different link rates. >>: So in thinking that heterogeneous switches would reduce costs, are you taking into account in any way the additional, or the higher cost of managing the heterogeneous set of switches? >> Andy Curtis: Right, so, I am not taking that into account. I'm taking into account the additional cost to build it, but not to manage it. Yes? >>: So when you have a DCN which has heterogeneous devices, have you considered that if you buy another device of the same type you can buy at a much lower cost than buying many different kinds of devices, especially when you need to customize each device based on your needs; it is very difficult to ask vendors to customize their models based on your needs. >> Andy Curtis: Right. So I have considered that. I don't explicitly include it in my model right now; however, you could very easily extend the framework to include that, so that you can get bulk discounts. But I don't consider it for now. I'm going to go into the theory of heterogeneous Clos networks now, and then I'll show you how to actually build these things with an algorithm. So while I'm describing this I want you to assume that we can route on these networks and that we can do load balancing perfectly across them, and then later on I will show you how to get rid of these two assumptions.
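To make the shape of that input concrete, here is a minimal sketch, in Python, of how the inputs just described might be represented; the class and field names below are illustrative assumptions on my part, not Legup's or Rewire's actual data structures.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class SwitchModel:
        """One entry in the catalog of switches available on the market."""
        name: str
        num_ports: int
        port_rate_gbps: float
        price_usd: float
        line_card_prices: dict = field(default_factory=dict)  # only needed for modular switches

    @dataclass
    class DataCenterModel:
        """Optional physical model: rack-by-rack layout used to estimate cable costs
        (a link between adjacent racks should be cheaper than one crossing the room)
        and per-rack power/thermal limits."""
        rack_positions: list          # e.g., (row, slot) for each rack
        rack_power_budget_kw: list    # how much equipment each rack can take

    @dataclass
    class DesignInput:
        budget: float                               # maximum spend on the upgrade or expansion
        switch_catalog: list                        # list of SwitchModel, with prices
        existing_topology: Optional[object] = None  # current network, if this is an upgrade
        dc_model: Optional[DataCenterModel] = None  # physical model, if available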
So to review the Clos network, it looks like this, and this is what I call a physical realization of the Clos network, because it represents the physical interconnection between switches. But it turns out that there is a more compact way that we can represent this, and that is by collapsing each of these complete bipartite subgraphs into a single edge. So we have something that looks like this. And I call this the logical topology. So here each logical edge represents this complete bipartite subgraph, and the number on it indicates the capacity of the underlying physical network. It turns out that for a Clos network the logical topology is always a tree, but I started thinking about this and I thought, well, why can't we separate this and deploy the capacity across a forest of trees? And so, now that we can split the capacity like this, the problem of designing a heterogeneous Clos network is that we are given a set of top-of-rack switches and each rack has a demand, which is the uplink rate that it would like. Here you can think of this rack wanting 4 gigabits of uplink and over there they want 64 gigabits of uplink. This rack of servers should be able to get this uplink rate regardless of the traffic matrix, so this is also called the hose model. It turns out that for this set of top-of-rack switches there are three optimal logical topologies, and by optimal I mean these topologies use the minimum amount of link capacity, sufficient and necessary, to serve these demands here, so optimality, at least in the theory, is only on link capacity. Now, for any given set of top-of-rack switches there can be a bunch of different logical topologies, so our first result is how to construct all optimal logical topologies given a set of top-of-rack switches. Then once we have these logical topologies we need to know how to actually translate them back to a physical network, and so that is our second result: given a logical topology, we find all physical realizations of it. For this logical topology, here is one physical realization of it. Question? >>: Back to the, two slides back now. What I am trying to understand is, the thing on the right is very irregular you might say, you are sending down, the guy that wants 64, you're sending 8 in one direction and 56 the other; is this showing that X1 and X2 are different types of switches, or what optimization went into that? >> Andy Curtis: X1 and X2 are just these logical nodes that represent a physical set of switches, so there can be different ways to represent these with physical switches. The intuition here is that these nodes need to send 64 units of traffic and be able to send that anywhere in the network, but because these nodes only need 4 units of traffic, we don't necessarily have to send all 64 of that connecting to these guys. If we get 4 connecting to those guys then that's enough to serve all of the different traffic matrices possible. So the sum of uplink bandwidth is the same in all of these logical constructions; it's just that we distributed it differently. Yes. >>: So are you assuming that the traffic demand is static? >> Andy Curtis: No, I am not assuming that the traffic demand is static.
I am assuming this hose model, which is actually a polyhedron of traffic matrices, so it's an infinite set of traffic matrices, and it's all the traffic matrices that are allowed under these rates here, so as long as this rack never sends or receives more than its rate, then that is a valid traffic matrix. >>: So that is the upper bound of the traffic matrix? >> Andy Curtis: Yes, so we are sort of optimizing for the worst traffic matrix possible. Again, this is the physical realization of this and here are physical realizations of the other topologies. So you can see that we just essentially spread the capacity out across a certain number of physical switches. Then the link rates are determined by the logical edges. To summarize, the first result is how to construct all logical topologies. The second is how to translate a logical topology into all of its different physical realizations. Together they give us a theorem that characterizes these heterogeneous Clos networks. As far as I am aware this is the first optimal topology construction that allows heterogeneous switches. Now this theory is nice and it is very elegant; however, it doesn't tell us how to actually build these networks in practice, because the metric for a good topology under the theory is that it uses the minimal amount of link capacity, but in practice we need to take other things into account, such as the actual cost of the devices. In practice a data center network should maximize performance while minimizing costs, and it should also be realizable in the target datacenter, so this means, for instance, that if realizing the topology requires so many switches that it's going to draw too much power, that doesn't do us any good because we can't actually build it. Finally, if we are talking about upgrading or expanding a data center network, our algorithm should be able to incorporate the existing network equipment into the network if it makes sense to do so. >>: [inaudible] proof of the theorem. I'm wondering why you are not running into a complication [inaudible] associated with bin packing or such problems. >> Andy Curtis: Okay, yeah, that's a good question. Why don't we have to worry about bin packing here? And that's again because I am assuming that the load balancing is perfect, so that we can split flows. So if you have splittable flows you don't run into this bin packing problem. You can just solve it using linear programming. Actually, solving it analytically is the way we solved it. >>: So those should be able to, you can arbitrarily split those at any switches? >> Andy Curtis: Exactly. So by arbitrarily, that's why I'm assuming this for the theory, because it makes the theory much easier. >>: Do you know any switches that can proportionally divide the load across multiple [inaudible]? >> Andy Curtis: I don't know any switch that can do that, which is why we wouldn't do that in practice, and that's why at the end of my talk I'm going to talk about traffic engineering, like doing flow scheduling, so that we can maximize throughput even with this type of topology. Yeah? >>: I'm wondering if you're taking failures into account because a lot of data center networks are built so that the failure in any one switch has minimal impact on the network as a whole, but it seems like if you have X1 and X2 and X2 is handling 90% of the traffic and X1 is only handling 10, a failure of X2 would cause a disproportional disruption.
>> Andy Curtis: As far as failures, the way we handled that is in the optimization algorithm; there are different ways you can formulate it, but I formulated it as a constraint in the optimization problem that says each rack must have this much capacity if there is a certain number of link cuts. Yeah? >>: Why don't you also have flexibility as one of the goals? Maybe that rack in the link will be eight [inaudible]? >> Andy Curtis: Right. So when I originally did Legup, I did in fact have the optimization criterion be to maximize bisection bandwidth plus flexibility, where we had some notion of flexibility, but it turns out that I think it's really hard to capture flexibility in a nice simple metric, so that's why for my second step, in Rewire, I abandoned that, because I couldn't find a good formulation of it, and I would be happy to know if you have a good one. >>: The topology looks more like an asymmetric topology. If you make any change to your topology, can you still maintain optimality of the topology, or do you have to rerun optimization [inaudible]? >> Andy Curtis: Right. I want to emphasize that this is just the theory and, you know, when we actually design these with the optimization algorithm, if you need to make changes, it tries to minimize the cost of making those changes. We take into account the cost of rewiring things and so on, so as a human, ideally my goal is to not have to think about that and let the algorithm think about that for you. >>: I see an advantage in the [inaudible] approach with flexibility of making changes [inaudible] accommodate [inaudible] devices and just come up with one topology. Can you show something that you are doing to make changes to the topology? >> Andy Curtis: Yeah, so these are more flexible in the sense that with the Clos you have one configuration, whereas for these, for that set of top of racks, we had three different logical topologies, and actually I only showed you these three configurations, but there are a bunch of ways. So because of this additional flexibility, say you need to make one change here, you have a lot of topology options that are still optimal under the link capacity constraint, whereas with the Clos you only have one arrangement. So the reason it's so much more flexible is because there is this multiplicity of topologies that are also optimal. I want to emphasize that the algorithm doesn't require the optimality of the topology; this is just what it aims for. >>: At this point do you have any examples to show that, given this traffic matrix, Clos has this topology to handle [inaudible], the heterogeneous Clos can come up with this topology to accommodate this traffic demand, and in the future if this traffic demand changes, is it cheaper to accommodate the new traffic demand than [inaudible] original cost? >> Andy Curtis: I guess I don't have a specific example for you right off the top of my head. I don't think it would be hard to find one though, and I'll show you our experiments of using this stuff at the University of Waterloo's data center and we do actually find significantly lower cost solutions. Like I said, that's a nice theory, but we now need an optimization algorithm to actually design these sorts of networks. So I'm just going to briefly go over the Legup algorithm. What it does is it performs a branch and bound search of the solution space.
Normally with branch and bound you can guarantee optimality; however, we can't quite guarantee that, because we have to use some heuristics to map switches to racks. We do this to minimize the length of cabling used, and then this algorithm does scale reasonably well. In the worst case it is exponential in the number of top of racks and the number of switch types; however, in my experiments I didn't find this behavior. For a 760-server datacenter, it took about 5 to 10 minutes to run the algorithm, and this is for the hardest input I could find. If you give it an easy input, for instance if the top-of-rack switches are homogeneous, it only takes a couple of seconds to run. For a datacenter 10 times that large it takes a couple of days, but my implementation only runs on a single core and it would be easy to parallelize or distribute this. >>: [inaudible] heuristics why don't you just add a cost factor or a constraint for [inaudible] the optimization [inaudible]? >> Andy Curtis: I did, so you're seeing the cost in the formulation of the problem. So I didn't do that, just to avoid the additional complexity of it; however, I think that given these pretty good runtimes it would probably be possible to do that, but I haven't explored it. So to summarize Legup, I developed this theory of heterogeneous Clos networks, implemented the Legup design algorithm and then I evaluated it by applying it to our data center, and I'll show you more results later after I describe Rewire, but for now I will spoil some results and say that for our datacenter it cuts the cost of an upgrade in half versus the fat tree. So let me move on to Rewire. With Rewire we are going to do away with the structure of the network entirely and design entirely unstructured networks. I'm really motivated by this question: it seemed like the structure was hurting us in a Clos network, and by allowing some amount of flexibility we could do a lot better, so if we just use an arbitrary mesh how much better can we actually do? The problem here is that now we have a really hard network design problem. The heterogeneous Clos networks are still constrained enough that we could still sort of iterate through all of the different possibilities and evaluate them, but now we have a completely arbitrary mesh, so there are many, many different networks for any given set of top of racks, so to solve this I used a simulated annealing algorithm, and the goal of this algorithm is to maximize performance, and by performance I mean bisection bandwidth minus the diameter of the network. So if you don't know what bisection bandwidth is right now, I'll explain exactly what that is in a minute, and the diameter is the worst-case shortest path between any two top of racks. I'm using diameter here as sort of a proxy for latency, because latency is actually very hard to estimate. You need to know queuing delays and so on, so that's why diameter is just a proxy for that. >>: [inaudible] different units? >> Andy Curtis: These are different units, so what I do is I scale each to be between zero and one, so for diameter, you can think that the best diameter in the network is one, one hop between all of the nodes; the worst is a path, so you can scale that to be between zero and one, and then bisection bandwidth I normalize as well, so then you can weigh each of these by however much you want. It does take some playing with the weights to get what you want, but you can tweak it.
>>: So the answer [inaudible] weighting to produce an arbitrary scale factor? >> Andy Curtis: Yes, exactly. So then Rewire maximizes the performance subject to the same constraints as Legup: subject to the budget, and your data center model if you give it one, but now we have no topology restrictions. Here the costs we take into account are the costs of any new cables you may have to buy, the cost to install or move cables, and then the cost of any new switches that you may add to the datacenter. So Rewire performs standard simulated annealing, so at each iteration it computes the performance of a candidate solution and then, if that solution is accepted, it computes the next neighbor to consider and so on, and repeats this until it's converged. Now we do have some heuristics for deciding the next neighbor to consider, but I don't have time to cover that because I want to talk about how to compute the performance of the network. It turns out there is no known polynomial-time algorithm to find the bisection bandwidth of an arbitrary network. So the bisection bandwidth is the minimum bandwidth across any cut in the network, and we can find the bandwidth of a single cut pretty easily. Let me denote the servers on one side of this cut by S, the others by S prime. Then the bandwidth of this cut is equal to the sum of the link capacity crossing that cut, divided by the minimum of the sum of server rates in S and the sum of server rates in S prime. For this specific example we have four links crossing the cut. Let's assume they are unit capacity, and then we divide by the min: S has two racks of servers, so say there are 40 servers per rack, and S prime has six racks of 40 servers, so here the bandwidth of this single cut is 4 divided by 80. Then the bisection bandwidth is the minimum bandwidth over all cuts, so on a tree-like network it's easy to compute this because we can simply enumerate over all of the cuts. There are only O(n) of them and we can compute that equation I showed on the previous slide, and we have a polynomial-time algorithm. >>: [inaudible] definition of bisection bandwidth, normally you would just compute it as the [inaudible] bandwidth traversing the cut, but you are dividing by the number of servers, so it seems like it's more like actually fair share bandwidth per server or something? >> Andy Curtis: The reason I'm dividing by the number of servers is because we can have these heterogeneous rates, so we need to take that into account: if on one half all of these servers were super high-capacity, like 10 gig links, and these had one gig, then it's not fair to just divide, or not divide, by anything. And so we need to normalize it by the amount of capacity we actually expect to cross that cut, and the reason why we divide by the min is, let's assume homogeneous rates for now: here there are two racks of servers; there, there are six. These six racks have six units of traffic, but they can't push six units of traffic across here because these guys can only receive two units, so that's why we divide by the min, and these guys likewise can only push two out. >>: Okay. I think I see that. I think I was calling it something other than bisection bandwidth because that's already [inaudible] different… >> Andy Curtis: Yeah, I think I probably should call it cut bandwidth or something, so people have a notion of what that is, right?
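For concreteness, here is a minimal sketch in Python of the cut-bandwidth calculation and the weighted objective described above. The toy topology, server rates, weights, and function names are illustrative assumptions, and the brute-force enumeration over cuts is only workable on small or tree-like networks, which is exactly why an arbitrary graph needs the flow-based formulation discussed next.

    from itertools import combinations

    # Toy topology: nodes are top-of-rack switches; rates[n] is the total server
    # bandwidth behind node n (think racks of 40 x 1 Gb/s servers). Edge values are
    # link capacities, all unit capacity here.
    rates = {"A": 40, "B": 40, "C": 120, "D": 120}
    edges = {("A", "B"): 1, ("C", "D"): 1,
             ("A", "C"): 1, ("A", "D"): 1, ("B", "C"): 1, ("B", "D"): 1}

    def cut_bandwidth(side):
        """Bandwidth of one cut: capacity crossing it / min(rate(S), rate(S'))."""
        other = set(rates) - set(side)
        crossing = sum(c for (u, v), c in edges.items() if (u in side) != (v in side))
        return crossing / min(sum(rates[n] for n in side),
                              sum(rates[n] for n in other))

    def bisection_bandwidth():
        """Brute force: minimum cut bandwidth over all cuts (exponential in general)."""
        nodes = list(rates)
        return min(cut_bandwidth(set(s))
                   for k in range(1, len(nodes))
                   for s in combinations(nodes, k))

    def performance(norm_bisection, norm_diameter, w_b=1.0, w_d=0.5):
        """Rewire-style objective: weighted normalized bisection bandwidth minus diameter."""
        return w_b * norm_bisection - w_d * norm_diameter

    print(cut_bandwidth({"A", "B"}))  # the talk's single-cut example: 4 / min(80, 240) = 0.05
    print(bisection_bandwidth())      # minimum cut bandwidth over all cuts of this toy graph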
Okay, so this is easy to compute on a tree; however, on an arbitrary graph there could be exponentially many cuts, and therefore it would take exponential time to compute this in the worst case. So if we stop to think about it, the traditional max-flow min-cut theorem allows us to find the min cut in a network by solving a flow problem, so we'd like to be able to do the same sort of thing here, but the problem is we don't have a single flow problem; we actually have a multi-commodity flow problem, because each server can be the source and the sink of flows. And in general, the max-flow min-cut theorem does not hold for multi-commodity flow problems. However, we've shown that in our special case there is a max-flow min-cut theorem, and what we've shown here is that throughput in the hose traffic model is equivalent to bisection bandwidth, so therefore if we can compute the max throughput in this traffic model, then we have found the bisection bandwidth, and it turns out some guys at Bell Labs had done this a few years ago, so combining these two results we get a polynomial-time algorithm to compute the bisection bandwidth. And then we just run the simulated annealing procedure as normal, computing this at each iteration to find the performance of the candidate solution. So I'm going to move on to my evaluation, and here the question we really want to answer is, well, how much performance do we gain because of this additional heterogeneity in the network? So to evaluate this I tested several different scenarios. First I tested upgrading Waterloo's data center network. Then I tried iteratively expanding our network, so this is, I add a certain number of servers at each iteration and use the output from one iteration as input into the next. Then I used these algorithms to design brand-new, or greenfield, datacenter networks, and then I also asked them to design a new data center network and then iteratively expand it. So this is the cost model I used. This is the cost of the switches; this is the cost of links. For switches it was very hard to get good estimates on street prices, so these are the best I could find just by Googling around. I wouldn't actually stand by these exact specific values; however, I think the relative differences are meaningful. These are the prices that we used for links. To simplify things I categorized links as short, medium or long and then charged according to the length and according to the rate, and then I charged a different cost to actually install the link, and that's because we charge this amount if you're going to move a link, and we charge that same amount so that we don't recharge you for the link itself. So to compare my algorithms against the state of the art, these are the approaches I used. I compared against a generalized fat tree; that's using the most general definition of a fat tree possible, and here I don't explicitly construct the fat tree; instead I just bound the best-case performance, so you are given a budget and I bound the best-case-performing fat tree that you can build using that budget. Second, I tested against a greedy algorithm. This algorithm just finds the link that improves the performance the most, adds it, and then repeats until it's used up its entire budget, or all ports of the network are full.
The third thing I tested against was a random graph, and that's because a group at UIUC has proposed using random graphs as data center networks, and this is due to the fact that random graphs tend to have really nice properties. So this is what our data center network looks like. We have 19 edge switches. Each of these edge switches connects to 40 servers, so we have a total of 760 servers. Our edge switches are heterogeneous already and all of our aggregation switches are the same model. They are all these HP 5406 switches; however, this is a modular switch and they do have different line cards, so they are heterogeneous as well. And this is our actual topology, so you can see that between the top of racks and aggregation we only have a single link. There is no redundancy there. Additionally, our data center handles air quite poorly. There's no isolation between the hot and cold aisles, so to model the fact that my algorithms can take thermal constraints into account, I simply allow you to add more equipment into the racks closest to the chiller, so here this rack can take, I think it's 20 kW of equipment, whereas the racks at the other end can't take nearly as much because they don't get as much cold air from the chiller. I want to emphasize that this is very much just a first-pass approach, and if your datacenter were severely thermally constrained, you would want to do something more sophisticated than this. So let me show the results of expanding the Waterloo datacenter now. So this is our original network, and here I'm showing the normalized bisection bandwidth and right below it I'm showing the diameter. And then here I'll show the number of servers that we've added. So you can see our datacenter network right now has a normalized bisection bandwidth of just over .01, and so this means that it is oversubscribed by a factor of almost 100, so in the first iteration I added 160 servers and then asked each of these algorithms to find an upgrade given a fixed budget. All of these algorithms have the same budget and across iterations the budget is kept the same. You can see that the fat tree was not able to increase the bisection bandwidth of the network while the other approaches were; however, the fat tree is able to attach the new servers to the network without decreasing the bandwidth of the network. You can see here that the greedy approach and Rewire perform the same. They both significantly increased the bisection bandwidth and actually decreased the diameter by one. Legup just increases the bisection bandwidth. Yeah? >>: Before your work, if they were going to upgrade, probably the data center managers would've just looked at it and intuited an upgrade plan annually. I'm wondering if you are comparing it to that. >> Andy Curtis: So I did ask them about an upgrade plan. Yeah, it's hard to compare against--I didn't explicitly measure what they would have done, and they told me, you know, essentially our network doesn't need this high of performance, and so I admit it's sort of, I'm applying it to a network that doesn't need this kind of bandwidth. For instance, we only have one link between our top of rack and aggregation switches, so what they told me is that the things found by Legup are probably not what they would've thought of, but they seemed like, okay, this is an interesting solution. We can probably implement this.
They had not considered networks like the ones designed by Rewire or the greedy approach, because these are arbitrary meshes and they don't want to do that because it would make it harder for them to manage their network. But, you know, I'm trying to push the frontier of what's possible here, so that's why I think these are still interesting. So as we keep iteratively expanding the network, adding more and more servers, you can see that the performance gap between Rewire and the other approaches grows, so here, after we have added 480 servers, Rewire's network has four times more bisection bandwidth than the fat tree. It does go down slightly in the next iteration and that's because we're adding more servers, so we're adding more demand to the network; however, it is able to decrease the diameter down to two, and because we have this multiobjective function, due to my weighting, it preferred decreasing the diameter over increasing bisection bandwidth. And you see that the greedy algorithm underperforms over time, so initially it did extremely well; however, it wasn't able to increase the performance of the network past this point, and that's likely because it made a poor decision here; it used up all the good ports, and over time it doesn't change where things are wired, so it locked itself into this sort of narrow solution. So then the next scenario is just asking these algorithms to build a brand-new, or greenfield, datacenter. >>: I have a question if you don't mind. So how many runs are we looking at here? You said that the greedy algorithm made a bad decision. Did you do that repeatedly or did you just… >> Andy Curtis: Oh, right. So I did do it repeatedly. I'm not showing error bars here, but across all of the experiments it seemed to do the same thing, and it is actually a deterministic algorithm, so yes, there is no… >>: [inaudible] same inputs… >>: You didn't perturb the system, the error bars didn't represent the results of having perturbed the system to give greedy an opportunity to behave differently. >> Andy Curtis: Right. I mean, that's because we are using this static input. We're using our existing data center, so we run a deterministic algorithm on it and we get the same result. >>: [inaudible] seems like there would be error bars [inaudible]. >> Andy Curtis: Right. There are no error bars on this, so the error bars would be on Rewire because it's a simulated annealing algorithm; however, for that we ran it enough times that--I didn't actually put them on this chart, but they were very small. >>: I think with the greedy algorithm criticism what I'm trying to understand is, it seems like if greedy is subject to falling into pits, then you may have shown us a case where greedy is trapped in a pit, but it isn't always behaving that way. It maybe would be interesting to ask what would happen if you perturbed the number of servers added a little bit, or just provided some source of [inaudible] input that would let you explore more of this space with greedy as well, because Rewire, as you pointed out, already has some randomness in it, so it explores more of the space. Just to understand whether the greedy, because otherwise the comparison to greedy is rather arbitrary. >> Andy Curtis: Okay. That's a fair criticism, that I didn't necessarily test enough scenarios with greedy.
In the paper there are more cases that we tested and we seem to have found the same thing. Greedy initially does well and then over time doesn't do very well. Okay. The next scenario is designing a brand-new data center. So for this we ask these algorithms to connect a network with 1920 servers. Here I am assuming that each top-of-rack switch has 48 gigabit ports; 24 of these ports connect down to servers and 24 of them are open, and the algorithms are free to build the switching fabric on top using these open ports. Here again, I am showing the same type of thing. We have bisection bandwidth on the vertical axis, the diameter here and then the different approaches. So for a quite small budget of $125 per rack, Rewire is the only algorithm that's able to build a connected network, and I want to emphasize this budget does not include the cost of the top-of-rack switch. This is only for cabling, aggregation and core switches, so the reason Rewire is able to build a connected network here but no one else is, is because the fat tree and Legup have to spend money to buy aggregation and core switches, so their networks inherently, at this low a budget, cost more. The random network has some randomness in it so it's not able to build a connected network, so that's why Rewire's the only thing that has a connected topology here. As we increase the budget, you can see that Rewire starts to significantly outperform the other approaches for all budgets except for $1000 per rack, where the random network actually has more bisection bandwidth, and this is, again, the expected bisection bandwidth for the random network, so I'm not actually explicitly building these random networks. I'm using a bound that was proven by some theoreticians on the amount of bisection bandwidth they have--so you see that even Legup significantly outperforms the fat tree designing brand-new networks, so here its network has twice as much bisection bandwidth with a thousand-dollar-per-rack budget. Rewire really outperforms the fat tree, so with the $500 per rack budget it has 68 times more bisection bandwidth. When we increase the budget to $1000 per rack it has six times more bisection bandwidth, and in this case where the random network actually beats Rewire in terms of bisection bandwidth, again, this is because I use that multiobjective function, so Rewire preferred decreasing the diameter by one rather than increasing the bisection bandwidth. >>: So in other words Rewire could've found the random solution. >> Andy Curtis: Yes. >>: It just didn't want to. [laughter]. >>: It just didn't want to. >>: You set the objective function therefore [inaudible]. >> Andy Curtis: Right. And I mean also you could seed Rewire with a random network and then ask it to improve on that, and it can improve on that. I ran some experiments. The performance gap though does seem to grow or shrink as you use higher switch radices. The higher the radix of the switch, the better the random network does. >>: Did you tell Rewire to try to optimize the thing that you are showing on the y-axis here, the bisection bandwidth? >> Andy Curtis: It's trying to optimize bisection bandwidth minus diameter. >>: Minus diameter, okay. So you're being a little unfair to your own algorithm here because you're showing its performance on a metric that you didn't solely optimize for. >> Andy Curtis: Exactly. So I'm trying to show both of those but it's hard to visualize them.
So here, yes, it does have a lower diameter, yeah. Okay, now people have brought this up already. The problem with moving towards these heterogeneous networks is management. In particular, there are a few things that are hard on an arbitrary network. Routing is difficult on an unstructured network. If we're talking about a heterogeneous Clos network, then we could make minor modifications to architectures such as PortLand and VL2 and be able to route on a heterogeneous Clos network, and this is because fundamentally these networks are still tree-like, so you just go up to the least common ancestor and back down. However, on an unstructured network, it's quite a bit harder. There is one architecture that allows you to route on unstructured networks. This is called SPAIN, by a group at HP Labs, and the way it essentially works is it partitions the network into a bunch of VLANs and then does source routing across these VLANs. For load balancing, one solution is we could schedule flows, and I have two solutions for that I'll talk about next. And then another solution is, again, SPAIN does have load balancing built in, where we do this source routing and the sources also do load balancing, and then another option is to use multipath TCP, which has been shown to be able to extract the full bisection bandwidth from random networks, so we would expect it to be able to extract the full bisection bandwidth from our unstructured networks as well. Now it is unclear how much it would actually cost to build and, over the long term, manage these arbitrary networks. However, I do believe that the performance per dollar in building the network compensates for this. I'm going to move on to my third contribution now, which is a framework to perform scalable flow-based networking in the data center. I'm going to apply this to managing flows, and the reason we want to do this sort of flow scheduling is, for one, maximizing throughput on networks like the unstructured networks I just showed, but additionally, even on highly regular topologies like a fat tree, we can have the situation where flows collide on a bottleneck link. If we just moved one of these flows over a bit, then we could actually double the throughput of both of these flows. And it's been shown by a group at UC San Diego that if you perform flow scheduling in the data center, for at least some workloads, you can get up to 113% more aggregate throughput. However, their approach depends on OpenFlow, and OpenFlow is not scalable, which I'll get into in a minute, and so therefore their approach is not scalable either. So I have two traffic management frameworks to solve this problem. The first I'm calling Mahout, and Mahout uses end hosts to classify elephant flows, which to us are long-lived, high-throughput flows; once an end host classifies an elephant flow, it is set up at a controller and the controller dynamically schedules just the elephant flows to increase the throughput. Our second solution is called DevoFlow, and I worked on this jointly with a bunch of people at HP Labs, and our goal here is to actually provide scalable software-defined networking in the data center. Software-defined networking allows us to write code that runs on a commodity server and manages the individual flows in our network. This enables a type of programmable network, because then we can just write the software to manage the flows.
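As a rough sketch of the end-host elephant classification pattern that Mahout uses, described a moment ago: the byte threshold, the flow key, and the logging in place of a controller report are illustrative assumptions, not Mahout's actual mechanism; the point is just that the end host watches per-flow bytes and tells the controller only about flows that cross a threshold.

    import time

    ELEPHANT_BYTES = 1_000_000   # illustrative threshold: flows past ~1 MB are treated as elephants

    bytes_sent = {}    # flow key (src, dst, sport, dport) -> bytes observed so far
    reported = set()   # flows already reported to the controller

    def record_send(flow, nbytes):
        """Called by an end-host shim whenever nbytes are sent on a flow."""
        bytes_sent[flow] = bytes_sent.get(flow, 0) + nbytes
        if bytes_sent[flow] >= ELEPHANT_BYTES and flow not in reported:
            reported.add(flow)
            report_elephant(flow)

    def report_elephant(flow):
        """In a real system this would notify the central controller, which would then
        schedule an explicit route for the elephant; here we just log the event."""
        print(f"{time.strftime('%H:%M:%S')}: elephant flow detected {flow}, "
              f"{bytes_sent[flow]} bytes so far")

    # Example: a shuffle-like transfer crosses the threshold and is reported exactly once.
    for _ in range(20):
        record_send(("10.0.1.5", "10.0.2.7", 45212, 5001), 64 * 1024)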
Software-defined networking is currently implemented by the OpenFlow framework, which has been deployed at many institutions across the world. You can buy OpenFlow switches from several vendors like NEC and HP. I think that OpenFlow is great. I think it's a great concept, but its original design imposes excessive overheads. Now to see why this is, let me explain how OpenFlow works. This is what a traditional switch looks like, where we have the data plane and the control plane in the same box. So the data plane just forwards packets, and the control plane exchanges reachability information and then builds routing tables based on that. However, OpenFlow separates these two, so it looks something like this, where we have a logically centralized control plane at a central controller, and then OpenFlow switches are very dumb switches that just forward packets, so any time a packet arrives at an OpenFlow switch that it doesn't have a forwarding entry for, it has to forward it to the central controller, and the controller decides how to route that flow and then it inserts forwarding table entries in all of the switches along the flow's path. So the reason we want OpenFlow in the data center is because it enables some pretty innovative network management solutions. Here is a partial list. A few of the things that we can do with OpenFlow are things like consistently enforce security policy across the network. It can be used to implement data center network architectures such as VL2 and PortLand. It can be used to build load balancers from commodity switches, and, relevant to me, it can be used to do flow scheduling to maximize throughput, or it can also schedule flows to build energy-proportional networks, so this works by scheduling the flows on the minimum number of links needed and turning off all the unnecessary equipment. So it's great that OpenFlow can do all of these things, but unfortunately it's not perfect, and the reason why it's not perfect is because it has these scaling problems, so implementing any of these solutions in a midsized data center will be quite challenging. So our contributions with this work are, first, we characterize the overheads of implementing OpenFlow in hardware, and, in particular, there's obviously a bottleneck at the central controller, since all flow setups need to go through the central controller; that creates an obvious bottleneck. But we found that that's not the real bottleneck. The real problem is at the switches themselves. We found that OpenFlow is very hard to implement in a high-performance way in the switching hardware. So to alleviate this we propose DevoFlow, which is our framework for cost-effective, scalable flow management, and then we evaluate DevoFlow by using it to perform data center flow scheduling. So I don't have enough time today to go over the overheads of OpenFlow, so I'm just going to skip to this and then go into our evaluation. >>: Can I ask just a quick question before you go on? >> Andy Curtis: Yes. >>: I wasn't sure I understood. I would have expected the problem to be that every single flow setup [inaudible] has to go through a centralized point, but you said it wasn't that, that it was hard to put OpenFlow into individual switches, but I'm not sure which part of the [inaudible] understand the statement; it seems like OpenFlow is always running in a centralized place and not always [inaudible]. >> Andy Curtis: So OpenFlow also has to run in switches.
Let me switch to my backup slides here. So I didn't really show the full picture of what this architecture looks like. We have the ASIC, which is hardware specialized for forwarding packets, but the switch also has a CPU and that's used for management functions, and so any time we do this flow setup through the ASIC, it has to go through the CPU, and the reason it goes through the CPU is because it needs to perform SSL and also TCP between the switch and the controller, and this CPU in switches today is pretty wimpy. It can't handle the load of setting up all of the flows, and so we did simulations and we said, well, obviously let's just put a bigger CPU in there and maybe that will work. We found the CPU would need to be two orders of magnitude faster than what is currently implemented in HP's hardware at least. We're not sure about other manufacturers, but for HP it would need to be two orders of magnitude faster. >>: [inaudible] talking to the centralized controller? >> Andy Curtis: Yeah. >>: So what kind of hardware is the centralized controller? >> Andy Curtis: Essentially the thing is that it could be a server or whatever you want. >>: So you're saying the switch control [inaudible] CPUs are so slow that even though there are 200 times as many of them they are the bottleneck. >> Andy Curtis: Yes. Unless you can go, I mean, for hundreds of thousands of servers then the centralized controller could be a bottleneck, but you can still distribute it to alleviate the pressure. >>: [inaudible] really quick, how did you determine that [inaudible] CPU load is 100% over… >> Andy Curtis: Right, so the CPU load is 100%, and we measured, yeah, you can measure that the CPU load is 100%, and so we performed this experiment where we try to just serially set up flows, and we found that the 5406 switch could only set up 275 flows per second, and then the CPU was at 100%. Yeah. >>: So is it setting up a new SSL connection every time? >> Andy Curtis: No, it sets that up just once. I mean, it's not that dumb to do it every time, but it's still very high overhead at the control point in the switch. I mean, to show you, we can expect bursts of up to 10,000 flows at an edge switch in the data center, so that's 40 times more than the switch can currently handle, and it does create a lot of latency in this flow setup, so we measured the amount of time it took to just set up a flow and it could take 2 milliseconds. >>: Why is the [inaudible] so severely handicapped? >> Andy Curtis: I mean, part of the problem is that these are commodity switches. Like if you were to go out and buy a high-end router, I mean, they could probably get rid of a lot of these problems. The second problem is that these switches right now aren't designed for OpenFlow, so they're designed to do normal switching stuff and then they're adding OpenFlow on top of it, and the major vendors now are not going to let OpenFlow drive their switch development for at least five years or so. I mean, I do think it's a great opportunity for some startup to come in and build specialized OpenFlow switches. >>: [inaudible] is it [inaudible]? >> Andy Curtis: Just an edge switch. So yeah, in the data center measurements, they show that you can have bursts of up to 10,000 flows per second. So the way that DevoFlow works is we want to find the sweet spot between fully distributed control and fully centralized control.
So DevoFlow stands for devolved OpenFlow, and the idea is that we're going to devolve control of most flows back to our switches. So our design goals were these. We want to keep most flows in the data plane, and that's to avoid the latency and the overheads of setting them up at the central controller. Then we want to maintain just enough visibility for effective flow management, so we only want to maintain visibility over the flows that matter to you, not all of the flows. And then our third goal is to actually simplify the design and implementation of high-performance switches, because, as I said, it's difficult to do this in OpenFlow right now. You need a fast switch CPU. If you really increase the speed of the CPU you may have to rearchitect the switch itself, so we want to do away with all of that by keeping most flows in the data plane. So the way we do this is through a few different mechanisms. We propose some control mechanisms and some statistics-gathering mechanisms. Now one thing I didn't talk about is that collecting the statistics from flows also has quite high overhead, and that's because OpenFlow offers one way to gain visibility of your flows, and that is you can ask the switch for all of the forwarding table counters and say, how many bytes did each flow transfer since I last polled you. This is very high overhead because you have to get the statistics for every single flow out of the switch. So for the control mechanisms we propose, the first we call rule cloning. And the idea here is that the ASIC itself in the hardware should be able to clone a wildcard rule. OpenFlow has two different types of forwarding rules. It has wildcard rules, which can have wildcards, and exact-match rules. So the way rule cloning works is, if a flow arrives and it matches, say, this rule, then it can duplicate that and add a specific entry for that flow into the exact-match table. The reason we want to do this is so that we can gain visibility over these flows. We can poll the statistics and just gain visibility over the flows that we cloned. Then the other things that we proposed are some local actions, so rapid rerouting, so you can specify fallback ports and so on if a port failure happens, and then we also proposed some multipath extensions, so this allows you to select the output port for a flow according to an arbitrary probability distribution, so if a flow arrives, you can select its output according to some distribution such as this, and this allows you to do static load balancing on mesh topologies. >>: Are you proposing [inaudible]? >> Andy Curtis: Implementing this is a little bit harder. The thing is, we haven't implemented DevoFlow in hardware, but for all of these things we talked to HP's ASIC designers and they assured us that this should be relatively easy to do in the hardware. >>: [inaudible]. >> Andy Curtis: Right, so I'm not saying per packet, I'm saying per flow. >>: That's the trick right there. If you're not doing per packet [inaudible] numbers, okay, the desired statistics becomes really hard because then, you know, who knows what's going to happen. If you do it per packet it's easier to get the distribution you want. You start doing per flow and then you have to estimate flow sizes and things like that. >> Andy Curtis: Right. We are not taking that into account, but the theory says that if there are lots of mice, then this gives you optimal load balancing.
If there are lots of elephant flows then you can run into these problems that you mentioned. And I'll get into that, so we're actually going to do some flow scheduling to schedule the elephant flows exclusively. The statistics-gathering mechanisms that we proposed are, first of all, you can just turn on sampling. Most commodity switches already have sampling, and in our experiments we found that this does give enough visibility over the flows. Another is triggers and reports, so you can set a rule that if a forwarding table rule has forwarded a certain number of bytes, then it will set that flow up at the central controller, so this allows you to gain visibility over just elephant flows. And then there is an additional way, where we can add approximate counters which allow you to track all the flows matching a wildcard rule. This one is much harder to implement in hardware, so I won't use it for my evaluation. But the idea here is that unlike OpenFlow, we want to provide visibility over a subset of flows instead of all the flows. So I mentioned that we haven't implemented it, but we can use existing functional blocks in the ASIC for most mechanisms. So DevoFlow provides you the tools to scale your software-defined networking applications; however, it still might be quite challenging to scale them. Each application will be different. The idea is essentially you need to define some sort of notion of a significant flow for your application. In flow scheduling, which I am going to show, it's easy to find the significant flows. They are just elephant flows. For other things like security, it may be more challenging, and so that's why for now I'm only going to show how to do flow scheduling with DevoFlow. So the idea here is that new flows that arrive are handled entirely within the data plane by using these multipath forwarding rules, and then the central controller uses sampling or triggers to detect elephant flows, and the elephant flows are dynamically scheduled by the central controller, and the scheduling is done using a bin packing algorithm. So in our evaluation we want to answer this question: how much can we lower the overheads of OpenFlow while still achieving the same performance as a fine-grained flow scheduler? We found that if you perform this flow scheduling, you can increase the throughput 37% for a shuffle workload on a Clos network, 55% on a 2-D hypercube, and these numbers really depend on the workload, so I also reverse engineered a workload published by Microsoft Research, and I didn't see any sort of performance improvement using this workload. And the bigger reason for this is that there's no sort of burst of traffic in it. It's just sort of flat, as I reverse engineered this from the flow inter-arrival times and the distribution of flow sizes. >>: I understand that evaluation. You are saying that you wanted to reduce the overhead on the control plane while keeping the same performance, but your results are increases in performance? >> Andy Curtis: Right. I'm going to get into that. I just want to show you, if you do flow scheduling, these are kind of the increases in performance you can get with it, versus ECMP, using just static load balancing. >>: [inaudible] fine-grained flow scheduling versus random ECMP essentially. >> Andy Curtis: Yeah, so this is fine-grained over DevoFlow versus randomized with ECMP. Here are the results of the overheads.
Here are the results on the overheads. The vertical axis shows the number of packets per second sent to the central controller. These are simulations where we simulated OpenFlow based on our measurements of a real switch. So you can see that if we used OpenFlow's stats-polling mechanism to collect statistics, then we had about 7,700 packets per second going to the controller. If we used DevoFlow mechanisms such as sampling or the threshold trigger to gain visibility over the flows, then we can reduce the number of flows, excuse me, the number of packets to the controller by one to two orders of magnitude, and again, this is just because we're only gaining visibility over the elephant flows here. At least with the thresholds it's only the elephant flows; with sampling we are collecting samples from every flow's packets, but that's still lower than collecting the statistics for every single flow. >>: What is the cost in performance? >> Andy Curtis: There is no cost in performance. For the same performance, this is the decrease in overheads. And this is showing the number of flow table entries at the average edge switch. Here with OpenFlow we have over 900 flow table entries; with DevoFlow we can reduce that by 75 to 150 times, and this is because most flows are routed using a single multipath forwarding rule and we only need to add specific exact-match entries for the elephant flows. >>: How are you evaluating [inaudible], because this wasn't implemented; it was simulated, right? >> Andy Curtis: Right, so it's simulated. I basically simulated a lot of different scenarios and sort of did a binary search to get the performance the same. >>: [inaudible]. >> Andy Curtis: The performance is aggregate throughput. >>: Okay. >> Andy Curtis: So here on this workload we're getting the same aggregate throughput using the fine-grained scheduler as with DevoFlow, and these are the overheads. >>: Okay. So you're assuming a traffic pattern of some kind, and you're simulating what decisions were made as to which port gets which flow, and you're figuring out how much aggregate throughput there was. >> Andy Curtis: Right. >>: So then you're not simulating a [inaudible] and packets or… >> Andy Curtis: No. We are doing fluid-level simulations here. >>: Okay. Because TCP sometimes has cliff performance, where just dropping twice as many packets might give you 1/10 the bandwidth. >> Andy Curtis: Right. So these are sort of flow-level simulations here. >>: So is this DevoFlow similar to [inaudible] in the sense that you only schedule the big flows, except Hedera did that [inaudible] did this thing in the switch? >> Andy Curtis: So Hedera, this is essentially Hedera here, because they are using OpenFlow's statistics-polling mechanisms to gain visibility over the elephant flows. So this is what Hedera does, and then DevoFlow is using our more efficient statistics-gathering mechanisms to only look at the flows that matter, the elephant flows. So to summarize DevoFlow: we first characterized the overheads of OpenFlow, then we proposed DevoFlow to give you the tools you need to reduce the reliance of your software-defined networking applications on the control plane, and then we showed that at least for one application, flow scheduling, it can reduce overhead by one to two orders of magnitude. I want to just briefly summarize the cost savings that are possible because of my results here.
The network, and this is just the network equipment, is 5 to 15% of the total cost of ownership of a datacenter. Legup can basically cut the cost of your network in half, and Rewire can cut it by as much as an order of magnitude, so you can significantly save on the total cost of ownership of your datacenter with these two approaches. And then server utilization is often low because of network limitations, so if you can extract more bisection bandwidth from your network, then you might be able to deploy fewer servers. Since the servers are the majority of the cost of your datacenter, you may be able to get some cost savings there as well. So to go over my future work, I want to do quite a few things. First of all, I would like to work on a few more theory-type results, which I think are really interesting. I did use expander graphs as a data center network; expander graphs, if you're not familiar with them, are graphs that are essentially rapidly mixing, and in the Rewire work we switched out our objective function: instead of saying, okay, maximize bisection bandwidth, we asked it to maximize the spectral gap of the graph, which is a notion of a good expander. We found that these graphs actually performed extremely well, so I'm really interested; I think there is a connection between the expansion properties of the graph and the bisection bandwidth, and I'm going to be interested in exploring that further. There is quite a bit of systems work I want to do. I think it would be really cool to go out and build an architecture specifically designed for unstructured data center networks. Like I said, HP has one, but I don't think it's the last word on that. I think there are a lot of problems still open in managing, you know, multiple data centers; I'm interested in inter-data center networks. Another thing: I have an ongoing project on deadline-aware big data analytics, adding deadlines to these big data type queries. Another thing I'm interested in is green networking systems; we have submitted a paper on reducing the carbon emissions of internet-scale services. So if you think about my work so far, I have worked on the datacenter infrastructure part of things, and I sort of want to move up the stack. I want to move up to work on cloud computing and big data analytics, but more than that I want to work on applications on top of big data analytics, on top of all these things. I want to work on things like: how can we upgrade our cities? How can we apply these same sorts of analytical and theoretical techniques to jointly design smart grids, transportation systems, and also city services like police and fire services? And one thing I would like to apply these techniques to is making legacy buildings zero-energy, so people have ways right now of constructing buildings that use no energy whatsoever, they are completely carbon neutral, and I'd like to develop low-cost ways to retrofit existing buildings to be the same way. And I think this is really a grand challenge, because in the next 40 years we can expect 2 to 4 billion people to be moving into cities, so our cities are going to grow tremendously; the number of cars on the road will double. If we don't have smart people thinking of ways to solve these problems, then we are going to have overcongested roads and too many people moving into crummy neighborhoods with bad infrastructure.
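[Editor's note: a small illustration, not from the talk, of the spectral-gap objective mentioned above for Rewire. Here the spectral gap is taken as the second-smallest eigenvalue of the normalized Laplacian, which is one common definition; Rewire's exact formulation may differ. The 4-node topologies are toy examples.]

```python
import numpy as np

def spectral_gap(adj: np.ndarray) -> float:
    """Second-smallest eigenvalue of the normalized Laplacian
    L = I - D^{-1/2} A D^{-1/2} of the graph with adjacency matrix adj."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt
    eigvals = np.sort(np.linalg.eigvalsh(lap))
    return float(eigvals[1])  # eigvals[0] is always 0 for a connected graph

# Toy comparison: a 4-node ring versus a fully connected 4-node mesh;
# the better-connected graph has the larger gap.
ring = np.array([[0, 1, 0, 1],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [1, 0, 1, 0]], dtype=float)
full = np.ones((4, 4)) - np.eye(4)
print(spectral_gap(ring), spectral_gap(full))
```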
One project I've already worked on here is looking at the return on investment for taxi companies transitioning to electric vehicles, and we find that with today's gas and electricity prices it's actually profitable for taxi companies to move to electric vehicles right now. To conclude, I developed a theory of high-performance heterogeneous interconnection networks. I built two datacenter network design frameworks, Legup and Rewire, and my evaluation of these shows that they can significantly reduce the cost of data center networks. And then I proposed DevoFlow for cost-effective, scalable flow management. And that's it. Are there any questions? [applause]. Yes? >>: So when you were talking about the evaluation of Rewire, you said something really interesting. I was asking if you had compared it against a manually designed network, and as part of the answer you said, well, the guys who saw the Rewire solution said we hadn't thought of doing it that way, but we would never actually do it because it would be difficult to manage. So I'm wondering, are these physically realizable? Are they practical? >> Andy Curtis: I think they are. >>: They didn't think they were. [laughter]. >> Andy Curtis: I mean, IT guys are resistant to change, so the cabling would be a problem. At smaller scale, like in a container, it would be doable, and then you would probably have to have a different solution, or you could run Rewire at the inter-container scale. And then I think the really hard problem right now is how do you route on this big an arbitrary mesh, and that is going to take some work. Yes. But the purpose of my work in doing this is to see what's possible. If we get away from what we're doing right now and we think ahead, you know, how much better could the networks we build be? That's really the question that motivated that work. Yes? >>: [inaudible] DevoFlow thing, so you talked about the security thing, and so with OpenFlow, when they first talked about this [inaudible] thing, they said [inaudible] control. That's something that OpenFlow can provide; it's a great thing. But then they talked about mostly the enterprise networks, and so I assume that in the datacenter domain maybe that kind of [inaudible] network is hard to control. Is it not that important, so maybe that kind of [inaudible] in the future is not that necessary, so maybe, you know, it's just fine to do some kind of, like, control plane pushing some of this [inaudible] to switches and so on, so that you can reduce this overhead [inaudible]. >> Andy Curtis: I want to emphasize that we weren't the first people to say let's use OpenFlow in the data center. Other people have done that before us, and so we looked at all that work and we said, well, look, this is really cool work, but will it work? And that was sort of the goal of the DevoFlow work: to give people ways to actually use OpenFlow in the data center. I mean, I don't necessarily think that OpenFlow is the right solution for security in the data center and… >>: Why's that? >> Andy Curtis: Because, as you mentioned, per-flow security in the data center might not be doable, and so we may need other things, but one way you could use DevoFlow is, not on a per-flow but on a categorical basis, to apply some sort of policy that routes your traffic through a set of middleboxes that apply these features. That could be doable using these wildcard rules. Yeah? >>: I have a question on your DevoFlow evaluation.
So you said you found the [inaudible] is the CPU on each switch for flow setup, and that limits the number of [inaudible]. But then the evaluation was looking at metrics like packets per second of statistics and flow table sizes and things like that. Were you trying to use these as a proxy for flow setup, or was there a missing link there that I didn't get? >> Andy Curtis: Yes, we were trying to use that as a proxy for flow setup. If we are setting up an order of magnitude fewer flows through the controller, that means an order of magnitude fewer flows are being set up through the switch CPU as well. >> Ratul Mahajan: If there are no other questions, please thank our speaker again. [applause].