>> Ratul Mahajan: Hi. Good morning everybody. Thanks for coming. It's a great pleasure
today to have Andy Curtis who is visiting us from the University of Waterloo and he's going to
tell us about a bunch of exciting stuff that he's done at Waterloo and with some HP
collaborators around managing and operating data center networks.
>> Andy Curtis: Thanks. Good morning everyone. Today I'm going to show you how to reduce
the cost of operating a data center network by up to an order of magnitude. Now because the
data center network is 10 to 15% of the total cost of operating a datacenter, this can result in
pretty significant cost savings. The datacenter has been described as the new computer and
the network plays a crucial role in this new computer. It interconnects the compute nodes, so
we need a high-performing network. In particular, network performance is critical for doing
things like dynamically allocating servers to services. This allows you to dynamically grow and
shrink the pool of servers assigned to a service, so you can maximize your server utilization. If you
don't have enough bandwidth in the network, then you need to statically assign enough servers
to your service to handle peak load, and this results in very underutilized servers which then
costs more because you have to buy more servers. Another thing that a high-performing
network is useful for is doing things like quickly migrating virtual machines and this is also
useful for service level load balancing, and if the network doesn't have enough bandwidth it can
be a serious bottleneck in the performance of big data analytics frameworks. For example, in
the shuffle phase of a MapReduce job, 2 terabytes of data can be transferred across the network.
Now when designing any sort of network we need to take into account the constraints and
goals of the target environment. So the datacenter has a few new things to it. First of all it is a
huge scale. The network needs to be able to interconnect hundreds of thousands of servers
and with very high bandwidth, as I just mentioned. An additional, less often considered requirement
is that the network needs to handle the addition of servers to the datacenter over time.
So this is an aerial view of the Microsoft datacenter in Dublin and these white units on the roof
are modular datacenter units, and they probably each contain about 1000 to 2000 servers. You
can see that the roof is about a third of the way built out, and so as we build it out there are
going to be significantly more servers added to this datacenter.
>>: The computers are in little boxes on the roof?
>> Andy Curtis: They are in modular…
>>: What's inside that big building?
>> Andy Curtis: That's a good question. So there is also a traditional raised floor data center
within, but this is sort of a, I guess they're trying out this different architecture. If we're
designing our network to handle this sort of growth, the network needs to be able to have
incremental expandability. If we don't account for the fact that our datacenter will grow and
change over time, then our network could end up being a mess after several years. So what we
need is a flexible datacenter network.
>>: [inaudible] the same Microsoft [inaudible]? [laughter].
>> Andy Curtis: No, I'm not sure. But I don't think that this is a Microsoft data center.
[laughter]. So if we consider the traditional datacenter network topology such as the Clos,
flattened butterfly, HyperX, BCube and so on, these topologies are all highly regular structures,
so they are incompatible with each other and they are incompatible with legacy data centers.
Let me just illustrate this with a simple example. So this is the standard fat tree. Here each
switch in the network has four ports, and then these yellow rectangles represent racks of
servers. So let's suppose things are going well, so over time we need to add a couple more
racks of servers to this datacenter. The question here becomes, well, how do we rewire this
network to support these additional servers. If we want to maintain the fat tree topology, we
have to replace every single switch in this network. That is not cost effective and it could result
in a significant downtime which is unacceptable in a data center environment. So we need
flexible data center networks. Additionally, data center networks are hard to manage. Go
ahead.
>>: The main problem is that it was [inaudible] and if you have [audio begins] to build out the
network that has [inaudible] ports and you can still add more cores and [inaudible] later and
so without actually replacing the core.
>> Andy Curtis: Right, so that is certainly true that, I mean, I chose this example to be very bad.
This is the worst case. However, even if we did sort of plan ahead for the growth, we are
spending a lot of money up front that we don't necessarily need to, and Ethernet speeds are
increasing faster than Moore's law, so we can sort of ride the cost curve down if we can delay
deployment of additional capacity until we need it. So besides being inflexible, data center
networks are hard to manage. This is primarily because of their huge scale. They can consist of
up to tens of thousands of network elements, and additionally they have a multiplicity of
end-to-end paths, so this is useful for providing high bisection bandwidth and high availability,
however, it makes traffic management quite challenging because traditional networking
protocols were not built to handle this multiplicity of end to end paths. So let me just
summarize the challenges briefly. I've identified two challenges: first, designing a new
upgraded or expanded data center network is a hard problem, and the second is then managing
this network is also challenging, and that's mostly because these networks are very different
than enterprise networks. So to resolve these challenges, I have made the following
contributions in my dissertation. First, I developed theory to understand heterogeneous high-performance networks. By heterogeneous I mean that the network can support switches with
different numbers of ports and different link rates. Second, I've developed two optimization
frameworks to design these types of data center networks. The first is a framework to design
heterogeneous Clos networks, and I will get into exactly what those are in a minute, and the
second designs completely unstructured data center networks, so these are arbitrary mesh
networks, and I will get into why we want those. And then the third contribution I made is to
propose scalable flow-based networking in the data center, so this allows you to manage the
individual flows in your network using software running on a commodity PC, and the
application that we're going to use with this is to do traffic management in the data center. So
to describe my first two contributions to you I am going to describe these two optimization
frameworks that I developed. The first I call Legup, and as I mentioned it designs
heterogeneous Clos networks. The second I call Rewire, and it designs unstructured data
center networks. So both of these are optimization frameworks for datacenter network design
and as input they take in a budget, which is the maximum amount of money that you want to
spend on the network. If you have an existing topology they can take that in so they can
perform an upgrade or an expansion of it. Additionally, they take in a list of switches, and this
needs to have the specifications and prices of the switches available on the market. If you
include modular switches you also have to include the details of line cards, and then optionally
they can take in a data center model. This is a physical model of your data center so you can
describe the rack by rack configuration of your data center and these frameworks will take this
into account to do things like estimate the cost of links, so like a link that attaches to two
adjacent racks should be cheaper than a link that crosses the length of the datacenter. So my
frameworks take this input and they perform an optimization algorithm to find some output
topology. Now when we started thinking about this problem we started with this hypothesis,
and that is that by allowing switch heterogeneity we would be able to reduce cost. And the
reason we made this hypothesis is because the regularity and the rigidity of existing
constructions don't allow any heterogeneity in your switches, so by allowing this we believe
that we can come up with more flexible networks that we can then expand and upgrade more
cost-effectively. So when deciding the output topology for our optimization framework, for the
first pass we decided to constrain this output to sort of a Clos-like network and I called this the
heterogeneous Clos network. Again, by heterogeneity I mean that you can use switches with
different numbers of ports, that is, their radices can be different, and we can have different
link rates.
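For concreteness, the inputs these frameworks take could be sketched roughly like this (a reader's sketch in Python; the field names are illustrative and not Legup's or Rewire's actual interface):

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class SwitchModel:
        """One entry in the switch list: a model available on the market."""
        name: str               # e.g. a fixed switch, or a chassis plus a line-card option
        num_ports: int          # radix
        port_rate_gbps: float   # link rate of its ports
        price_usd: float

    @dataclass
    class DesignInput:
        budget_usd: float                           # maximum amount to spend
        switch_catalog: List[SwitchModel]           # switches (and line cards) on the market
        existing_topology: Optional[object] = None  # current network, if upgrading or expanding
        datacenter_model: Optional[object] = None   # rack-by-rack physical layout, used to
                                                    # estimate cable costs and thermal limits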
>>: So in thinking that heterogeneous switches would reduce costs, are you taking into account
in any way that the additional, or the higher cost of managing the heterogeneous set of
switches?
>> Andy Curtis: Right, so, I am not taking that into account. I'm taking into account the
additional cost to build it, but not to manage it. Yes?
>>: So when you have a DCN which has heterogeneous devices, have you considered that if
you buy another device of the same type you can buy it at a much lower cost than buying many
different kinds of devices? Especially when you need to customize each device based on your
needs, it is very difficult to ask multiple vendors to customize their devices based on your needs.
>> Andy Curtis: Right. So I have considered that. I don't explicitly include it in my model right
now; however, you could very easily extend the framework to include that, where you can
buy at bulk discounts. But I don't consider it for now. So I'm going to go
into describing the theory of heterogeneous Clos networks now, and then I'll show you how to
actually build these things with an algorithm. So while I'm describing this I want you to assume
that we can route on these networks and that we can do load balancing perfectly across them,
and then later on I will show you how to get rid of these two assumptions. So to review the
Clos network, it looks like this and this is what I call a physical realization of the Clos network,
because it represents the physical interconnection between switches. But it turns out that
there is a more compact way that we can represent this, and that is by collapsing each of these
bipartite, complete bipartite subgraphs into a single edge. So we have something that looks
like this. And I call this the logical topology. So here each logical edge represents this complete
bipartite subgraph, and the number on it indicates the capacity of the underlying physical
network. It turns out that for a Clos network the logical topology is always a tree, but I started
thinking about this and I thought well why can't we separate this and deploy the capacity across
a forest of trees? And so the problem becomes now if we can split the capacity like this, the
problem of designing a heterogeneous Clos network is that we are given a set of top of rack
switches and each rack has a demand, which is the uplink rate that it would like. Here you
can think of it as this rack wants 4 gigabits of uplink and over there they want 64 gigabits of uplink. This
rack of servers should be able to get this uplink rate regardless of the traffic matrix, so this is
also called the hose model. It turns out that for this set of top of rack switches there are three
optimal logical topologies, and by optimal I mean these topologies use the minimum amount of
link capacity sufficient and necessary to serve these demands here, so optimality, at
least in theory, is only on link capacity. Now, for any given set of top of rack switches
there can be a bunch of different logical topologies, so our first result is how to construct all
optimal logical topologies given a set of top of rack switches. Then once we have these logical
topologies we need to know how to actually translate them back to a physical network, and so
our second result is that given a logical topology we find all physical realizations of it. For
this logical topology here is one physical realization of it. Question?
>>: Back to the, two slides back now. I am trying to understand is if, the thing on the right is
very irregular you might say, you are sending down, the guy that wants 64, you're sending 8 in
one direction and 56 the other, is this showing that X1 and X2 are different types of switches or
what optimization went into that?
>> Andy Curtis: X1 and X2 are just these logical nodes that represent a physical set of switches,
so there can be different ways to represent these with physical switches. The intuition here is
that these nodes need to send 64 units of traffic and be able to send that anywhere in the
network, but because these nodes only need 4 units of traffic, we don't necessarily
have to send all 64 of that to these guys. If we get 4 connecting to those guys, then
that's enough to serve all of the different traffic matrices possible. So the sum of uplink
bandwidth is the same in all of these logical constructions; it's just that we distributed it
differently. Yes.
>>: So are you assuming that the traffic demand is static?
>> Andy Curtis: No, I am not assuming that the traffic demand is static. I am assuming that this
hose model, which is actually a polyhedron of traffic matrices, so it's an infinite set of traffic
matrices and it's all the traffic matrices that are allowed under these rates here, so as long as
this rack never sends or receives more than its rate, then that is a valid traffic matrix.
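In symbols, the hose model polyhedron being described is the following (a reader's sketch in our own notation, not from the slides): if rack i has hose rate r_i, the admissible traffic matrices are

    \[
      \mathcal{T} \;=\; \Bigl\{\, (t_{ij}) \;:\; t_{ij} \ge 0,\;\;
        \textstyle\sum_{j} t_{ij} \le r_i \;\;\forall i,\;\;
        \textstyle\sum_{i} t_{ij} \le r_j \;\;\forall j \,\Bigr\},
    \]

and a topology is feasible if it can carry every matrix in this infinite set, which is exactly the "never send or receive more than your rate" condition.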
>>: So that is the upper bound of the traffic matrix?
>> Andy Curtis: Yes, so we are sort of optimizing for the worst traffic matrix possible. Again,
this is the physical realization of this and here are physical realizations of the other topologies.
So you can see that we just essentially spread the capacity out across a certain number of
physical switches. Then the link rates are determined by the logical edges. To summarize, the
first result is how to construct all logical topologies. The second is how to translate a logical
topology into all of its different physical realizations. Together they give us a theorem that
characterizes these heterogeneous Clos networks. As far as I am aware this is the first optimal
topology construction that allows heterogeneous switches. Now this theory is nice and it is
very elegant, however, it doesn't tell us how to actually build these networks in practice
because the metric for a good topology under the theory is that it uses the minimal amount of
link capacity, but in practice we need to take other things into account such as the
actual cost of the devices. In practice a data center network should maximize performance
while minimizing costs, should also be realizable in the target datacenter, so this means that for
instance, if in order to realize the topology we need so many switches that it's going
to draw too much power, that topology doesn't do us any good because we can't actually build it. Finally, if
we are talking about upgrading or expanding a data center network, our algorithm should be
able to incorporate the existing network equipment into the network if it makes sense to do so.
>>: [inaudible] proof of the theorem. I'm wondering why you are not running into a
complication [inaudible] associated with bin packing or such problems.
>> Andy Curtis: Okay, yeah, that's a good question. Why don't we have to worry about bin
packing here? And that's again, because I am assuming that the load-balancing is perfect so
that we can split flows. So if you have splittable flows you don't run into this bin packing
problem. You can just solve it using linear programming. Actually, the way we solved it is analytically.
>>: So flows should be able to, you can arbitrarily split flows at any switch?
>> Andy Curtis: Exactly, you can split arbitrarily; that's why I'm assuming this for the theory, because
it makes the theory much easier.
>>: Do you know any switches that can proportionally divide the load across multiple
[inaudible]?
>> Andy Curtis: I don't know any switch that can do that, which is why we wouldn't do that in
practice, and that's why at the end of my talk I'm going to talk about traffic engineering, like
doing flow scheduling, so that we can maximize throughput even with this type of topology.
Yeah?
>>: I'm wondering if you're taking failures into account because a lot of data center networks
are built so that the failure in any one switch has minimal impact on the network as a whole,
but it seems like if you have X1 and X2 and X2 is handling 90% of the traffic and X1 is only
handling 10, a failure of X2 would cause a disproportionate disruption.
>> Andy Curtis: As far as failures, the way we handled that is in the optimization algorithm;
there are different ways you can formulate it, but I formulated it as a constraint in the
optimization problem that says each rack must have this many, this much capacity if there is a
certain number of link cuts. Yeah?
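One plausible way to write that survivability constraint (a sketch of the idea only; Legup's exact formulation may differ): for every rack r with demand d_r, and every set F of at most k failed links, the surviving uplink capacity must still cover some fraction alpha of the demand,

    \[
      \sum_{e \in \delta(r) \setminus F} c_e \;\ge\; \alpha\, d_r
      \qquad \forall r,\ \forall F \subseteq E,\ |F| \le k,
    \]

where \delta(r) is the set of uplinks of rack r and c_e is the capacity of link e.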
>>: Why don't you also have flexibility as one of the goals? Maybe that rack in the link will be
eight [inaudible]?
>> Andy Curtis: Right. So when I originally did Legup, the optimization criterion was in fact to
maximize bisection bandwidth plus flexibility, where we had some notion of flexibility,
but it turns out that I think it's really hard to capture flexibility in a nice simple metric, so that's
why for my second step in Rewire I abandoned that because I couldn't find a good formulation
of that and I would be happy to know if you have a good one.
>>: The topology looks more like an asymmetric topology. If you make any change to your
topology, can you still maintain optimality of the topology, or do you have to rerun optimization
[inaudible]?
>> Andy Curtis: Right. The changes, I want to emphasize that this is just the theory and, you
know, when we actually design these with the optimization algorithm, it does try, if you need to
make changes, it tries to minimize the cost of making those changes. It tries to, we take into
account the cost of rewiring things and so on, so as a human ideally my goal is to not have to
think about that and let the algorithm think about it for you.
>>: I see an advantage in the [inaudible] approach with flexibility of making changes [inaudible]
accommodate [inaudible] devices and just come up with one topology. Can you show
something that you are doing to make changes to the topology?
>> Andy Curtis: Yeah, so you can flexibly, so these are more flexible in the sense that you have
with the Clos you have one configuration, so for these I've actually just shown one way of
physically, for the set of top of racks, we had three different logical topologies and actually I
only showed you these three configurations, but there are a bunch of ways. So because of this
additional flexibility, like say you need to make one change here, you have a lot of topology
options that are still optimal under the link capacity constraint, whereas with the Clos
you only have one arrangement. So that's why it's so much more flexible is because there is
this multiplicity of topologies that are also optimal. I want to emphasize that the algorithm
doesn't require the optimality of the topology; this is just what it aims to do.
>>: At this point you have any examples to show that given this traffic matrix you can, Clos has
this topology to handle [inaudible] the heterogeneous Clos can come up with this topology to
accommodate this traffic demand, and in the future if this traffic demand changes, is it cheaper
to accommodate the new traffic demand than [inaudible] original cost?
>> Andy Curtis: I guess I don't have a specific example for you right off the top of my head. I
don't think it would be hard to find one though, and I'll show you our experiments of using this
stuff at the University of Waterloo's data center and we do actually find significantly lower cost
solutions. Like I said, that's a nice theory, but we now need an optimization algorithm to
actually design these sorts of networks. So I'm just going to briefly go over the Legup algorithm.
What it does is it performs a branch and bound search of the solution space. Normally with
branch and bound you can guarantee optimality; however, we can't quite guarantee that,
because we have to use some heuristics to map switches to racks. We do this to
minimize the length of cabling used. This algorithm does scale reasonably well. In the
worst case it is exponential in the number of top of racks and the number of switch types;
however, in my experiments I didn't find this behavior. For a 760 server datacenter, it took
about 5 to 10 minutes to run the algorithm and this is for the hardest input I could find. If you
give it an easy input, for instance, if the top of rack switches are homogeneous, it only takes a
couple of seconds to run. For a datacenter 10 times that large it takes a couple of days, but my
implementation only runs on a single core and it would be easy to parallelize or distribute this.
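To make the structure of such a search concrete, here is a minimal best-first branch and bound skeleton (a reader's sketch in Python; the bounding, expansion, and switch-to-rack completion functions are placeholders, not Legup's actual heuristics):

    import heapq

    def branch_and_bound(root, optimistic_bound, complete, expand, performance):
        """Best-first branch and bound over partial network designs."""
        best, best_val = None, float("-inf")
        frontier = [(-optimistic_bound(root), 0, root)]   # max-heap via negated bounds
        tiebreak = 1
        while frontier:
            neg_bound, _, node = heapq.heappop(frontier)
            if -neg_bound <= best_val:
                continue                                  # prune: cannot beat the incumbent
            # Heuristically finish the partial design, e.g. map switches to racks to
            # minimize cable length; may return None if the node cannot be completed.
            candidate = complete(node)
            if candidate is not None:
                val = performance(candidate)
                if val > best_val:
                    best, best_val = candidate, val
            for child in expand(node):                    # e.g. pick the next switch model
                b = optimistic_bound(child)
                if b > best_val:
                    heapq.heappush(frontier, (-b, tiebreak, child))
                    tiebreak += 1
        return best

Because the completion step is a heuristic, the result is not guaranteed optimal, which matches the caveat above.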
>>: [inaudible] heuristics why don't you just add a cost factor or a constraint for [inaudible] the
optimization [inaudible]?
>> Andy Curtis: I did, so you're seeing in the formulation of the problem the cost. So I didn't do
that just to avoid the additional complexity of that, however, I think that given these pretty
good runtimes it would probably be possible to do that, but I haven't explored it. So to
summarize Legup, I developed this theory of heterogeneous Clos networks, implemented the
Legup design algorithm and then I evaluated it by applying it to our data center and I'll show
you more results later after I describe Rewire, but for now I will spoil some results and say that
for our datacenter it cuts the cost of an upgrade in half versus the fat-tree. So let me move on
to Rewire. With Rewire we are going to do away with the structure of the network entirely and
design entirely unstructured networks. I'm really motivated by this question of, so it seemed
like the structure was hurting us in a Clos network, and by allowing some amount of flexibility
we could do a lot better, so if we just use an arbitrary mesh how much better can we actually
do? The problem here is that now we have a really hard network design problem. The
heterogeneous Clos networks are still constrained enough that we could sort of iterate
through all of the different possibilities and evaluate them, but now we have a completely
arbitrary mesh, so there are many, many different networks for any given set of top of
racks. To solve this I used a simulated annealing algorithm, and the goal of this algorithm is
to maximize performance and by performance I mean bisection bandwidth minus the diameter
of the network. So if you don't know what bisection bandwidth is right now, I'll explain exactly
what that is in a minute, and the diameter is the worst case shortest path between any two top
of racks. I'm using diameter here as sort of a proxy for latency, because latency is actually very
hard to estimate. You need to know queuing delays and so on, so that's why diameter is just a
proxy for that.
>>: [inaudible] different units?
>> Andy Curtis: These are different units, so what I do is I scale it to be between zero and one,
so diameter, you can think that the best diameter in the network is one, one hop between all of
the nodes; the worst is a path, so you can scale that to be between zero and one, and then
bisection bandwidth I normalize this as well, so then you can weigh each of these by however
much you want. It does take some playing with the weights to get what you want, but
you can tweak it.
>>: So the answer [inaudible] weighting to produce an arbitrary scale factor?
>> Andy Curtis: Yes, exactly. So then Rewire maximizes the performance subject to the same
constraints as Legup: the budget, and your data center model if you give it one, but now
we have no topology restrictions. Here the costs we take into account are the costs of any new
cables you may have to buy, the cost to install or move cables and then the cost of any new
switches that you may add to the datacenter. So Rewire performs standard simulated
annealing, so at each iteration it computes the performance of a candidate solution and then if
that solution is accepted, it computes the next neighbor to consider and so on, and repeats this
until it's converged. Now we do have some heuristics for deciding the next neighbor to
consider, but I don't have time to cover that because I want to talk about how to compute the
performance of the network. It turns out there is no known polynomial time algorithm to find
the bisection bandwidth of an arbitrary network. So the bisection bandwidth is the minimum
bandwidth across any cut in the network, and we can find the bandwidth of a single cut pretty
easily. Let me denote the servers on one side of this cut by S, the others by S prime. Then the
bandwidth of this cut is equal to the sum of the link capacity crossing that cut, divided by the
minimum of the sum of server rates in S and the sum of server rates in S prime. For this specific
example we have four links crossing the cut. Let's assume they are unit capacity and then we
divide by the min of S has two racks of servers, so say there are 40 servers per rack and S prime
has six racks of 40 servers, so here the bandwidth of the single cut is 4 divided by 80. Then the
bisection bandwidth is the minimum bandwidth over all cuts, so on a tree-like network it's easy
to compute this because we can simply enumerate over all of the cuts. There are only O(n) of
them, we can compute that equation I showed on the previous slide, and we have a
polynomial time algorithm.
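Following the definition just given, the bandwidth of a cut separating server sets S and S', and the resulting bisection bandwidth, are

    \[
      \mathrm{BW}(S, S') \;=\;
        \frac{\sum_{e \in \mathrm{cut}(S, S')} c_e}
             {\min\bigl( \sum_{s \in S} r_s,\ \sum_{s \in S'} r_s \bigr)},
      \qquad
      \mathrm{BB} \;=\; \min_{(S, S')} \mathrm{BW}(S, S'),
    \]

so in the example above, 4 unit-capacity links cross the cut, S has 2 x 40 = 80 unit-rate servers, S' has 6 x 40 = 240, and the cut's bandwidth is 4 / min(80, 240) = 4/80 = 0.05.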
>>: [inaudible] definition of bisection bandwidth, normally you would just compute it as the
[inaudible] bandwidth traversing the cut, but you are dividing by the number of servers so it seems
like it's more like actually fair share bandwidth per server or something?
>> Andy Curtis: The reason I'm dividing by the number of servers is because we can have these
heterogeneous rates and so we need to take, on one half if here all of these servers were super
high-capacity like 10 gig links and these had one gig, then it's not just fair to divide, or not divide
by anything. And so we need to normalize it by the amount of capacity we actually expect to
cross that cut. As for why we divide by the min, let's assume homogeneous rates for now. Here
there are two racks of servers; over there, there are six. These six racks have six units of traffic,
but they can't push six units of traffic across here because these guys
can only receive two units, so that's why we divide by the min, and these guys likewise can only
push two out.
>>: Okay. I think I see that. I think I was calling it something other than bisection bandwidth
because that's already [inaudible] different…
>> Andy Curtis: Yeah, I think I probably should call it cut bandwidth or something, so people
have a notion of what that is, right? Okay, so this is easy to compute on a tree, however on an
arbitrary graph there could be exponentially many cuts; therefore, it would take exponential
time to compute this in the worst case. So if we stop to think about it, the traditional max flow
min cut theorem allows us to find the min cut in a network by solving this flow problem, so
we'd like to be able to do the same sort of thing here, but the problem is we don't have a single
flow problem; we actually have a multi commodity flow problem because each server can be
the source and the sink of flows. And in general, the min cut max flow theorem does not hold
for multi commodity flow problems. However, we've shown that in our special case there is a
min cut max flow theorem, and what we've shown here is that throughput in the hose traffic
model is equivalent to bisection bandwidth, so therefore if we
can compute the max throughput in this traffic model, then we have found the bisection
bandwidth and it turns out some guys at Bell Labs had done this a few years ago, so combining
these two results we get a polynomial time algorithm to compute the bisection bandwidth.
And then we just run the simulated annealing procedure as normal, computing this at each
iteration to find the performance of the candidate solution. So I'm going to move on to my
evaluation and here the question we really want to answer is well, how much performance do
we gain because of this additional heterogeneity in the network. So to evaluate this I tested
several different scenarios. First I tested upgrading Waterloo's data center network. Then I
tried iteratively expanding our network, so this is where I add a certain number of servers at each
iteration and use the output from one as input into the next iteration. Then I use these
algorithms to design brand-new or Greenfield datacenter networks, and then I also asked them
to design a new data center network and then iteratively expand it. So this is the cost model I
used. This is the cost of the switches; this is the cost of links. For switches it was very hard to
get good estimates on street prices of the switches, so these are the best I could find just by
Googling around. I wouldn't actually stand by these exact specific values, however, I think the
relative differences are meaningful. These are the prices that we used for links. To simplify
things I categorized links as short, medium, or long and then charged according to the length
and according to the rate. Then I charged a separate cost to actually install the link, and
that's because we charge this amount if you're going to move a link, and we charge that same
amount so that we don't recharge you for the link itself. So to compare my algorithms against
the state-of-the-art, these are the approaches I used. I compared against a generalized fat tree;
that's using the most general definition of a fat tree possible and here I don't explicitly
construct the fat tree; instead I just bound the best case performance so you are given a budget
and I bound the best case performing fat tree that you can build using that budget. Second, I
tested against a greedy algorithm. This algorithm just finds the link that improves the
performance the most, adds it and then repeats until it's used up its entire budget, or all ports
of the network are full. The third thing I tested against was a random graph and that's because
a group at UIUC has proposed using random graphs as data center networks and this is due to
the fact that random graphs tend to have really nice properties. So this is what our data center
network looks like. We have 19 edge switches. Each of these edge switches connects to 40
servers, so we have a total of 760 servers. Our edge switches are heterogeneous already and
all of our aggregation switches are the same model. They are all these HP 5406 switches,
however this is a modular switch and they do have different line cards so they are
heterogeneous as well. And this is our actual topology, so you can see that between the top
of racks and aggregation we only have a single link. There is no redundancy there. Additionally
our data center handles airflow quite poorly. There's no isolation between the hot and cold aisles,
so to model the fact that my algorithms can take thermal constraints into account, I simply
allow you to add more equipment into the racks closest to the chiller, so here this rack can take
I think it's 20 kW of equipment, whereas the racks at the other end can't take nearly as much
because they don't get as much cold air from the chiller. I want to emphasize that this is very
much just a first pass approach and if your datacenter was severely thermally constrained, you
would want to do something more sophisticated than this. So let me show the results of
expanding the Waterloo datacenter now. So this is our original network, and here I'm showing
the normalized bisection bandwidth and right below it I'm showing the diameter. And then
here I'll show the number of servers that we've added. So you can see our datacenter network
right now has a normalized bisection bandwidth of just over .01 and so this means that it is
oversubscribed by a factor of almost 100, so in the first iteration I added 160 servers and then
asked each of these algorithms to find an upgrade given a fixed budget. All of these algorithms
have the same budget and across iterations the budget is kept the same. You can see that the
fat tree was not able to increase the bisection bandwidth of the network while the other
approaches were, however, the fat tree is able to attach the new servers to the network
without decreasing the bandwidth of the network. You can see here that the greedy approach
and Rewire perform the same. They both significantly increased the bisection bandwidth and
actually decreased the diameter by one. Legup just increases the bisection bandwidth. Yeah?
>>: Before your work if they were going to upgrade probably the data center managers
would've just looked at it and intuited an upgrade plan annually. I'm wondering if you are
comparing it to that.
>> Andy Curtis: So I did ask them about an upgrade plan. Yeah, it's hard to compare against--I
didn't explicitly like measure what they would have done and they told me, you know, and
essentially our network doesn't need this high of performance and so I admit it's sort of, I'm
applying it to a network that doesn't need this kind of bandwidth. For instance, we only have
one link between our top of rack and aggregation switches, so what they told me is that the things
found by Legup are probably not what they would've thought of, but they seemed like okay,
this is an interesting solution. We can probably implement this. They had not considered like
the networks done by Rewire or the greedy approach because these are arbitrary meshes and
they don't want to do that because it would make it harder for them to manage their network.
But, you know, I'm trying to push the frontier of what's possible here, so that's why I think
these are still interesting. So as we keep iteratively expanding the network, adding more and
more servers, you can see that the performance gap between Rewire and the other
approaches grows, so here after we have added 480 servers, Rewire's network has four times
more bisection bandwidth than the fat tree. It does slightly go down in the next iteration and
that's because we're adding more servers, so we're adding more demand to the network,
however, it is able to decrease the diameter down to two, and because we have this multi-objective function, due to my weighting, it preferred decreasing the diameter over
increasing the bisection bandwidth. And you see that the greedy algorithm underperforms over
time, so initially it did extremely well, however, it wasn't able to increase the performance of
the network past this point, and that's likely because it made a poor decision here and over
time it isn't able to, then it used all the good ports and over time it doesn't change where things
are rewired, excuse me, where things are wired, so it locked itself into this sort of narrow
solution. So then the next scenario is just asking these algorithms to build a brand-new, or
Greenfield datacenter.
>>: I have a question if you don't mind. So how many runs are we looking at here? You said
that the greedy algorithm made a bad decision. Did you do that repeatedly or did you just…
>> Andy Curtis: Oh, right. So I did do it repeatedly. I'm not showing error bars here, but in the,
across all of the experiments, it seemed to do the same thing and that is actually a deterministic
algorithm, so yes, there is no…
>>: [inaudible] same inputs…
>>: You didn't perturb the system to, the error bars didn't represent the results of having
perturbed the system to give greedy an opportunity to behave differently.
>> Andy Curtis: Right. I mean, so that's because we are using this static input. We're using our
existing data center so we run a deterministic algorithm on it and we get the same result.
>>: [inaudible] seems like there would be error bars [inaudible].
>> Andy Curtis: Right. There are no error bars on this, so the error bars would be on Rewire
because it's a simulated annealing algorithm, however, for that we ran it enough times that
there--I didn't actually put them on this chart, but they were very small.
>>: I think the greedy algorithm criticism is trying to understand, is it seems like you should do
something, if greedy is subject to fall into pits, then you may have shown us a case where
greedy is trapped in a pit but it isn't always behaving that way. It maybe would be interesting
to ask what would happen if I perturbed the number of servers added a little bit in the way or
just provided some source of [inaudible] input that would let you explore more of this space
with greedy as well as, because Rewire as you pointed out already has some randomness in it
because it explored more of the space. Just to understand whether the greedy, because
otherwise the greedy, the comparison to greedy is rather arbitrary.
>> Andy Curtis: Okay. That's a fair criticism that I did not, didn't necessarily test enough
scenarios with greedy. In the paper there are more cases where we tested and we seem to
have found the same thing. Greedy initially does well and then over time doesn't do very well.
Okay. The next scenario is designing a brand-new data center. So for this we ask these
algorithms to connect a network with 1920 servers. So for this I am assuming that each
top of rack switch has 48 gigabit ports and 24 of these ports connect down to servers and 24 of
them are open and are free to build the switching fabric on top of using these open ports. Here
again, I am showing the same type of thing. We have bisection bandwidth on the vertical axis,
the diameter here and then the different approaches, so this is the, for a budget of, using a
quite small budget of $125 per rack, Rewire is the only algorithm that's able to build a
connected network and so I want to emphasize this budget does not include the cost of the top
of rack switch. This is only for cabling, aggregation, and core switches, so the reason Rewire is able
to build a connected network here but no one else is, is because the fat tree and Legup have to
spend money to buy aggregation and core switches, so their networks inherently at this low of a
budget cost more. The random network has some randomness in it so it's not able to
complete, to build a connected network, so that's why Rewire’s the only thing that has
connected topology here. As we increase the budget, you can see that Rewire starts to
significantly outperform the other approaches for all budgets except for $1000 per rack, where
the random network actually has more bisection bandwidth and this is, again, this is the
expected bisection bandwidth for the random network, so I'm not actually explicitly building
these random networks. I'm using a bound that is proven by some theoreticians on the amount
of bisection bandwidth that have--so you see that even Legup significantly outperforms the fat
tree designing brand-new networks, so here its network has twice as much bisection
bandwidth with a thousand-dollar per rack budget. Rewire really outperforms the fat tree, so
with the $500 per rack budget it has 68 times more bisection bandwidth. When we increase
the budget to $1000 per rack it has six times more bisection bandwidth, and in this case where
the random network actually beats Rewire in terms of bisection bandwidth, again, this is
because I use that multi-objective function, so Rewire preferred decreasing the diameter by
one rather than increasing the bisection bandwidth.
>>: So in other words Rewire could've found the random solution.
>> Andy Curtis: Yes.
>>: It just didn't want to. [laughter].
>>: It just didn't want to.
>>: You set the objective function therefore [inaudible].
>> Andy Curtis: Right. And I mean also you could seed Rewire with a random network and then
ask it to improve on that and it does, it can improve on that. I ran some experiments. The
performance gap though does seem to shrink as you use higher switch radices. The
higher the radix of the switch, the better the random network does.
>>: Did you tell Rewire to try to optimize the thing that you are showing on the y-axis here, the
bisection bandwidth?
>> Andy Curtis: It's trying to optimize bisection bandwidth minus diameter.
>>: Minus diameter, okay. So you're being a little unfair to your own algorithm here because
you're showing its performance on a metric that you didn't solely optimize for.
>> Andy Curtis: Exactly. So I'm trying to show both of those but it's hard to visualize them. So
here yes, it does have a lower diameter, yeah. Okay, now people have brought this up already.
The problem with moving towards these heterogeneous networks is management. In
particular, there are a few things that are hard on an arbitrary network. Routing is difficult on a
unstructured network. If we're talking about a heterogeneous Clos network, then we could
make minor modifications to architectures such as Portland and VL2 and be able to route on a
heterogeneous Clos network, and this is because fundamentally these networks are still tree-like,
so you just go up to the least common ancestor and back down. However, on an unstructured
network, it's quite a bit harder. There is one architecture that allows you to route on
unstructured networks. This is called Spain, by a group at HP Labs, and the way it essentially
works is it partitions the network into a bunch of VLANs and then does source routing across
these VLANs. For load balancing, one solution is we could schedule flows, and I
have two solutions for that I'll talk about next. Another solution is that Spain has
load balancing built in, where we do this source routing and the sources also do load
balancing. And then another option is to use multipath TCP, which has been shown to be able
to extract the full bisection bandwidth from random networks, so we would expect it to be able
to extract the full bisection bandwidth from our unstructured networks as well. Now it is unclear
how much it would actually cost to build and over the long-term manage these arbitrary
networks. However, I do believe that the performance per dollar in building the network
compensates for this. I'm going to move on to my third contribution now, which is a framework
to perform scalable flow-based networking in the data center. I'm going to apply this to
managing flows and the reason we want to do this sort of flow scheduling is for one maximizing
throughput on networks like I just showed these unstructured networks, but additionally even
on highly regular topologies like a fat tree, we can have the situation where flows collide on a
bottleneck link. If we just moved one of these flows over a bit, then we could actually double
the throughput of both of these flows. And it's been shown by a group at UC San Diego that
if you perform flow scheduling in the data center for at least some workloads, you can get up to
113% more aggregate throughput. However, their approach depends on OpenFlow and
OpenFlow is not scalable, which I'll get into in a minute, and so therefore their
approach is not scalable either. So I have two traffic management frameworks to solve this
problem. The first I'm calling Mahout, and Mahout uses end hosts to classify elephant flows,
where elephant flows to us are long-lived, high-throughput flows. Once an end host classifies
an elephant flow, it is set up at a controller, and the controller dynamically schedules just the
elephant flows to increase the throughput. Our second solution is called DevoFlow and I
worked on this jointly with a bunch of people at HP Labs and our goal here is to actually provide
scalable software defined networking in the data center. Software defined networking allows
us to write code that runs on a commodity server and manages the individual flows in our
network. This enables a type of programmable network because then we can just write the
software to manage the flows. This currently is implemented by the OpenFlow framework
which has been deployed at many institutions across the world. You can buy OpenFlow
switches from several vendors like NEC and HP. I think that OpenFlow is great. I think it's a
great concept but its original design imposes excessive overheads. Now to see why this is, let
me explain how OpenFlow works. This is what a traditional switch looks like where we have the
data plane and the control plane in the same box. So the data plane just forwards packets, and the
control plane exchanges reachability information and then builds routing tables based on that.
However, OpenFlow separates these two, so it looks something like this, where we have a
logically centralized control plane at a central controller and then OpenFlow switches are very
dumb switches that just forward packets, so any time a packet arrives at this OpenFlow switch
that it doesn't have a forwarding entry for, it has to forward it to the central controller and the
controller decides how to route that flow and then it inserts forwarding table entries in all of
the switches along the path flow. So the reason we want OpenFlow in the data center is
because it enables some pretty innovative network management solutions. Here is a partial
list. A few of the things that we can do with OpenFlow are things like consistently enforce
security policy across the network. It can be used to implement data center network
architectures such as VL2 and Portland. It can be used to build load
balancers from commodity switches, and, relevant to me, it can be used to do flow scheduling to
maximize throughput, or it can also schedule flows to build energy proportional networks, so
this works by scheduling the flows on the minimum number of links needed and turning off all
the unnecessary equipment. So it's great that OpenFlow can do all of these things, but
unfortunately it's not perfect and the reason why it's not perfect is because it has these scaling
problems, so implementing any of these solutions in a midsized data center will be quite
challenging. So our contributions with this work are first we characterize the overheads of
implementing OpenFlow in hardware and, in particular, we found that the overheads of
OpenFlow, there's obviously a bottleneck at the central controller since all flow setups need to
go through the central controller, that creates an obvious bottleneck. But we found that that's
not the real bottleneck. The real problem is at the switches themselves. We found that
OpenFlow is very hard to implement in a high-performance way in the switching hardware. So
to alleviate this we propose DevoFlow, which is our framework for cost-effective scalable flow
management and then we evaluate DevoFlow by using it to perform data center flow
scheduling. So I don't have enough time today to go over the overheads of OpenFlow so I'm
just going to skip to this and then go into our evaluation.
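For readers unfamiliar with the reactive flow setup just described, here is a toy sketch of it (Python; this is not the OpenFlow protocol API, and the route policy is a stand-in):

    class Switch:
        """Toy OpenFlow switch: a flow table plus a channel to the controller."""
        def __init__(self, name, controller):
            self.name = name
            self.flow_table = {}          # match (header) -> output port
            self.controller = controller

        def receive(self, header):
            port = self.flow_table.get(header)
            if port is not None:
                return f"{self.name}: forwarded {header} out port {port}"   # data plane hit
            # Table miss: punt to the logically centralized controller. In real hardware this
            # crosses the switch's management CPU, which is the bottleneck discussed later.
            self.controller.packet_in(self, header)
            return f"{self.name}: sent {header} to controller"

    class Controller:
        """Toy controller: picks a route and installs exact-match entries along it."""
        def __init__(self):
            self.routes = {}              # header -> list of (switch, output port)

        def packet_in(self, switch, header):
            for sw, port in self.routes.get(header, [(switch, 0)]):
                sw.flow_table[header] = port          # install forwarding table entries

    # The first packet of a flow misses and costs a controller round trip; later
    # packets of the same flow stay entirely in the data plane.
    ctrl = Controller()
    edge = Switch("edge1", ctrl)
    ctrl.routes["10.0.0.1->10.0.0.2"] = [(edge, 3)]
    print(edge.receive("10.0.0.1->10.0.0.2"))   # miss: controller installs the rule
    print(edge.receive("10.0.0.1->10.0.0.2"))   # hit: forwarded out port 3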
>>: Can I asked just a quick question before you go on?
>> Andy Curtis: Yes.
>>: I wasn't sure I understood. I would have expected the problem to be that every
single flow setup [inaudible] has to go through a centralized point, but you said it wasn't that,
that it was hard to put OpenFlow into individual switches, but I'm not sure which
part of that I [inaudible] understand. The statement, it seems like OpenFlow is always running
in a centralized place and not always [inaudible].
>> Andy Curtis: So OpenFlow also has to run in switches. Let me go, switch to my backup slides
here. So I didn't really show the full picture of what this architecture looks like. We have the
ASIC, which is hardware specialized for forwarding packets, but the switch also has a
CPU, and that's used for management functions, and so any time we do this flow setup through
the ASIC, it has to go through the CPU and the reason it goes through the CPU is because it
needs to perform SSL and also it needs to perform TCP between the switch and the controller
and so this CPU in switches today is pretty wimpy. It can't handle the load of setting up all of
the flows and so we did simulations and we said well, obviously let's just put a bigger CPU in
there and maybe that will work. We found the CPU would need to be two orders of magnitude
faster than what is currently implemented in HP's hardware at least. We're not sure about
other manufacturers, but for HP it would need to be two orders of magnitude faster.
>>: [inaudible] talking to centralized them?
>> Andy Curtis: Yeah.
>>: So what kind of hardware is the centralized one?
>> Andy Curtis: Essentially the thing is that it could be a server or whatever you want.
>>: So you're saying this switch controls [inaudible] CPUs are so slow that even though there
are 200 times as many of them they have a bottleneck.
>> Andy Curtis: Yes. Unless you can go, I mean, for hundreds of thousands of servers then the
centralized controller could be a bottleneck, but you can still distribute it to alleviate the
pressure.
>>: [inaudible] really quick how you determined that [inaudible] CPU load is 100%
over…
>> Andy Curtis: Right, so the CPU load is 100%, and so we measure, yeah, you can measure that the CPU
load is 100% and so we perform this experiment where we try to just serially set up flows and
we found that the 5406 switch could only set up 275 flows per second and then the CPU was
100%. Yeah.
>>: So is it setting up a new SSL connection every time?
>> Andy Curtis: It's not, no, so it's, it sets that up just once. I mean it's not that dumb to do it
every time, but it's still very high overhead at the control point in the switch. I mean, to show
you we can expect bursts of up to 10,000 flows per second at an edge switch in the data center, so it's 40
times more than the switch can currently handle and it does create a lot of latency in this flow
set up, so we measured the amount of time it took to just set up this flow and it could take 2
milliseconds.
>>: Why is the [inaudible] so severely handicapped?
>> Andy Curtis: I mean, part of the problem is that these are commodity switches. Like if you
were to go out and buy a high-end router, I mean they could probably get rid of a lot of these
problems. The second problem is that these switches right now aren't designed for OpenFlow,
so they're designed to do normal switching stuff and then they're adding OpenFlow on top of it
and for the major vendors now they're not going to let OpenFlow drive their switch
development for at least five years or so. I mean I do think it's a great opportunity for some
startup to come in and build specialized OpenFlow switches.
>>: [inaudible] is it [inaudible]?
>> Andy Curtis: Just an edge switch. So yeah, in the data center measurements, they show
that you can have bursts of up to 10,000 flows per second. So the way that DevoFlow works is
we want to find the sweet spot between the fully distributed control and fully centralized
control. So DevoFlow stands for devolved OpenFlow and the idea is that we're going to devolve
control of most flows back to our switches. So our design goals were these. We want to keep
most flows in the data plane and that's to avoid the latency and the overheads of setting them
up at the central controller. Then we want to maintain just enough visibility for effective flow
management, so we only want to maintain visibility over the flows that matter to you, not all of
the flows. And then our third goal is to actually simplify the design and implementation of high-performance switches, because as I said, it's difficult to do this in OpenFlow right now. You
need a fast switch CPU. If you really increase the speed of the CPU you may have to rearchitect
the switch itself, so we want to do away with all of that by keeping most flows in the data
plane. So the way we do this is through a few different mechanisms. We propose some control
mechanisms and some statistics gathering mechanisms. Now one thing I didn't talk about is
that collecting the statistics from flows also has quite high overhead and it's because OpenFlow
offers one way to gain visibility of your flows and that is you can ask it for all of the forwarding
table counters and say how many bytes did each flow transfer in the last, you know, since I last
polled you. Doing this is very high overhead because you have to get the statistics for
every single flow out of the switch. So the control mechanisms we propose, the first we call rule
cloning. And the idea here is that the ASIC itself in the hardware should clone, should be able
to clone a wildcard rule. OpenFlow has two different types of forwarding rules. It has wildcard
rules, which can have wildcards, and exact-match rules. So the way rule cloning works is, if
a flow arrives and it matches say this rule, then it can duplicate that and add a specific entry for
that flow into the exact match table. The reason we want to do this is so that we can gain
visibility over these flows. We can pull the statistics and just gain visibility over the flows that
we cloned. Then the other things that we proposed are some local actions, so rapid rerouting
so you can specify fallback ports and so on if a port failure happens, and then also we
proposed some multipath extensions, so this allows you to select the output port for a flow
according to an arbitrary probability distribution, so if a flow arrives, you can select its output
according to some distribution such as this, and this allows you to do static load balancing on
mesh topologies.
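A sketch of that multipath mechanism (per flow, not per packet, as he clarifies next; the hashing scheme and names here are illustrative, not the DevoFlow specification):

    import bisect
    import hashlib
    import itertools

    def choose_output_port(flow_key, ports, weights):
        """Pick an output port for a flow according to a probability distribution.

        Every packet of the same flow hashes to the same port, so there is no
        reordering; the weights give static load balancing across a mesh topology.
        """
        total = sum(weights)
        cumulative = list(itertools.accumulate(w / total for w in weights))
        # Hash the flow identifier (e.g. its 5-tuple) to a number in [0, 1).
        h = int(hashlib.md5(flow_key.encode()).hexdigest(), 16) / 16**32
        return ports[min(bisect.bisect_left(cumulative, h), len(ports) - 1)]

    # Example: send roughly half the flows out port 1 and a quarter each out ports 2 and 3.
    print(choose_output_port("10.0.0.1:5000->10.0.0.9:80/tcp", [1, 2, 3], [0.5, 0.25, 0.25]))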
>>: Are you proposing [inaudible]?
>> Andy Curtis: Implementing this is a little bit harder. The only thing is, we haven't
implemented DevoFlow in hardware, so for all of these things we talked to HP's ASIC
designers and they assured us that this should be relatively easy to do in the hardware.
>>: [inaudible].
>> Andy Curtis: Right, so I'm not saying per packet, I'm saying per flow.
>>: That's the trick right there. If you're not doing per packet [inaudible] numbers okay the
desired statistics becomes really hard because then, you know, who knows what's going to
happen. If you do it by packet it's easier to get the distribution you want. You start doing per
flow and then you have to estimate flow sizes and things like that.
>> Andy Curtis: Right. We are not taking that into account, but the theory says that if there are
lots of mice, then this gives you optimal load balancing. If there are lots of elephant flows, then
you can run into these problems that you mentioned. And I'll get into that, so we're actually
going to do some flow scheduling to schedule the elephant flows exclusively. The statistics
gathering mechanisms that we proposed are first of all you can just turn on sampling. Most
commodity switches already have sampling and in our experiments we found that this does
give enough visibility over the flows. Another is triggers and reports, so you can set a rule that
if a forwarding table rule has forwarded a certain number of bytes then it will set that flow up
at the central controller so this allows you to gain visibility over just elephant flows. And then
there is an additional way that we can add approximate counters which allow you to track all
the flows matching a wildcard rule. This one is much harder to implement in hardware so I
won't use it for my evaluation. But the idea here is that unlike OpenFlow, we want to provide
visibility over a subset of flows instead of all the flows. So I mentioned that we haven’t
implemented it but we can use existing functional blocks in the 86 for most mechanisms. So
DevoFlow provides you the tools to scale your software to find networking applications,
however, it still might be quite challenging to scale it. Each application will be different. The
idea is essentially you need to define some sort of notion of a significant flow to your
application. In flow scheduling, which am going to show, it's easy to find the significant flows.
They are just elephant flows. For other things like security, it may be more challenging and so
that's why for now I'm only going to show how to do flow scheduling with DevoFlow. So the
idea here is that new flows that arrive are handled entirely within the data plane by using these
multipath forwarding rules for new flows, and then the central controller uses sampling or
triggers to detect elephant flows and the elephant flows are dynamically scheduled by the
central controller and then the scheduling is done using a bin packing algorithm. So in our
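As an illustration of that last step, here is a small Python sketch of one possible greedy bin-packing heuristic: detected elephant flows are placed, largest first, on the candidate path whose most-loaded link is lightest. The heuristic actually used in the work may differ, and the topology, flows, and rates below are made up for the example.

```python
# Sketch of the controller-side scheduling step: place each detected elephant
# on the least-congested of its candidate paths, largest flows first (a simple
# greedy bin-packing heuristic; not necessarily the exact algorithm from the
# work). The topology, flows, and rates are made up for illustration.
def schedule_elephants(elephants, candidate_paths, link_load):
    """elephants: {flow: estimated rate}; candidate_paths: {flow: [path, ...]}
    where a path is a tuple of links; link_load: {link: current load}."""
    placement = {}
    for flow, rate in sorted(elephants.items(), key=lambda kv: -kv[1]):
        # Choose the candidate path whose most-loaded link is lightest.
        best = min(candidate_paths[flow], key=lambda p: max(link_load[l] for l in p))
        placement[flow] = best
        for link in best:                     # account for the flow we just placed
            link_load[link] += rate
    return placement

load = {"A-B": 0.0, "A-C": 0.0, "B-D": 0.0, "C-D": 0.0}
flows = {"f1": 0.6, "f2": 0.4}                # normalized rates from sampling/triggers
paths = {"f1": [("A-B", "B-D"), ("A-C", "C-D")],
         "f2": [("A-B", "B-D"), ("A-C", "C-D")]}
print(schedule_elephants(flows, paths, load)) # f1 and f2 end up on disjoint paths
```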
So in our evaluation we want to answer this question: how much can we lower the overheads of
OpenFlow while still achieving the same performance as a fine-grained flow scheduler? We
found that if you perform this flow scheduling, you can increase throughput by 37% for a shuffle
workload on a Clos network and by 55% on a 2-D HyperX, and these numbers really depend on
the workload. I also reverse engineered a workload published by Microsoft Research, and I
didn't see any performance improvement using that workload. The biggest reason is that there
are no bursts of traffic in it; it's just flat, because I reverse engineered it from the flow
inter-arrival times and the distribution of flow sizes.
>>: I understand that evaluation. You are saying that you wanted to reduce the overhead on
the control plane while keeping the same performance, but your results are increases in
performance?
>> Andy Curtis: Right, I'm going to get into that. I just want to show that if you do flow
scheduling, these are the kinds of performance increases you can get with it versus ECMP, that
is, using just static load balancing.
>>: [inaudible] fine-grained flow scheduling versus random ECMP, essentially.
>> Andy Curtis: Yeah, so this is fine-grained scheduling over DevoFlow versus randomized ECMP.
Here are the results for the overheads. The vertical axis is showing the number of packets per
second to the central controller. These are simulations where we simulated OpenFlow based on
our measurements of our real switch. You can see that if we used OpenFlow's pull-based
mechanism to collect statistics, then we had about 7700 packets per second going to the
controller. If we used DevoFlow-based mechanisms such as sampling or this threshold to gain
visibility over the flows, then we can reduce the number of packets to the controller by one to
two orders of magnitude, and again, this is because we're only gaining visibility over the
elephant flows here. At least with the thresholds it's only the elephant flows; with sampling we
are collecting packet samples across all flows, but it's still lower than collecting statistics on
every single flow.
>>: What is the cost in performance?
>> Andy Curtis: There is no cost in performance; for the same performance, this is the decrease
in overheads. This is showing the number of flow table entries at the average edge switch. With
OpenFlow we have over 900 flow table entries; with DevoFlow we can reduce that by 75 to 150
times, because most flows are routed using a single multipath forwarding rule and we only need
to add specific exact-match entries for elephant flows.
>>: How are you evaluating [inaudible] because this wasn't implemented; it was simulated,
right?
>> Andy Curtis: Right, it's simulated. I basically simulated a lot of different scenarios and did a
sort of binary search to get the performance the same.
>>: [inaudible].
>> Andy Curtis: The performance is aggregate throughput.
>>: Okay.
>> Andy Curtis: So on this workload we're getting the same aggregate throughput using the
fine-grained scheduler as with DevoFlow, and these are the overheads.
>>: Okay. So you're assuming a traffic pattern of some kind, you're simulating what decisions
were made as to which path gets which flow, and you're figuring out how much aggregate
throughput there was.
>> Andy Curtis: Right.
>>: So then you're not simulating a [inaudible] and packets or…
>> Andy Curtis: No. We are doing fluid level simulations here.
>>: Okay. Because TCP sometimes has cliff performance, where dropping twice as many
packets might give you 1/10 the bandwidth.
>> Andy Curtis: Right, we're not; these are flow-level simulations here.
>>: So is this DevoFlow similar to [inaudible] in the sense that you only schedule the big flows,
except Hedera did that [inaudible] did this thing in the switch?
>> Andy Curtis: So Hedera, this is essentially Hedera here, because they are using OpenFlow's
statistics-pulling mechanisms to gain visibility over the elephant flows. That is what Hedera
does, and then DevoFlow is using our more efficient statistics-gathering mechanisms to only
look at the flows that matter, the elephant flows.
So to summarize DevoFlow: we first characterized the overheads of OpenFlow, then we
proposed DevoFlow to give you the tools you need to reduce your software-defined networking
applications' reliance on the control plane, and then we showed that, at least for one
application, flow scheduling, it can reduce overheads by one to two orders of magnitude. I want
to briefly summarize the cost savings that are possible because of my results here.
The network equipment alone is 5 to 15% of the total cost of ownership of a datacenter. Legup
can basically cut the cost of your network in half, and Rewire can cut it by as much as an order
of magnitude, so these two approaches can significantly reduce the total cost of ownership of
your datacenter. Server utilization is also often low because of network limitations, so if you
can extract more bisection bandwidth from your network, you might be able to deploy fewer
servers. Since the servers are the majority of the cost of your datacenter, you may be able to
find some cost savings there as well. So to go over my future work, I want to do quite a few
things.
First of all, I would like to work on some more theory-type results, which I think are really
interesting. I did use expander graphs as a data center network; expander graphs, if you're not
familiar with them, are graphs that are essentially rapidly mixing. In the Rewire work we
switched out our objective function: instead of saying maximize bisection bandwidth, we asked
it to maximize the spectral gap of the graph, which is the notion of a good expander. We found
that these graphs actually performed extremely well, so I'm really interested; I think there is a
connection between the expansion properties of the graph and its bisection bandwidth, and I'm
going to be interested in exploring that further.
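For reference, here is a minimal Python sketch of the objective being swapped in: scoring a topology by its spectral gap, taken here as the second-smallest eigenvalue of the normalized Laplacian (one common definition of expansion); this is not the Rewire code, and the tiny four-node graphs are purely illustrative.

```python
# Minimal sketch, assuming the spectral gap is measured as the second-smallest
# eigenvalue of the normalized Laplacian; larger values indicate a better
# expander. The two toy topologies below are illustrative only.
import numpy as np

def spectral_gap(adj):
    """adj: symmetric 0/1 adjacency matrix (numpy array) of a connected graph."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian
    eigvals = np.linalg.eigvalsh(lap)                        # sorted ascending
    return eigvals[1]                                        # smallest is ~0 for connected graphs

ring = np.array([[0, 1, 0, 1],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [1, 0, 1, 0]], dtype=float)   # 4-node ring
full = np.ones((4, 4)) - np.eye(4)             # 4-node complete graph
print(spectral_gap(ring), spectral_gap(full))  # the complete graph has the larger gap
```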
There is quite a bit of systems work I want to do. I think it would be really cool to go out and
build an architecture specifically designed for unstructured data center networks. Like I said,
HP has one, but I don't think it's the last word on that; there are a lot of problems still open in
managing, you know, multiple data centers, and I'm interested in inter-data-center networks. I
also have an ongoing project on deadline-aware big data analytics, adding deadlines to these
big-data-type queries. Another thing I'm interested in is green networking systems; we have
submitted a paper on reducing the carbon emissions of Internet-scale services.
So if you think about my work so far, I have worked on the datacenter infrastructure part of
things, and I want to move up the stack: up to cloud computing and big data analytics, and,
more than that, to applications on top of big data analytics and all of these things. I want to
work on questions like: how can we upgrade our cities? How can we apply these same sorts of
analytical and theoretical techniques to jointly design smart grids, transportation systems, and
city services like police and fire services? Another thing I would like to apply this to is
zero-energy legacy buildings. People have ways right now of constructing buildings that use no
energy whatsoever, that are completely carbon neutral; I'd like to develop low-cost ways to
retrofit existing buildings to be the same way.
I think this is really a grand challenge, because in the next 40 years we can expect 2 to 4 billion
people to move into cities, so our cities are going to grow tremendously and the number of cars
on the road will double. If we don't have smart people thinking of ways to solve these
problems, then we are going to have overcongested roads and too many people moving into
crummy neighborhoods with bad infrastructure. One project I've already worked on along these
lines looks at the return on investment for taxi companies transitioning to electric vehicles, and
we do find that with today's gas and electricity prices, it's actually profitable for taxi companies
right now to move to electric vehicles. To conclude, I developed a theory of high-performance
heterogeneous interconnection networks. I built two datacenter network design frameworks,
Legup and Rewire. My evaluation of these shows that they can significantly reduce the cost of
data center networks, and then I proposed DevoFlow for cost-effective scalable flow
management. And that's it. Are there any questions? [applause]. Yes?
>>: So when you were talking about the evaluation of Rewire, you said something really
interesting. I was asking if you compared it against a manually designed network, and as part of
the answer you said, well, the guys looking at the Rewire solution said we hadn't thought of
doing it that way, but we would never actually do it because it would be difficult to manage. So
I'm wondering: are these physically realizable? Are they practical?
>> Andy Curtis: I think they are.
>>: They didn't think they were. [laughter].
>> Andy Curtis: I mean, IT guys are resistant to change, so the cabling would be a problem. At
smaller scale, like within a container, it would be doable, and then you could either use a
different solution or run Rewire at the inter-container scale. I think the really hard problem
right now is how you route on an arbitrary mesh this big, and that is going to take some work.
But the purpose of doing this work is to see what's possible: if we get away from what we're
doing right now and think ahead, how much better could our networks be? That's really the
question that motivated the work. Yes?
>>: [inaudible] DevoFlow thing, so you talked about the security thing, and when they first
talked about OpenFlow [inaudible] they said [inaudible] control. That's something OpenFlow
can provide; it's a great thing. But they were talking mostly about enterprise networks, so I
assume that in the datacenter domain maybe that kind of [inaudible] control is not that
important, so maybe that kind of [inaudible] is not that necessary in the future, and maybe, you
know, it's fine to do some kind of control plane pushing some of this [inaudible] to the switches
and so on, so that you can reduce this overhead [inaudible].
>> Andy Curtis: I want to emphasize that we weren't the first people to say let's use OpenFlow
in the data center. Other people have done that before us, and so we looked at all that work
and said, well, this is really cool work, but will it work? That was the goal of the DevoFlow work:
to give people ways to actually use OpenFlow in the data center. I mean, I don't necessarily
think that OpenFlow is the right solution for security in the data center and…
>>: Why's that?
>> Andy Curtis: Because, as you mentioned, per-flow security in the data center might not be
doable, and so we may need other things. But one way you could use DevoFlow is not on a
per-flow but on a categorical basis: route your traffic through a set of middleboxes that apply
these security features. That could be doable using these wildcard rules. Yeah?
>>: I have a question on your DevoFlow evaluation. You said you found a [inaudible] is the CPU
on each switch for flow setup, and that limits the number of [inaudible]. But then the
evaluation was looking at metrics like packets per second of statistics and flow table sizes and
things like that. Were you trying to use these as a proxy for flow setup, or is there a missing link
there that I didn't get?
>> Andy Curtis: Yes, we were trying to use that as a proxy for flow setup. If we are setting up an
order of magnitude fewer flows through the controller, that means an order of magnitude
fewer flows are being set up through the switch CPU as well.
>> Ratul Mahajan: If there are no other questions, please thank our speaker again. [applause].