>> John (JD) Douceur: So I would like to introduce Meg Walraed-Sullivan who is a PhD
candidate at UCSD, currently advised by Amin Vahdat and Keith Marzullo. Despite the fact
that she has not finished her PhD, she actually already has quite a history with Microsoft:
between her Masters and PhD she worked for a year in the Windows Fundamentals Group
working on appcompat, and she's also had two postdocs with Doug Terry at MSR SVC.
>> Meg Walraed-Sullivan: Internships.
>> John (JD) Douceur: Excuse me?
>> Meg Walraed-Sullivan: Internships.
>> John (JD) Douceur: Internships, excuse me.
>>: Very, very short post docs. [laughter].
>> John (JD) Douceur: Very, very short post docs, really short.
>> Meg Walraed-Sullivan: Summer postdocs [laughter].
>>: Pre-docs.
>> John (JD) Douceur: Pre-docs, thank you. And in fact we have at least one other person in the
audience who has been an intern of Doug Terry. For the next few days she will be interviewing
with us for a postdoc, got it right that time, position with the distributed systems and operating
systems research group. Take it away.
>> Meg Walraed-Sullivan: Thanks JD. Today I'm going to talk to you about label assignment in
data centers, and this is joint work with my colleagues Radhika Niranjan Mysore and Malveeka
Tewari from UCSD, Ying Zhang from Ericsson Research, and of course my advisors. What I really
want to tell you about today is the problem of labeling in a distributed network. Anytime we
have a group of entities that want to communicate with each other, they are going to need a
way to refer to one another. We can call this a name, an address, an ID, a label. I'm going to
take the least overloaded of these terms and say a label. To give you an idea of what I mean by
a label, historically we have seen labels all over the place. Your phone number is your label
within the phone system. Your physical address is your label for snail mail, and for internet-type
things my laptop has a MAC address and an IP address, so those are labels. So the problem with labeling
in the data center is actually a unique problem because of some special properties of data
centers. So when I talk about a data center network what I mean is an interconnect of switches
connecting hosts together so that they can cooperate on shared tasks. These things as I'm sure
you know are absolutely massive. We are talking about tens of thousands of switches
connecting hundreds of thousands of servers, potentially millions of virtual machines, and just
to drive this point home I've got the canonical 'yes, data centers are big' picture here, I'm sure
you're aware [laughter]. So another property that's interesting about data centers is that we
tend to design them with this nice regular symmetric structure. We often see multi-rooted
trees, and an example of a multi-rooted tree is a fat tree which I've drawn on this slide. But the
problem is, even with the best laid plans, reality doesn't always match the blueprint. As we grow the
data center we're going to be adding and removing pieces. We are going to have links,
switches, hosts all sorts of things failing and recovering. We could have cables that are
connected incorrectly from the beginning, or maybe somebody's driving a cart down between
racks and knocks a bunch of stuff out and puts it back incorrectly. Basically we don't get to take
advantage of this nice regular structure all of the time or at least we can't expect it to always be
perfect. So just to give you some context, what might we label in a data center network?
Ultimately we have end hosts trying to cooperate with each other. So a host puts a packet on
the wire and it needs a way to express where this packet is going. It needs a way to say what it
is trying to do. So some things that we might label, for instance, are switches and their ports,
host NICs, virtual machines, this sort of thing just to give you some concept of what we're trying
to label here. So for now let's talk about the options that we have today. On one end of the
scale we have flat addresses and the canonical example of this is Mac addresses assigned at
layer 2. Now these things are beautiful in terms of automation. They are assigned right out-of-the-box. They are guaranteed to be unique for the most part, so we don't have to do any work
in assigning them and that's fantastic. On the other hand, we run into a bit of a scalability issue
with forwarding state. Now switches have a limited number of forwarding entries that they can
store in the forwarding tables. So this means that they have a limit in terms of a number of
labels that they can know for other nodes in the network. When we're talking about flat addresses,
we run into this problem where each switch is going to need a forwarding entry for every node
in the network, and at the scale of the data center this is just more than we can fit in our switches
today. Now you could argue that we should just buy bigger switches, but remember we are
buying tens of thousands of them so we are probably going to try to stick to the cheapest
switches we can. So from there we might consider a more structured type of address and
usually we see something hierarchical, and the canonical example of this is IP addresses assigned
with DHCP. Now here we solve the issue of the scalability in the forwarding state. This is
because what we do is we have groups of hosts sharing IP address prefixes, and so a group of
hosts can actually take their prefix and have it correspond to one forwarding entry in a
switch farther away in the network. So we allow sharing and this compacts our forwarding
tables. On the other hand, if we're looking at something like IP, someone's got to sit around
and figure out how to make these prefixes all work. Someone's got to partition the address
space and spread it across the network appropriately, configure subnet masks and
DHCP servers, make sure all of the DHCP servers are in sync with each other and in sync with the
switches, and this is really unrealistic to expect anyone to do. This is a significant pain point of
scale. So more recently there have been several efforts to combine the benefits that we see at
layer 2 or layer 3 and try to address these things. I'm only going to talk about two today. These
are the two that are most related to the work that I'm going to tell you about. These two are
Portland's location discovery protocol which was done by my colleagues at UCSD and then
MSR's DAC, data center address configuration. Now one point I want to make about both of
these is that they both somewhat rely on a notion of manual configuration via their leverage of
blueprints. So there is some notion of intent of what we want the topology to look like, but more
importantly, both of these systems rely on centralized control. Now I can make the usual
comments about centralized control. There's a bottleneck. There is a single point of failure,
that sort of thing, but I think that these things have largely been addressed. What I really want
to talk to you about is this idea of how we get the centralized controller connected to all
100,000 nodes. This is a problem. They can't all be directly connected to the centralized
controller; otherwise we'd have a 100,000-port switch, and that's pretty cool, but we don't have
that. So in order to get all of these components to be able to locate and communicate with our
centralized controller, we're going to need some kind of separate out of band control network
or worse we're going to have to flood. So you could say an out of band control network is not
so bad. It's going to be much smaller than our current network, but relatively smaller than
absolutely massive is still pretty big. So someone's going to need to deploy all of the gear for
this and it's going to need to be fault tolerant and redundant and somebody is going to need to
maintain it, so this is again something that we run into a problem that we are trying to avoid
before. We don't want to have to administer and maintain all this stuff. So let's try to tease
out the trade-off that we're really looking at here. What we have is some sort of trade-off
between the size of the network that we can handle and the management overhead that we
have associated with assigning labels. And this trade-off looks like this. As the network size
grows, there's more management overhead, and this is just a concept graph. It's not meant to have
any particular slope, maybe it's not even a line, but the trend is up and to the right. The
interesting thing about this concept graph is that someplace along this axis of
network size, we have some sort of hardware limit. And to the left of this limit we're talking
about networks that are small enough that we can afford a forwarding entry in the forwarding
tables per node in the network. And so to the left of this limit we are free to use flat
addresses like MAC addresses. However, to the right of this limit, this is where we run out of
space in the forwarding tables and we can't fit an entry per node in the network, so we're going
to have to embrace some structure in our addresses, some sort of hierarchical label. So just to
give you a frame of reference, we talked about ethernet and ethernet sits to the bottom left of
this graph, almost no management overhead but small networks. IP on the other hand is going
to sit towards the top right. We can have very large networks, but we have to deal with the
management overhead. So like LDP and DAC, our goal here is to try to move down vertically
from IP to get to this target location where we get to support really large networks with less
management overhead. The way I'm going to show you how to do that today is with some
automation. Now of course we all know that there is no free lunch. There is a cost to
everything, so what I want to point out is if we are going to embrace the idea of automation,
then the network is going to do things for us on its own time. Now we can set policies for how
it's going to do things, what it's going to do, but ultimately it is going to react to changes for us.
Additionally, if we're going to do something with structured labels, that means that our labels
are going to encode the topology. That means that when the topology changes those labels are
going to need to change and since the network is taking care of things automatically, it's going
to react and change those labels. So this is a concept that we're going to have to embrace, the
concept of relabeling where when the topology changes the network is going to change labels
for us. So now that you have a rough idea of what we're trying to do, what I'm going to present
you today is ALIAS. It's a topology discovery and label assignment protocol for hierarchical
networks, and our approach with ALIAS is to design hierarchical labels so we get this benefit of
the scalable forwarding state that we see with structured addresses, to assign them in an
automatic way so we don't have to deal with management overhead of 100,000 nodes, and to
use a decentralized approach so we don't need some sort of separate out of band control
network. Now the way that I like to do systems research is a little bit different. I like to look at
things from this kind of implementation, deployment, measurement side of things as well as
from a more formal side, from a proof and formal verification side. So today what I'm going to
be talking to you about is actually two complementary pieces of work: one that was an
implementation and deployment type of thing that was in the Symposium on Cloud Computing last
year, and one that is a more theoretical piece of work from distributed computing last year, and
these two things actually combine together to form ALIAS, this topology discovery and label
assignment protocol that I've just introduced to you. Just to formalize the space that we are
looking at right now, what we often see in data center networks is multi-rooted trees, and what
I mean by a multi-rooted tree is a multistage switch fabric connecting hosts together in an
indirect hierarchy. Now an indirect hierarchy is a hierarchy where we see servers or hosts at
the bottom, at the leaves of the tree of switches, instead of connected to arbitrary switches.
We also often see peer links, and by peer links I mean a link that I have drawn horizontally on
this picture so a link connecting switches at the same level. Now one thing I want to call your
attention to about this graph, and this is just an example of a multi-rooted tree, is that we have
high path multiplicity between servers. What this means is that if I take two servers and look at
them, there are probably many paths by which they can reach each other, maybe some
link or node disjoint paths. And our labels are ultimately going to be used for communication,
so it would be nice if at the very least our labels didn't hide this nice path multiplicity, if they
allowed us to be able to use that when we communicate over the top of them. So I want to
give you a very brief overview of what ALIAS labels look like just to give you the concept. In
ALIAS, switches and hosts have labels, and labels encode the shortest physical path from the root
of the hierarchy down to a switch or a host. So there might be multiple paths from the root of
the hierarchy down to the switch and so that switch may have multiple labels. So to give you
an example that teal switch labeled G actually has four paths from the root of the hierarchy, so
it's got four labels and similarly its neighboring host H in orange has four labels as well. So as
you can see we've encoded not only location information in these labels, but ways to reach the
nodes. Now a few slides ago I made a comment about having too many labels and now I've just
introduced the concept of having multiple labels per node, so it would seem I just made the
problem worse instead of better. But in a few slides what I'm going to do is show
you how to leverage the hierarchy in the topology to compact these things, to make some
shared state so that we don't have so many labels. Now almost any kind of communications
scheme would work over ALIAS labels. Obviously something that leverages the path encoding
in these labels and the hierarchical structure would be the most clever. We implemented
something that actually does leverage this, but I won't have much time to talk about that today.
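To make the "one label per downward path" idea concrete before getting to forwarding, here is a minimal sketch in Python; the topology and names are my own toy example, not from the talk or the ALIAS code, and it simply enumerates the downward root-to-host paths that ALIAS labels would encode:

```python
# Toy multi-rooted tree: roots A, B; aggregation switches d, e; edge switch g; host H.
# Edges point downward (parent -> children). Names are illustrative only.
down = {
    "A": ["d", "e"],
    "B": ["d", "e"],
    "d": ["g"],
    "e": ["g"],
    "g": ["H"],
}

def downward_paths(node, target):
    """Enumerate all downward paths from `node` to `target`."""
    if node == target:
        return [[node]]
    paths = []
    for child in down.get(node, []):
        for tail in downward_paths(child, target):
            paths.append([node] + tail)
    return paths

# Every root-to-H downward path corresponds to one label for host H.
all_paths = [p for root in ("A", "B") for p in downward_paths(root, "H")]
for p in all_paths:
    print(" -> ".join(p))
print(f"H would carry {len(all_paths)} labels in this toy example.")
```

With two roots and two aggregation switches above the edge switch, this prints four paths, which is why the host in the slide's example carries four labels.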
So I just want to give you some context in terms of what the forwarding might look like. So
what you can think of for the purpose of today's talk is some sort of hierarchical forwarding
where we actually pass packets up to the root of the network and then have the downward
path be spelled out by the destination label. So if someone wants to get a packet to host H it
just needs to get that packet up to A, B, or C and then let the downward path be spelled out by
the destination label. So what do these labels really look like and how do we assign
them? What does this protocol look like? Well, ALIAS works based on immediate neighbors
exchanging state at some tunable frequency. What I mean by immediate neighbors is we never
gossip anything past anyone directly connected to us. And we have four steps in the protocol
that operate continuously. Now when I say continuously, what I mean is they operate as
necessary. If something changes, then state begins getting exchanged again. If someone has
nothing new to say, then the state exchange just reduces to a heartbeat. And these four steps
are going to be the following. First we overlay hierarchy on the network fabric. Now
remember when a switch comes up it has no idea what the topology looks like, has no idea
where it is in the topology, what level it's at and so on, so it needs to figure this out. Next we’re
going to group sets of related switches into what we’re going to call hypernodes. After that
we're going to assign coordinates to switches. Now a coordinate is just a value from some
predetermined domain. So for the purpose of this talk we are going to use the English alphabet
as the domain and letters as coordinates, but it's just any value from the domain. And lastly
we’re going to combine these coordinates to form labels. So I'm going to tell you about each of
these steps in detail so that they make a little bit more sense. So first thing we have to do is
overlay hierarchy. So in ALIAS what we say is that switches are at levels one through n, where
level one is at the bottom of the tree and level n is at the top. We bootstrap this process by
defining hosts to be at level 0. So the way our protocol works is when a switch notices that it is
directly connected to a host, it says hey, I must be at level one. And then during this periodic
state exchange with its neighbors, its neighbors say hey, my neighbor labeled itself as level one,
I must be level two. And so on for level three and this can work for any size hierarchy. This can
continue up any size tree and the beauty of this is that only one host needs to be up and
running for this process to begin. So now that we have overlaid hierarchy on the fabric, our
next order of business is to group sets of related switches into what we are going to call
hypernodes.
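Before moving on to hypernodes, here is a minimal sketch in Python of that level bootstrapping (my own illustration, not the ALIAS implementation; the names and the synchronous-round model are simplifying assumptions) showing how levels could propagate from hosts upward through periodic neighbor state exchange:

```python
# Toy topology: undirected adjacency among hosts (level 0) and switches.
# In ALIAS the same information would arrive via periodic neighbor exchanges;
# here we just iterate rounds until levels stop changing.
neighbors = {
    "h1": ["s1"], "h2": ["s2"],
    "s1": ["h1", "s3"], "s2": ["h2", "s3"],
    "s3": ["s1", "s2"],
}
hosts = {"h1", "h2"}

levels = {h: 0 for h in hosts}          # hosts bootstrap the process at level 0
changed = True
while changed:                           # one iteration ~ one exchange cycle
    changed = False
    for node in neighbors:
        if node in hosts:
            continue
        known = [levels[n] for n in neighbors[node] if n in levels]
        if not known:
            continue                     # no labeled neighbor yet; stay undecided
        proposal = min(known) + 1        # "my neighbor is level k, so I'm level k+1"
        if levels.get(node) != proposal:
            levels[node] = proposal
            changed = True

print(levels)   # e.g. {'h1': 0, 'h2': 0, 's1': 1, 's2': 1, 's3': 2}
```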
>>: If a switch doesn't have any hosts [inaudible], then it's going to think it's high in the
hierarchy rather than low [inaudible] change once [inaudible] comes up?
>> Meg Walraed-Sullivan: Yes. So that switch will be pretty much useless at the top of the
hierarchy; it will have some paths potentially over which it can relay things, but it probably wasn't
meant to be at the root of the hierarchy, so it probably won't have a ton of
connections, so it will just sit at the top of the hierarchy until a host connects to it and pulls it
downward. So what are these hypernode things and what are we trying to do with them?
Remember, labels encode paths from the root of the hierarchy down to a host, and so
multiple paths are going to lead to multiple labels and we said we needed a way to aggregate
these, to compact these. We're going to do this with hypernodes. What we’re going to do is
we're going to locate sets of switches that all ultimately reach the same hosts on a downward
path, so sets of switches that are basically interchangeable with respect to reachability moving
downwards. Just because I'm going to use this picture for several slides, I just want to point out
that we've got a four-level tree of switches here and we've got the level 0 hosts at the bottom,
which are invisible because of space constraints. So in this particular picture, as you can see
we've got two level two switches that are highlighted in orange. These two level two switches
both reach all three level one switches and therefore they both reach all of the hosts connected
to these level one switches and so with respect to reachability on the downward path, these
two switches are the same. They are interchangeable. So let's formalize this. A hypernode is
the maximal set of switches at the same level that all connect to the same hypernodes below,
and when I say connect to a hypernode below I mean via any switch in that hypernode, maybe
multiple switches in that hypernode. Of course, this is a recursive definition so we need a base
case and our base case is that every level one switch is in its own hypernode. Now I think this is
actually a tricky concept, so I'm going to go through the example in some detail. So we have
three level one switches, each in their own hypernode. Then at level two we have two
hypernodes. As you can see this switch to the left in the light blue or teal only connects to two
of the level one hypernodes, whereas the two switches on the right as we saw in the previous
slide connect to all three. That's why we have the level two switches grouped this way. If you
look at level three, we've got the orange switch all the way over to the right and that only
connects to the dark blue level two hypernodes, so it is by itself. On the other hand, we've got
the two yellow switches. Those both connect to both level two hypernodes and so they are
grouped together. Now notice that those two yellow switches actually connect to the dark blue
hypernode via different members, but they both connect to the same two hypernodes. So to
reiterate why we are doing this, remember that hypernode members are indistinguishable on
the downward path from the root. They are sets of switches that are ultimately able to get to
the same sets of hosts.
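As a rough sketch of this grouping rule (my own illustration with made-up switch names, not the ALIAS code), computing hypernodes level by level might look like the following in Python, where a level-k switch's "signature" is the set of level-(k-1) hypernodes it connects to:

```python
from collections import defaultdict

# Downward adjacency: for each switch, the lower-level switches it connects to.
# switches_at_level[k] lists the switches at level k. Names are illustrative.
switches_at_level = {1: ["l1a", "l1b", "l1c"], 2: ["l2a", "l2b", "l2c"]}
down_links = {
    "l2a": ["l1a", "l1b"],            # teal switch: reaches only two L1 switches
    "l2b": ["l1a", "l1b", "l1c"],     # orange pair: reach all three L1 switches
    "l2c": ["l1a", "l1b", "l1c"],
}

def group_hypernodes(switches_at_level, down_links, max_level):
    """Return {level: {switch: hypernode_id}} using the recursive definition:
    level-1 switches are singleton hypernodes; above that, switches sharing the
    same set of lower-level hypernodes share a hypernode."""
    hn = {1: {s: i for i, s in enumerate(switches_at_level[1])}}  # base case
    for level in range(2, max_level + 1):
        by_signature = defaultdict(list)
        for sw in switches_at_level[level]:
            signature = frozenset(hn[level - 1][child] for child in down_links[sw])
            by_signature[signature].append(sw)
        hn[level] = {}
        for i, members in enumerate(by_signature.values()):
            for sw in members:
                hn[level][sw] = i
    return hn

print(group_hypernodes(switches_at_level, down_links, 2))
# l2b and l2c share a hypernode; l2a is alone, as in the slide's example.
```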
>>: So they are interchangeable for connectivity but not for load balancing.
>> Meg Walraed-Sullivan: Yes, just for reachability. Yep. So now that we've grouped into
hypernodes our next task is to assign coordinates to switches. Now remember a coordinate is
just a value picked from a domain; for this talk it's going to be letters picked out of the English
alphabet. So let's think about what these coordinates might look like and how they are going to
enable communication. Remember we are ultimately going to use them to form labels and
those labels are ultimately going to be used to route downwards in the tree. So for example, if
we have a packet at the root of this tree and it needs to get to a host reachable by one of the
switches in the yellow hypernode, it may as well go through either. They both reach the same
set of things so we can forward through either for the purpose of reachability and will still get
to our destination. On the other hand, if we have a packet at one of the yellow switches and
it's destined for the bottom right of the tree, then it needs to go through the dark blue
hypernode. It can't go through the one on the left, that teal one, so what this tells us is we
don't need to distinguish between hypernode members. Switches in a hypernode can share a
coordinate and then ultimately share the labels made out of this coordinate. On the other
hand, if we have two hypernodes at the same level that have a parent in common, they are
going to need different coordinates because their parent is going to need to be able to
distinguish between them for sending packets downward since they reach different sets of
hosts. So this seems like a somewhat complicated problem to assign these things, so the
question is can we make it any simpler. And what we're going to do is we're going to focus on a
subset of this graph and a subset of this problem. So we're going to look at the level two
hypernodes and think about how they might assign themselves coordinates. So these level two
hypernodes are choosing coordinates. Let's call them choosers. Now they're choosing these
coordinates with the help of their parent switches, because their parents are going to say you
can't have this coordinate because my other child has it, so let's call their parents deciders. So
at this point what we've done is we've pulled out a little bit of an abstraction that shows a set
of chooser nodes connected to a set of decider nodes in some sort of bipartite graph. This may
not be a full bipartite graph, but it is a bipartite graph. What we're going to do is we're going to
formally define this abstraction, this chooser/decider business, and then we're going to write a
solution for this abstraction, and then finally we're going to map the solution back to our multi-rooted tree. To formalize, what we have is what we officially call the label selection problem or
LSP, and the label selection problem is formally defined as sets of chooser processes which are
shown in green, here connected to sets of decider processes which I've shown in red in a
bipartite graph. With an eye towards mapping back to our multi-rooted tree, these
choosers are going to correspond to hypernodes, so remember hypernodes may have multiple
switches but for now we're just collapsing that into one chooser node. Then the deciders are
going to correspond to these hypernodes’ parent switches. Now the goal, a solution to the
label selection problem is that all choosers eventually select coordinates, that we make some
progress. Also that choosers that share a decider have distinct coordinates. With an eye back
towards mapping back towards our tree, this is so that hypernodes will ultimately have distinct
coordinates when they need to, so one example of the coordinates that we might have in this
particular graph are the following. So choosers C-1, C-2, and C-3 share deciders, so they all need
different coordinates and that is shown in the example here. On the other hand, for instance,
C-1 and C-6 don't share any deciders, so they are free to have the same coordinate if they want
to. Formally we define a single instance of LSP as the set of deciders that all share the same set
of choosers. This is basically if you want to look at it this way, the set of maximal or full
bipartite graphs that are embedded in this one graph. Here we have actually three instances of
LSP in this graph. Note that a chooser can end up in multiple instances of LSP, for instance, C-4
and C-5 are both in two instances. And this begs the question of what do we do with choosers
when they are in multiple instances? Now on one hand what we could do is we could assign
them one coordinate that works across all instances. This means they only have to keep track
of one coordinate which is nice, but on the other hand if they have to have a coordinate that
works across all instances they may be competing with a few choosers from each instance and
this may give them more trouble in terms of finding coordinates that don't conflict with
someone else. Go ahead?
>>: Is there a disadvantage to having a big enough coordinate that it's, for example, a hundred
and sixty [audio begins] or even [inaudible] MAC address?
>> Meg Walraed-Sullivan: So there could be. If we could use the Mac address as a coordinate,
that would be nice. However, we're going to end up tacking coordinates together to form
labels and we would like to limit the size of the labels.
>>: Why?
>> Meg Walraed-Sullivan: Well, we might want to use them; we want to use them for
forwarding. We might use them for rewriting in a packet, different things, so we want to keep
them to a reasonable size. So on the other hand, if we didn't assign a coordinate that worked
across multiple instances, what we could do is assign a coordinate per instance. This means
that C-4 and C-5 would each have two coordinates. Now this gives us the trouble of having to
keep track of multiple coordinates, but on the other hand we're keeping the sets of switches
that might conflict with each other smaller, and ultimately keeping the coordinate domain
smaller, which will give us smaller labels. So it turns out that either of these specifications is
just fine. We've actually implemented solutions with both; we've actually looked through both
and they are fine, but there is a nice optimization as we map back to ALIAS if we do the second.
Did you have a question?
>>: If you use a single link into the system do you break the uniqueness…
>> Meg Walraed-Sullivan: Next slide.
>>: Could you move to the next slide? [laughter].
>> Meg Walraed-Sullivan: I will in one second [laughter]. We decided to go with the one
coordinate per instance just because it gives us a nice property when we map back to ALIAS.
Just to show you how that works, what that means is C-4 and C-5 are each going to need two
coordinates, one for each instance that they are in, and note that there are no constraints about
whether we have the same coordinate for both instances, whether we happen to pick different
ones, et cetera; it's just one per instance, and if they happen to be the same, no worries. All right, so
here is your slide [laughter]. At first blush this seems like a pretty simple problem to solve. In
fact, our first question was can we do this with a state machine, with Paxos? The difficulty here
is, and remember, we are formally stating the problem; we're not actually solving it here, but
one of the constraints of this problem is that connections can change. When a connection
changes it is actually going to change the instances that we have going on and this doesn't map
nicely to Paxos. So for instance, no pun intended, if I add this link between chooser C-3 and
decider D-3, what this does is it actually pulls chooser C-3 into the blue instance and then C-3
needs to find a coordinate for that instance. So the difficulty here and one of our constraints of
this problem is that the instances can change and in fact we expect them to change. I'd like to
reiterate that any solution that implements the label selection problem and its invariants is a
perfectly fine solution. There are many ways that we can do this. We designed a protocol
which we call the decider chooser protocol, just based on what we were looking for in terms of
mapping back to ALIAS. So the decider chooser protocol is a distributed algorithm that
implements LSP. It's a Las Vegas style randomized algorithm in that it's only probabilistically
fast, but when it finishes it is guaranteed to be correct. So this is in contrast with a Monte Carlo
algorithm where we finish quickly but we're not guaranteed to be absolutely perfect when
we're finished. Now we designed this decider chooser protocol with an eye towards practicality
because we are going back into the data center and we want something very practical,
something that converges quickly and doesn't use a lot of message overhead. And we also
want something that reacts quickly and locally to topology dynamics, so this issue of
connections changing. We want to make sure that we react quickly to them and we want to
make sure that a link added or changed over here in the network is not going to affect labels
over here. We definitely don't want that. So to give you an idea of how the decider chooser
protocol works, our algorithm is as follows. We have choosers select coordinates
opportunistically from their domain and send them to all neighboring deciders. Let's suppose
choosers C-1 and C-2 select X and Y respectively. They're going to send these to all of their
neighboring deciders. Now when a decider receives a request for a coordinate, if it hasn't
already promised that coordinate to someone else, it says sure, you can have that. In this case
neither of these deciders has promised anything to anyone, so they both send yeses back to
these choosers and then of course they store what they promised. Now if the decider has
already promised a coordinate to another chooser, then it says no, you can't have that
coordinate for now, I promised it to somebody else, and here's a list of hints of things that you
might want to avoid on your next choice. If the chooser gets one no from any decider, it just
selects again from its coordinate domain and tries again. Now it does this avoiding things that
have been mentioned to it in the hints from its other deciders. Once a chooser gets all yeses,
it's finished and it knows what its coordinate is, so in this case both choosers got yeses and they
are finished. Now of course, I've just shown you the very simplest case here. There are all sorts
of interesting interleavings and race conditions, and we have coordinates that are taken
while requests are in flight, or promised by a decider in a way that ultimately won't work out,
and this sort of thing. There are all sorts of complicated issues here. I've just shown you the
simple case so you can see what the protocol looks like.
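That simple case is roughly the following loop; this is a minimal single-threaded sketch in Python of the decider/chooser exchange as just described (my own simplification with made-up names, ignoring message loss, concurrent interleavings, and the distributed-chooser relay), not the Mace implementation:

```python
import random

random.seed(0)
DOMAIN = list("ABCDEFGH")              # the coordinate domain (letters, as in the talk)

class Decider:
    def __init__(self):
        self.promised = {}             # chooser -> coordinate it has been promised

    def request(self, chooser, coord):
        self.promised.pop(chooser, None)            # forget any earlier promise to this chooser
        taken = set(self.promised.values())
        if coord in taken:
            return False, taken                     # "no, and here are hints to avoid"
        self.promised[chooser] = coord
        return True, set()                          # "sure, you can have that"

def choose(chooser, deciders):
    """Retry until every neighboring decider says yes to the same coordinate."""
    hints = set()
    while True:
        coord = random.choice([c for c in DOMAIN if c not in hints])
        answers = [d.request(chooser, coord) for d in deciders]
        if all(ok for ok, _ in answers):
            return coord
        for _, hint in answers:
            hints |= hint                           # avoid these on the next attempt

d1, d2 = Decider(), Decider()
for name in ("C-1", "C-2", "C-3"):                  # three choosers sharing both deciders
    print(name, "settled on", choose(name, [d1, d2]))
```

Because the choosers here run one after another there are no races; the interleavings and in-flight promises mentioned above are exactly what the real protocol has to handle on top of this skeleton.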
Let's talk about mapping this back to our multi-rooted tree. So what we have is, at all levels in parallel, as many as needed
instances of the decider chooser protocol running on a small portion of the tree. We have
several instances at each level on smaller chunks of the tree happening all in parallel, so each
switch is really going to function as two different things. It's going to participate as a chooser
for its hypernode, and it's also going to be helping the switches below it choose their coordinates, and
it's going to be functioning as a decider. So to look at this in more detail, at level one in our
example tree we have three hypernodes, so we have three choosers, and their parent switches,
the three deciders help them choose their coordinates. Now things get a little bit trickier as we
move up the tree. Remember that all of the switches in a hypernode are going to share their
coordinate, so they all need to cooperate to figure out what that coordinate can be. The
reason for this is that each switch in the hypernode might have a different set of parents and
the parents are going to impose restrictions about what the coordinate can be based on their
other children and what coordinates are already taken. We need every single switch in the
hypernode to participate in deciding what the coordinate can be or rather what it cannot be, so
we need input from every switch in the hypernode. Now our difficulty here is that the switches
in the hypernode are not necessarily directly connected to one another, and unless we expect
some sort of full mesh of peer links at every level, we really can't expect them to be connected
to each other, so what we do is we leverage the definition of a hypernode to fix this.
Remember, a hypernode is a set of switches that all ultimately reach the same set of hosts, and
so if they reach the same set of hosts that means that there is one level one switch that
everyone in the hypernode reaches. So we select one such level one switch for each hypernode
via some deterministic function. In this case I've gone with the deterministic function of 'the one I drew
farthest to the left on the graph.' So what we do is we use that shared level one switch as a relay
between hypernode members. This allows them to communicate and share the restrictions on
their coordinate from their parents. Because this is more theoretical work and we've changed
the protocol a little bit, we need to give it a new name, so it is, surprise, the distributed-chooser
decider chooser protocol, because we've distributed the chooser across the hypernode members and
their shared level one switch. Coming back to our overview, we have overlaid hierarchy. We've grouped into
hypernodes and we've assigned coordinates that are shared among hypernode members. Our
next task is to combine these coordinates into labels. The way that we do this is actually quite
simple. We simply concatenate coordinates from the root downward to make these labels.
For instance, this maroon switch here has three paths through different hypernodes and so it
has three labels. Now to make this more clear, if we didn't have hypernodes it would have six
labels, because there would be six paths. So what does this give us? Really what we get is that
hypernodes create clusters of hosts that share prefixes in terms of their labels, so that when we
have a switch here in the network, it can actually refer to a whole group of hosts over here by
just one prefix, so this is compacting our forwarding tables in the same way that you might
expect with something like IP, except that we didn't pay for the manual configuration here. We
did this automatically.
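To illustrate the concatenation and the prefix sharing it buys, here is a small Python sketch (again my own toy example with hypothetical coordinates, not the ALIAS implementation) that builds labels by walking coordinates from the root down and shows how hosts under the same hypernodes collapse into one prefix:

```python
# Hypothetical per-hypernode coordinates along root-to-host paths (top-down).
# Each label is just the concatenation of coordinates on one downward path.
paths_to_host = {
    "H1": [["A", "d", "g"], ["B", "d", "g"]],   # two downward paths to H1
    "H2": [["A", "d", "k"], ["B", "d", "k"]],   # H2 hangs off a sibling L1 switch
}

def labels_for(host):
    return [".".join(path) for path in paths_to_host[host]]

for host in paths_to_host:
    print(host, labels_for(host))
# H1 ['A.d.g', 'B.d.g']   H2 ['A.d.k', 'B.d.k']

# A switch far away in the network does not need one entry per host:
# everything under hypernode coordinate "d" (reached via root "A") shares the
# prefix "A.d", so a single prefix entry covers both H1 and H2.
prefix_table = {"A.d": "port 3"}                 # hypothetical forwarding entry
def lookup(label, table):
    return next((port for pfx, port in table.items() if label.startswith(pfx + ".")), None)

print(lookup("A.d.g", prefix_table))             # -> port 3
print(lookup("A.d.k", prefix_table))             # -> port 3
```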
Now I would like to bring your attention briefly back to relabeling; remember, that was our no-free-lunch caveat. So when we have topology encoded in the labels,
when we have paths encoded in the labels, if the paths change we are going to have labels
changing. We call this relabeling. An example of this is if I fail this link here shown in the
dotted red, it's actually going to split that hypernode into two hypernodes. This is going to
affect the labels nearby because remember labels are built based on hypernodes. So in our
evaluation we show that not only does this converge quickly, but also that the effects are local
just to the nodes right around this failure here. Now I know I said I wouldn't talk about
communication much, but I just want--oh, go ahead.
>>: So when you showed that it's local, did you show it empirically, or did you show that there's actually a
bound on [inaudible]?
>> Meg Walraed-Sullivan: We showed it with a model checker, so we verified that it happens,
and we also did some analysis to convince ourselves.
>>: So the punchline, if you will, is that you get hierarchical labels but without manual
assignment…
>> Meg Walraed-Sullivan: Yes.
>>: So let me play the strawman alternative system and maybe you could compare what you
described [inaudible], so I'm going to run DHCP and then take an open-source DHCP server
[inaudible] bound and do some hacking on it and configure it essentially [audio begins] so that
each DHCP server doesn't just give out addresses, but it also claims portions of the address
[inaudible] DHCP server based on the hierarchy, so the top-level switches, let's say they have an
entire class A [inaudible] and then [inaudible] smaller and smaller [inaudible] distribution
[inaudible] recursively [inaudible] lowest layer you have DHCP servers that give out individual
addresses to hosts rather than dividing up, something similar, so you end up with still hierarchical
labels, fully automated, but with the advantage of also being real IP addresses, which means
that existing routing works and you would also have [inaudible] addresses [inaudible]. So I
know [inaudible]…
>> Meg Walraed-Sullivan: No problem.
>>: What would be the difference between what I just described to what you described?
>> Meg Walraed-Sullivan: This is something that I would actually like to think about further. I
have given some thought to what we could do if we distributed some sort of controller or
something among nodes, and this kind of looks like that sort of thing. My first concern would be
getting all of the DHCP servers in sync with each other and agreeing with each other, and who
tells them which portion of the address space they can have, or if they decide amongst
themselves how do they sync up with each other. This is definitely something that we've been
looking at as current and future work. There are a lot of different ways that we can kind of
distribute something centralized, but logically distributed across certain portions of the graph.
>>: One more question?
>> Meg Walraed-Sullivan: Sure.
>>: So you showed an example of a local change [inaudible] failed, but isn't it the case that the
introduction or un-failure of links can change the level in the tree of a switch? It seems
like it could be very disruptive. Is that still [inaudible]?
>> Meg Walraed-Sullivan: That's a good question. So first of all, in terms of
failing a link versus recovering a link, it turns out there actually is not much difference. When
you fail or recover a link what really matters in terms of the locality of the reaction is what
happens to the hypernode on top of that link, whether that hypernode ends up joining with the
existing one, splitting apart, that sort of thing. Now of course if you change certain links you
can change the level. If you break your last link to a host then you are going to move up in the
tree from level one.
>>: I mean a host off of the [inaudible] switch?
>> Meg Walraed-Sullivan: Right, then that is going to pull that root switch down.
>>: Doesn't that disrupt the tree globally?
>> Meg Walraed-Sullivan: It does, not globally usually unless you do something…
>>: [inaudible] log n [inaudible]?
>> Meg Walraed-Sullivan: It depends where the failure actually is and it depends on how many
nodes are below the hypernode, so with your example, if you hang a host at the root of the
tree, this is actually going to pull this root down and this could have some bad effects in terms
of generating peer links where there used to be up down paths, right? So one of the things that
ALIAS gives you that I think is really nice is this notion of topology discovery. We have several
types of flags and alerts that you can set so if you don't expect, if you didn't build the network
with many peer links and you start to see many peer links, something is wrong. This isn't what
you intended and we send an alert, so lots of these things that are going to cause disruptions
like this are going to be found and flagged immediately.
>>: So your point is that kind of change would in fact have dramatic effects, and yet you
consider that an insertion failure. It's not that all changes cause local perturbations; it's that
changes either cause local perturbations or they are an indication that you did something
horribly wrong [inaudible] effects.
>> Meg Walraed-Sullivan: That and also some of them that don't cause only local perturbations
are not going to matter, so if you pull a root switch down, then you are definitely losing paths,
but because of this nice path multiplicity, you're not losing connectivity, because there are still
several root switches that will provide that connectivity. So just to touch on the communication
that we would run over these labels, I know I promised I wouldn't talk about it but I want to
give you some context as to how we might use them.
>>: I'm just thinking about the network that you were talking about before, with the root switch
on there; if I just go and plug my laptop into the root switch because I'm debugging the
network…
>>: [laughter] don't do that. [inaudible]. [multiple speakers].
>>: Obviously, it's like the worst case, but it's something that could clearly happen. I mean…
>> Meg Walraed-Sullivan: Right, but of course we can set up our protocol, right, so that your
laptop doesn't act like a server.
>>: Well, what if there is somebody pushing a cart down and knocks out cables and plugs them
back into the wrong place…
>>: Well, but you said something interesting which is that it makes paths go away. That seems
like a strange property, that plugging the laptop into an empty port on a root switch will
eliminate--I mean I haven't removed any links.
>> Meg Walraed-Sullivan: Right.
>>: I have the same network except that I used one more port…
>> Meg Walraed-Sullivan: It doesn't make physical paths go away, but based on the structure
of the communication that we’re going to talk about and that we use, it will make some logical
paths go away, but we posit that there are still plenty of paths. But they're not actually going to
go away.
>>: It's just a weird property to have that you are losing the ability to route across certain links
because you added a host that is not doing anything, that is reading diagnostics.
>> Meg Walraed-Sullivan: So ultimately we are encoding topology into the labels and we are
restricting how we route across those labels, so we need some sort of well defined way of
structuring these labels and this is a cost that comes with that.
>>: So this is the fallout of the fact that [inaudible] that you use hosts to identify the trees…
>> Meg Walraed-Sullivan: Yes. So if you are willing to admit some sort of scheme where you
labeled nodes at particular levels, then you could make this go away, and that's not that
unreasonable a thing to request. We could say switches within a rack or something and so on.
>>: No. I just [inaudible] the overall effect [inaudible] the network [inaudible] what level to
expect.
>> Meg Walraed-Sullivan: So we wanted to opt for something automatic but you could make
this go away if you were willing to label. Just to look at our communication scheme, what we do is
actually something very similar, if you are familiar with the Portland work, to what they did for
communication. So as a host sends a packet, at the ingress switch at the bottom of the
network we actually intercept that packet; we perform a proxy ARP mechanism that we've
implemented to resolve the destination MAC address to an ALIAS label, to one of the ALIAS labels for
the destination. Then we actually rewrite the MAC address in the packet with the ALIAS label.
Now I'm sure you're thinking right now, this means we have a limited size for the labels in this
particular communications scheme. There are also other schemes. So what we do then is we
forward the packet upwards and then across and then downwards in the network, if we choose
to use any cross peer links. This is based on the up*/down* forwarding that was
introduced by Autonet. And then when the packet reaches the egress switch at the end of its
path, the egress switch goes ahead and swaps the ALIAS label back out
for the destination MAC address, so the end hosts don't know that anything happened. If you
aren't willing to rewrite MAC addresses in packets, or if you didn't want to have fixed-length
labels, then you could use encapsulation or tunneling or some other way to get these
packets through the network. This is just to give you an idea of how you might implement this.
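As a rough sketch of that ingress/egress rewriting (a toy illustration with hypothetical table names; the real system rewrites the MAC field in forwarded packets and uses Portland-style up*/down* forwarding), the data path could look something like this in Python:

```python
# Hypothetical state at the edge switches; in practice this comes from the
# proxy-ARP service and from the prefix-based forwarding tables ALIAS builds.
alias_labels = {"52:54:00:aa:bb:01": "A.d.g.7"}      # dst MAC -> one ALIAS label
mac_for_label = {"A.d.g.7": "52:54:00:aa:bb:01"}     # reverse map at the egress

def ingress(packet):
    """Ingress edge switch: swap the destination MAC for an ALIAS label."""
    label = alias_labels[packet["dst"]]              # proxy-ARP style resolution
    return {**packet, "dst": label}

def egress(packet):
    """Egress edge switch: restore the real MAC so hosts never see labels."""
    return {**packet, "dst": mac_for_label[packet["dst"]]}

def forward(label, prefix_table):
    """Downward hop: longest matching label prefix picks the output port."""
    matches = [p for p in prefix_table if label == p or label.startswith(p + ".")]
    return prefix_table[max(matches, key=len)] if matches else "up"  # else keep going up*

pkt = ingress({"dst": "52:54:00:aa:bb:01", "payload": "hi"})
print(forward(pkt["dst"], {"A.d": "port 3", "A": "port 1"}))   # -> port 3
print(egress(pkt)["dst"])                                      # original MAC restored
```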
So now I would like to tell you about how we evaluated this. So again, we kind of approached it
from two sides, from the implementation deployment evaluation side as well as the prove it
and verify it side, but ultimately our goal was to verify that this thing is correct, that it's feasible
and that it is scalable. That's really what we care about. So I'm going to talk about correctness
first, because if it isn't correct it doesn't really matter if it's scalable. So our questions that we
wanted to answer as far as correctness were, first of all is ALIAS doing what we said it does and
does it enable communication. So to figure this out we implemented ALIAS in Mace which is a
language for distributed systems development. So you basically specify things like a state
machine, so if I receive this message, I send the following messages out. If a timer fires, I do
this. Now Mace is a fantastic language for doing this kind of development because it comes
with a model checker. If anyone isn't familiar with Mace and wants to be, please come talk to
me because I would love to tell you about it. What we did with the Mace model checker is we
verified first of all that the protocol was doing what I said it would, that I didn't write
tremendously buggy code. Then we verified that the overlying communication works on ALIAS
labels, that two nodes that are physically connected are in fact logically connected and can
communicate. This is actually a great use of the model checker because it turns out that there
are some strange graphs that we found with the model checker where communication didn't
work and it was based on an assumption that we had made incorrectly early on in the protocol
when we were designing it. So again, model checker is great because you can find these sorts
of things. The last thing that we verified was the locality of failure reactions, so making sure
that this invariant of if I fail a link here, no one over here changes, just making sure that holds.
Then because Mace is a simulated environment and we never trust these things, we ported it to
our test bed at UCSD. Now we used the test bed that was set up for Portland, so it already had
this up*/down* forwarding set up, and so all we had to do was make sure that our labels
worked with an existing communications scheme. And they did work, and this gave us a way to
sanity check our Mace simulations. The other things we wanted to look at in terms of
correctness were first of all does the decider chooser protocol really implement LSP?
Remember I said anything that solves it is fine, so did we come up with something that actually
solves it? So we verified this in two ways. First we wrote a proof and then second what we did
is we implemented all of the different flavors of the decider chooser protocol in Mace, so
having one coordinate across multiple instances, having multiple coordinates, distributing
the chooser, et cetera, and we made sure that the invariants that LSP has were held: so
progress, everybody eventually gets a coordinate, and distinctness, this notion that if you share a
parent you can't have the same coordinate, so we made sure that these things held. The next
thing we wanted to check was did we really pull out a reasonable abstraction with LSP or did we
just look at something random? Was this a good place to start? So what we did is a formal
protocol derivation from the very basic decider chooser protocol all the way to ALIAS. What I
mean by this is we started with a very, very basic decider chooser protocol. It's not that many
lines of code. And then we wrote some invariants about what has to be true about the system.
Then we did a series of small mechanical transformations to the code where at each step we
proved that the invariants still held. Ultimately we did the right series of mechanical
transformations such that we ended up with the full distributed version of this operating in
parallel at every level. So this convinced us that this is actually a reasonable abstraction to have
pulled out.
>>: How did you do the mechanical transformations?
>> Meg Walraed-Sullivan: So this is an analysis process. So for instance, to go from
the non-distributed chooser to the distributed chooser, we formally said where we would host
each bit of code. We defined queues that would actually later be based on message passing, et
cetera, so it's actually like a grab and replace where we made strategic choices about what we
would replace and made sure that we didn't break any of our invariants. The next thing we
wanted to check was feasibility because if we implemented something that is never going to
run on real switches, then it really again, doesn't make sense to use it. It doesn't matter if it is
scalable. So first we looked at overhead in terms of storage and control. Now storage, I don't
mean forwarding tables, I just mean the actual memory on the switch to run the protocol. We
just wanted to make sure that we weren't going to overwhelm these switches. We looked at
our memory requirements and they were quite reasonable for large networks. We looked at
this both analytically and on our test bed. The next thing we looked at was control overhead
and with control overhead we have this trade-off between overhead and agility, so this is based
on both the size of the switches and the frequency with which we exchanged state with our
neighbors. If we have nodes exchanging state very frequently, then of course when something
changes we are going to converge more quickly. On the other hand, we’re going to pay the cost
in having more things exchanged more frequently, more control overhead. What I gave here
was some representative topologies, some different sizes, and what the control overhead might
look like for these. The reason I say 3+ in terms of depth and these 65K plus and so on in terms
of hosts, is because this number doesn't depend on the depth of the tree, so this would work
for any size tree with this size switches. This column here that I've got, this is worst case burst
exchange. This is the absolute worst case that we could ever expect to see. This is if we have
one level one switch acting as the shared relay for every single hypernode in the topology and
managing everything and if we have every single switch come up at once, then this is what it's
going to have to send once for this topology, so this is a pretty worst-case behavior. Then I
have a few different cycle times listed and just showing you what the control overhead would
actually be corresponding to the cycles. And we actually think this is quite reasonable, given it's
a worst case, in terms of how much of a 10G link it would take up. The next thing
we looked at was convergence time. Is this protocol really practical? What I want you to do for
a second is suspend disbelief about the decider chooser protocol. Let's just say that everybody
magically picks coordinates that work and that there are no conflicts. And let's look at what the
base case convergence for ALIAS would be in this case. So on one hand to measure
convergence time we just measure it on our topology. We perturb something, we start it up
and then we look at the clock. We found that our convergence time was what we expected
from our analytical results, which I've shown here. Now in this case our analytical results only
depend on the depth of the tree, not the size of the switches. And essentially, our convergence
time is going to be based on two trips up and down the tree. This is one to get the level set up
and one to get all of the hypernodes and coordinates set up. Now of course we can probably
expect it to be less than this because there is going to be some interleaving of these cycles that
happens, but just to give you an idea, here are some example base cases for some topologies
and I've shown you some cycle times that are corresponding to nice control overhead
properties from the previous slide. So now you can un-suspend your disbelief about the
collisions and the decider chooser protocol and let's talk about how bad that is. Remember I
said that the decider chooser protocol was probabilistically fast? What does that mean? Is that
really reasonable? Well, the way that we decided to look at this, we first looked at it analytically. It
turns out that there is a very complicated relationship between the coordinate domain, the
thickness of the bipartite graph, and so on when you do this analytically, so we moved
away from that, since we have numbers for that but they don't really convince us of anything,
and we moved to an implementation using the Mace simulator. Now the Mace simulator is a
lot like the model checker, but it actually runs executions in different orders and it picks a little
bit differently about how it represents non-determinism. The beauty of the simulator is it
actually allows you to log things. You can actually say X cycles have passed and this property is
turned to true. What we did was we built three types of graphs. The first was based on a fat
tree, so we took two levels out of the fat tree to represent the decider chooser problem. The
second was a random bipartite graph and the third was a complete bipartite graph. So as you
can imagine a complete bipartite graph is the absolute worst case scenario because we have
many more choosers competing with each other than we would expect, and they are
competing across many deciders so each chooser is going to collide with every other chooser
and is going to be told so by every other decider. The next thing that we varied was the
coordinate domains, so we tried things where the coordinate domain was exactly the minimum
number that it could possibly be for that graph. We tried things where it was a little bit bigger
than that, 1.5 times or two times the number of choosers. So these are
still pretty reasonable coordinate domains, because remember in practice we expect instances
of the decider chooser protocol to be these small chunks of the graph. And what we found was
that for reasonable cases this thing converges quite quickly, two or three cycles on average.
This is including the cycles that I mentioned in our base case for establishing connectivity
between the two levels. Now the worst case was the complete bipartite graph with the
smallest coordinate domain possible. And the reason for this was that the simulator decided to
generate some interesting behavior for me, much to my dismay. So at one point we had one
chooser for which every single response from every single decider was lost 89 times, so it took
89 cycles to converge because the chooser couldn't get any information about what its
coordinate could be, but when it finally heard back from all of the deciders and heard that you
can't have any number but X, it took X and it moved on. So of course, this is not a very realistic
scenario, but this is where we see the stragglers when the Mace simulator decided to give us
some sort of crazy behavior.
>>: What do you think the failure rates for the components [inaudible] like what you think the
implications are for [inaudible] protocol?
>> Meg Walraed-Sullivan: I would love to know. Unfortunately I don't have a data center to
look at, but I'm told that actually link failure is a pretty big problem; it's very common. Links
flap, they come back, they fail and I'm told that it is a pretty significant problem. That's why we
designed this protocol both with one of our big constraints in LSP being that we need to be able
to deal with the instances changing and one of our big constraints overall saying that we need
to be able to react quickly and locally. We definitely think that it's a real problem, not
something that's going to be intermittent. Now the scalability.
>>: It looks like you designed the protocol assuming that people would be laying out their
networks in a treelike topology [inaudible] unfair to…
>> Meg Walraed-Sullivan: Good question, good question. So ALIAS will work over any
topology. Of course it's going to be ridiculous if you try to overlay a tree on a ring. So it really
does work best over something that can be hierarchical. Now we do see multi-rooted trees
pretty often in the data center. I know that that is not always the case, especially in this room,
but we do tend to see them. We are told by network operators that that is often the structure,
so we designed a protocol that would work for them. If you have a data center that is
structured in a different way, probably ALIAS is not the right choice. So scalability. Does this
thing really scale to these giant, giant data centers that I mentioned? Well, our first blush at
making sure it was scalable was to just model check large topologies. The reason for this was
that sometimes as I'm sure you all know, when we scale out a protocol weird things happen
that we didn't expect. So we just used the model checker to make sure that nothing changed at
scale and nothing went weird. Then because the model checker only scales to so many nodes,
we wrote some simulation code to analyze network behavior for absolutely enormous
networks, networks that are more realistic in scale. And so what the simulation code did was it
laid out random topologies, and then it set up the hypernodes. It figured out what the
coordinates could be, assigned forwarding state and then looked at the forwarding state in
each router.
>>: How large was the model checker?
>> Meg Walraed-Sullivan: The model checker, only up to a couple of hundred nodes, maybe 200. Not enough to convince us that this thing was scalable. The simulations went up to tens of thousands, sometimes hundreds of thousands, but I don't display the numbers for hundreds of thousands because we didn't run enough trials to be sure; it took hours and hours and hours to
run. So here are the results of what we simulated. So let me explain this chart. On the left we
have the number of levels in the tree and the number of ports per switch. And so what we did
is we first started with a perfect fat tree of this size. Then in this percent fully provisioned
column what I did is I failed a portion of this fat tree. Now you may wonder why I failed the fat
tree. The reason for this is that in a perfect fat tree, the hypernodes are going to be perfect.
Everything is going to be aggregated really nicely. What we wanted to test was, as we get away from this nice perfect topology and as we start introducing failures, are the hypernodes still really
doing their job? Are they aggregating, are we still going to get this compact forwarding state
that we wanted? Now the next column shows the number of servers that each topology
supports and what this gives us is a very worst-case comparison to using layer 2 addresses, so
this is not a fair comparison, but it just gives us a base case, the very worst-case scenario. So if
we had layer 2 addresses, we would need this many entries in our forwarding table. Of course
the last column gives us the number of forwarding entries on average in each switch in ALIAS.
So as you can see, orders of magnitude lower than the worst case at layer 2. Now of course we could expect to see similar numbers for something like IP if we cleverly partitioned the space and everything, but then we would have to pay the manual overhead, the manual configuration cost, and here we didn't have to. So what we see is we get forwarding state that
is like IP without paying the cost associated with IP. So to conclude, oh, go ahead.
>>: So did you actually build to order [inaudible] understands peer addresses?
>> Meg Walraed-Sullivan: Yes, we did.
>>: How difficult was that?
>> Meg Walraed-Sullivan: We did it both in Mace and we did it using our test bed that we had
already set up for Portland. Portland assigns addresses that look a lot like this but in a
centralized way, so we actually already had that forwarder set up. Unfortunately, the performance is pretty hard to measure because we did it with OpenFlow, and updating flow tables is pretty slow, so the speed of updating flow tables ended up being the limiting factor in our evaluation.
>>: Okay so, I guess I want to bring you back to what I asked you earlier on.
>> Meg Walraed-Sullivan: Sure.
>>: You introduced a lot of complexity around finding noncompliant addresses or labels. I guess I would be more convinced that you couldn't just use big random labels if you actually
showed that there was something that broke. But it sounds like if you actually built a forwarder
what would happen [inaudible] would the forwarder actually fall over if you used the 25 labels, maybe
just used random [inaudible] designs [inaudible]?
>> Meg Walraed-Sullivan: Sure, another idea would just be to carry Mac addresses down the
tree and maybe have someone in the hypernode be the leader whose Mac address got to
represent that hypernode. So this is definitely something that we could consider, a lot less
complexity. The problem is that we do run into these long addresses and what this does is it
makes our entries in our TCAMs for forwarding longer, makes them take up more space, and I
think…
>>: [inaudible] hundreds of them and why does that matter?
>> Meg Walraed-Sullivan: Well, if we are scaling out to the size of the data center, we may
have a lot. We may be able to fit and we may not. So it's already a bit of a known issue that we
are running out of TCAM space and TCAMs are expensive, especially in commodity switches.
Now if we're talking about using bigger switches then this is a nonissue and I would definitely
go with something like what you're saying, but we're trying to buy these cheap commodity
switches and we can't expect them to have a ton of space for forwarding state, and so if we
have this sort of prefix matching business going on and we have really long labels, we're actually going to be taking up multiple lines in the TCAM with each entry. So if we can afford the forwarding state, then I'm all for it, but for the purposes of this work we are looking at keeping the forwarding state small enough that we can afford it.
>>: I'm sorry one more. [inaudible] switch obviously because it has a [inaudible] routing
algorithm that's [inaudible] so…
>> Meg Walraed-Sullivan: Right, but then we do have things like OpenFlow that give us the ability to modify the routing software on the switch.
>>: But then you have a component problem which is that you can't actually--every single
packet now is on the slow path.
>> Meg Walraed-Sullivan: How so?
>>: Or at least every flow is on the slow path.
>> Meg Walraed-Sullivan: Every flow is on the slow path, yes, and there has been work to address this and to make things like OpenFlow faster. But if we are going to do this, we can upgrade the firmware; we can use something like OpenFlow. So we feel that it's okay to modify the switch software, probably not the hardware.
>>: Okay. So do you have a graph showing that Jeremy's scheme only achieves this number of servers and yours achieves--I don't have a sense of how many times more machines you can get.
>> Meg Walraed-Sullivan: I don't. That's something that I really should do for future work and
that's definitely something that we should talk about later if we can. So to conclude, hopefully
I've convinced you that the scale and complexity of data centers is what makes them interesting
and that the problem of labeling in them is interesting because of these properties. I've shown
you the ALIAS protocol today which does topology discovery and label assignment and it does it
by using a decentralized approach, so we avoid this out-of-band control network. It leverages the nice hierarchy in the topology: even though we have switches and links failing, we still have this hierarchy, and ALIAS leverages it to form topologically significant labels. And then of course it eliminates the manual configuration that we would have with something like IP, while still giving us scalable forwarding state. So I would be happy to take more questions. [applause].
>>: Question, very interesting talk. How, in this work, can you incorporate something like load balancing into the network?
>> Meg Walraed-Sullivan: That is a very good question, because what we're really doing with a label is restricting traffic to a set of paths, and it's difficult to do something like global load balancing on top of this. This is one of the topics that we are actively looking at right now: how do we get this to play nicely with load-balancing? So along those lines, I think it's important to understand that there are actually two levels of multi-path within ALIAS. The first is picking a label: if you have multiple labels, then each label corresponds to a set of paths. And then within a label, we still have this ability for multi-path. So this relates to load-balancing because we have a set of paths that we can possibly load balance between, and so this is something that we are actively looking at. Can we select the best label? Can we take a best label, once selected, and then alias other labels, no pun intended, to this label so that we can use all of its paths? This is definitely something that we are looking at.
>>: And also a related question: when you need to incorporate load balancing into [inaudible], does that include memory consumption on the switches, meaning basically you need to record multiple entries onto this [inaudible] because then you can decide which entry to choose when [inaudible]?
>> Meg Walraed-Sullivan: So I'm not sure I see what the question is in that.
>>: So the question is basically, in this framework, for two destinations you potentially have maybe [inaudible] multiple [inaudible] in your TCAM; for entries, do you need to record multiple [inaudible] or [inaudible] basically just one line?
>> Meg Walraed-Sullivan: Without load-balancing, it's just one line. Because what we need to know is, for local things, the address that goes through this particular switch to get to them; and for faraway things we just need to know their top-level hypernode and how we can route up into the tree to get to something that reaches that top-level hypernode. Of course with load-balancing, this story could change.
>>: So when you [inaudible] few things [inaudible] when you've got 100,000 machines, how
many labels do you wind up with?
>> Meg Walraed-Sullivan: Not too many. Not too many. For example, you can see here that
we end up with each switch actually needing to know, are you talking about how many labels
end up per switch or…
>>: No, I meant per host; how many labels for a host?
>> Meg Walraed-Sullivan: This depends on how broken the tree is. As you start to break up
hypernodes, as you start to fail more of the tree, then you are breaking into more labels. On
the other hand, as you break the tree further you are actually breaking paths, and then the number of labels goes down because there are fewer ways to reach things. So the base case is, if you have a fat tree, you are going to have one label per host. We did lots of different topologies. Not all of them are realistic, because the model checker decides to fail what the model checker decides to fail. But we didn't see too many labels per host in our topologies; on the model checker, which was smaller, a couple of hundred nodes, we saw four or five labels per host in the worst case. These were really fragmented parts of the tree. I mean, the model
checker will generate something that looks like a number of just strings coming down that have
no cross connects.
>>: And then I'm wondering how you imagine integrating this into any operating system so
[inaudible] place where [inaudible] IP addresses are exposed like this is the guy who sends you
a packet and he needs something [inaudible] labels on, is that, am I imagining that correctly?
>> Meg Walraed-Sullivan: So the end, you mean the operating system on the end host? The end host doesn't know anything happened. Because we have this nice kind of proxy ARP scheme set up in our communication strategy anyway, all we need is a unique ID for a node; we resolve that to an ALIAS label, forward through the network, and then rewrite back into what that unique ID was. So you can imagine a scheme where we did this on IP addresses, although then you have the overhead of assigning them. My example is with Mac addresses, but in fact you could do this with anything. In fact, I think the interesting thing here is that you could do some sort of anycast type thing if you were willing to modify the end host: you could send a packet to "printer" and then your ingress switch could go ahead and resolve "printer" to a number of ALIAS labels, maybe corresponding to different nodes, and then pick one of those and send towards it based on whatever constraints.
>>: So where does the transition happen?
>> Meg Walraed-Sullivan: Where does the transition happen?
>>: The translation from IP or Mac address to…
>> Meg Walraed-Sullivan: At the bottom level switch where we are connected to an end host.
>>: So that switch, how does it, does it keep all of the labels?
>> Meg Walraed-Sullivan: No, we did a proxy ARP mechanism. Because we have to pass things up the tree anyway in terms of setting up hypernodes and so on, we actually pass mappings between ALIAS labels and whatever our UIDs are going to be, in this case Mac addresses, up the tree, so the roots of the tree all know the mappings. Then we have the level one switch ask the roots of the tree by sending something upwards like an ARP request; it actually intercepts real ARP requests from the host, sends these upwards, and gets an answer back down. So there are a number of ways that we could write this, and it's just some kind of replacement for ARP.
>>: Could you encode a [inaudible] label as a [inaudible]?
>> Meg Walraed-Sullivan: I'm sorry?
>>: Could you turn the label for an end host, translate it in some way directly to a [inaudible]
like a [inaudible] address [inaudible]?
>> Meg Walraed-Sullivan: I would think so. This isn't something I considered, but I think it would work as long as it fit in the space that we were trying to use and as long as it didn't have any sort of conflicts. If this were visible externally, then this would be a problem, but otherwise I don't see why not. That being said, I reserve the right to see a reason not to in the future.
>> John (JD) Douceur: Anyone else? All right, let's thank the speaker again. [applause].
>> Meg Walraed-Sullivan: Thank you.