in the past few weeks. And actually Victor's intro was perfect because this
particular piece of work was born out of some issues that I grappled with when I
was at AT&T, and I had been wondering about them for a long time, and I
hooked up at Princeton with Mung Chiang who is an optimization theory guy in
our electrical engineering department and this work is done in conjunction with
him looking at some theoretical techniques that he has for creating network
protocols that are implicitly solutions to optimization problems.
And so what I want to talk about first, though, is in the networking research
community in the past few years there has been a kind of buzzword that if you're in the
networking community you've heard more times than you can count about clean
slate network architecture. And I wanted to dwell on this for a moment because
this talk is really about that topic. So what does that mean?
So network architecture is more than tuning a particular protocol in the Internet or
tweaking its performance. It's really about the basic definition and placement of
function, what things should be done by the routers and switches, what should be
done by the end host, what should be done by automated management systems
or human operators?
So I want to in this talk revisit the traditional definition and placement of function for a
particular network task that I'm going to call traffic management. So that's what I
mean by network architecture.
So what does clean slate network architecture mean? Well, it means thinking
about this problem without the constraints of today's artifacts. So what does that
mean? It doesn't mean ignoring the speed of light or cost of memory, but it
means not worrying about some niggly little detail in some existing IETF standard
that often, when I was at AT&T, would constrain the way I was allowed to think
about a problem if I wanted it to be used in practice.
So why is that useful? Well, part of the goal is to build us a stronger intellectual
foundation for designing, deploying and managing networks by being able to
think from scratch about how protocols should be designed and used.
And there's some belief that -- to make the Internet better we may even have to
change the Internet architecture in somewhat fundamental ways and so there's
still an open question about whether this is more than just an intellectual exercise
but might in some cases actually help us make the Internet better in a more
practical way.
So that's all well and good, and people have been talking about this for a
while. But one of the things that I found perplexing about it is it's not really clear how
one does this. How does one do it? Okay, I decided I wanted to be a clean slate
network architect. What do I do? What do I do when I start my day?
And so I've been talking a lot with Mung Chiang about what to do, and he has a --
he and a number of others in his area, Frank Kelly, Steven Low, et cetera, have a
real interesting body of techniques for viewing distributed algorithms, protocols,
as implicit solutions to optimization problems. And these are algorithms that in
fact the math tells us how to derive.
And so you could do this talk really two ways, and I'm going to
do it primarily the first way, which is I'm going to walk you through the design
process that we went through in solving a problem that I had worked on from a
much more bottom up perspective for a number of years when I was at AT&T. If
you're someone that likes the journey more than the outcome, this is a talk for
you. If you really are interested in the outcome, I will get to that, but it's going to be
somewhat superficially treated because I'm really trying to walk you more through
a design process than I am preoccupied with the particular outcome we get.
So we're going to look at traffic management. So what do I mean by that? I
mean that what is it that allows us to compute for each path through the network,
between every pair of nodes, what bit rate I'm going to send traffic on that path.
And so I define it that way because it encompasses most of the major resource
allocation issues that the networking community thinks about, routing, the
computing of the paths, congestion control, the adapting of the sending rates of
the sources, traffic engineering, the tuning of routing by the management
systems to decide which paths that traffic should go on. And so what I'm going to
talk about is going to encompass those three topics but it's going to revisit what
division of labor we should have between the routers, the end host and the
network operators in solving that problem.
So why do I want to look at this? It's a pretty broad topic that is pretty basic in
networking. But frankly, more importantly, it's at least something I think there's
some hope in getting traction on mathematically. And so it seemed like a good
place for us to start our activity. And why can we get traction on this
mathematically? Well there's been really lovely work by Steven Low and Frank
Kelly on reverse engineering TCP congestion control that shows it's implicitly
maximizing user utility, summed over all the users on the Internet.
People have recently made some great progress doing forward engineering
using these same techniques, designing new variants of TCP, like FAST TCP in
Steven Low's group. And also people have used optimization theory to tune the
existing protocols, the existing router, the existing routing protocols. And in fact
work I was involved in at AT&T took this approach. We assumed the protocols
are given, but they offer us knobs that we can use optimization theory to tune.
So it's an important problem, pretty broad in networking, with at least some hope that
we'll make progress mathematically. And yet a problem that's not solved
enough that it's a waste of time for us to work on it; it's still an area where there's
not really a good holistic view. And part of the reason for that is traffic
management in the Internet -- we arrived at it in a fairly ad hoc
fashion, which is true for most protocols in the Internet. So what do we do
today? Well, the routers, they talk to one another and they compute paths
through the network dynamically based on the current topology. The end hosts,
run congestion control algorithms that increase and decrease their sending
rates in response to congestion on the paths they're using, and the network
administrator is sitting on high, or automated systems acting on his behalf, to tinker
under the hood with the configuration of the individual routers to coax them into
computing different paths so that when the users do their thing, the traffic flows
more effectively through the network.
And as I mentioned before, this evolved really organically over time without
much conscious design. So why is that? Well, you always needed the routers to
compute paths. That's been pretty basic from the very beginning. Congestion
control was added to the Internet in the late '80s when there were starting to be fears
of congestion collapse as the Internet became more popular. And traffic
engineering became important in this sort of early to mid '90s when the Internet
became commercial and ISPs started to see significant growth in Internet traffic
with the emergence of the web and had commercial motivations to have
performance be good and the links to be used efficiently.
So these three things are running completely independently. The vendors and
the IETF dictate the routing protocols, the IETF and the end host dictate
congestion control, and the network administrators in each ISP dictate the
traffic engineering.
So as you might expect, this works -- I'm using the network, and probably several of
you are right now in fact -- but it doesn't necessarily work well. And so what's wrong
with that? Well, the interactions between these protocols and practices are not
captured. Congestion control assumes routing isn't changing. Traffic
engineering assumes the offered load is inelastic when in fact we know it's
fluctuating in response to congestion. Traffic engineering itself, given today's
routing protocols is complicated. It's an NP-hard problem to tune the link
weights that the routers use to compute shortest paths, to do traffic engineering.
And every time an adjustment is made in the network, the routers spend a bit of
time talking amongst themselves, leading to transient disruption. So people tend
not to adapt the routing of the traffic on all that fine a time scale out of fear of
causing transient disruptions.
And finally, even though the topologies have multiple paths, they're used only in
very primitive ways. Most of the Internet protocols make very limited use of
multiple paths, at best equal-cost multipath.
So in light of all these things, the question that Mung and I were curious about is
if you could start over knowing all these things we know now, what would you do
differently. And that's essentially what this talk is about. And I'm going to be
fairly deliberative about process here because the process we went through for
designing the protocols is really I think the part that's more interesting than the
protocol itself. So first thing we're going to do is formulate the problem we think
the protocol should be solving. Sounds basic but you'd be amazed how many
protocols in the Internet we still don't know what problem they're actually trying to
solve, and we're not even sure they solve them.
So we're going to formulate a problem. I'm going to draw inspiration from today's
traffic engineering practices and congestion control protocols. I'm going to use
Mung's magic optimization theory stuff to derive distributed algorithms that are
provably stable and optimal realizations of that optimization problem. So that's
where Mung is very happy but I'm not because the theory only tells us that the
thing is stable and optimal, not how quickly it's going to converge or how sensitive
the protocol is to tunable parameters. So I'm not happy. So we're going to run
a bunch of simulations to understand how well these, in fact four, protocols that
are all provably stable and optimal perform. Because some are better than others in
practice. Simulations will tell us that some are better than others and in fact we
don't like any of them. But each of them have some ingredients that we'd like to
cherry pick.
And so I'm going to use a little bit of human intuition to pull the best features of
each of these protocols and design a new protocol that is not provably stable or
optimal or it is only provably stable and optimal under a narrower set of
conditions. So it's not an optimization decomposition, although we can at least
say some things about it. But it actually is much simpler and much better in
practice.
So you know when you watch a movie and it says it was inspired by real events,
this is the place in the talk where that happens. Where the theory inspired us to
do what we did, but we started to deviate from the theory.
Finally even that algorithm is extremely abstract. Traffic is fluid, feedback delays
are constant. So we're going to translate that into a packet based protocol that
we can actually simulate in detail on a packet level simulator and make sure that
the properties of the protocol that we think we've inherited from the theory are still
there when it becomes a real protocol. That's the story of the talk.
Okay. So the most important thing that we're going to do is the beginning which
is what problem do we want this protocol to solve? And so let's just take a moment to
look at what TCP does and what network operators do to get inspiration for what
things we'd like the combined system to do. So with TCP there's a bunch of source
destination pairs in the Internet, index them by I, and they implicitly
solve, as shown by Frank Kelly and Steven Low and others, a utility maximization
problem. So essentially it's solving in a distributed fashion the problem of maximizing the
aggregate utility subject to the fact that the routes, combined with the traffic that goes
on those routes, have to make sure no links are above capacity. And the variable
is the sending rate that the source is sending its traffic at.
So all the different variants of TCP have a slightly different definition of utility but
they all have the same basic property that as you send more you're happier, but
there are diminishing returns. And there are a whole bunch of families of
fairness definitions that come from economics; there's sort of log utility functions.
All the different major TCP variants have been traced back to some
definition of the utility function that looks like this. And essentially what that utility
function represents is some notion of user satisfaction where more is better but
it's real important to get more if you don't have much to begin with. And in fact
these applications we run on the Internet are elastic. We can take advantage of
the extra bandwidth when it's available, and we can still function when it's not.
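The utility maximization problem described here can be written out roughly as follows. The symbols are assumptions chosen to match the notation spoken later in the talk rather than copied from the slides: x_i is the sending rate of source-destination pair i, U_i is its utility function, R_{li} is 1 when link l is on the route of flow i and 0 otherwise, and c_l is the capacity of link l.

```latex
\begin{aligned}
\text{maximize}\quad   & \sum_i U_i(x_i) \\
\text{subject to}\quad & \sum_i R_{li}\, x_i \le c_l \quad \text{for every link } l, \\
\text{variables}\quad  & x_i \ge 0 .
\end{aligned}
```

A log utility, U_i(x_i) = \log x_i, is one example of a function that increases with diminishing returns in the way described.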
Okay. So that's the problem TCP solves. Yeah?
>>: Are you looking at throughput as the metric that you care about in some
>> Jennifer Rexford: Yeah. That's a great question. Yes. And in fact in a
separate work that we did as a follow on to this, we looked at delay instead and
tried to have a minimization of aggregate delay. And we end up with a different
protocol but it goes through exactly the same design process to get at that
>>: [inaudible] users, right? How does that --
>> Jennifer Rexford: Yeah. And then we looked at what do you do if you have a
weighted sum? And essentially the same theory can be used to help us solve
that problem, too. So I'll touch on that really briefly at the end. But you could
view this as just an example. And you're totally right, we chose throughput
here, and if you chose something else, you'd end up with a completely different protocol
with similar structure but the details would be totally different.
Any other questions?
So that is the congestion control piece. So what do network operators do
today? Well, they look at their network. And they don't want the links to be
congested. And so they're trying to minimize some sum of congestion over all
the links. And typically that congestion function looks something like a queuing
delay formula, where they're pretty happy as long as the links are below 30, 40,
50 percent utilized and get increasingly unhappy when the link approaches
capacity or exceeds it. And they need to do that in such a way that the link's
utilization is a sum of all the traffic imposed on routes that traverse that link
divided by the capacity.
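Written out, the operator's problem as described looks roughly like the sketch below. The notation is an assumption that follows the spoken symbols: u_l is the utilization of link l, f is the convex congestion function, and R_{li}, x_i, and c_l are the routing fraction, offered rate, and link capacity.

```latex
\begin{aligned}
\text{minimize}\quad   & \sum_l f(u_l) \\
\text{subject to}\quad & u_l = \frac{\sum_i R_{li}\, x_i}{c_l} \quad \text{for every link } l, \\
\text{variables}\quad  & R_{li} \quad (\text{the rates } x_i \text{ are given}).
\end{aligned}
```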
So unlike in the previous problem where the source rate was our variable, here
it's a given. The traffic has been measured; it's the offered traffic as far as this
network knows. And in fact the routing is the variable. Whereas in the other
problem, the reverse was true. Okay? So the goal here then is to have
a cost function that's the penalty for approaching link capacity. So why do we
care about this? Because the network is not terribly robust when its links are
operating very near capacity. A small burst of arrival of new traffic from a new
source will cause the network to go over a cliff and drop tons of packets on the
floor.
And so while the user is driving the network into being as heavily loaded as
possible by sending as much as you can, the network operator is saying whoa
and holding back, trying to keep the links from being over[inaudible]. Yeah?
>>: [inaudible] same one from the previous slide then in that [inaudible] right?
>> Jennifer Rexford: You want what, I'm sorry?
>>: [inaudible] you want to see L, and from the previous slide then you want at
most one, right?
>> Jennifer Rexford: In theory, yes, but if your routing has mismatched the
traffic, you might have utilization over capacity.
>>: So [inaudible] explain quickly what is the meaning for CL and CY?
>> Jennifer Rexford: So CL is the capacity of link L, and R_li x_i is the load put on
link L by just that one flow that's traversing link L. And then I'm going to sum up
over all flows that traverse link L. And in theory, it can't be above 100 percent,
but I'm going to allow it to be over 100 percent. You could sort of think of the
parts above 100 percent as getting dropped on the floor because the link
can't carry them. And that's why I allowed this function to shoot up to near
infinity, but to still have a meaning when it is in excess of 1.
>>: So [inaudible].
>> Jennifer Rexford: So L is a link, and every flow I will traverse a sequence
of links. If R_li is one, that link is on the path, and if it's zero, that link isn't on the
path.
>>: So your sum over all I but only if those relate to L?
>> Jennifer Rexford: Well, exactly. But you could think of R_li as 0 for
the links that don't -- that are not on the path traversed by L.
>>: By I?
>> Jennifer Rexford: By I. Exactly. And if it's multipath then R_li might be 50-50
you know on the two paths for example that carry traffic for flow I. Yeah.
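To make the utilization definition from this exchange concrete, here is a minimal sketch. The function names, the example numbers, and the choice of exp(u) as the congestion function are all assumptions for illustration; the talk only says the function resembles a queuing delay curve and stays defined above 100 percent utilization.

```python
import math


def link_utilization(R, x, c):
    """Return u_l = (sum_i R[l][i] * x[i]) / c[l] for every link l."""
    return [
        sum(R_l[i] * x[i] for i in range(len(x))) / c_l
        for R_l, c_l in zip(R, c)
    ]


def congestion_penalty(u):
    # exp(u) is an assumed stand-in for the convex penalty: it grows quickly
    # near capacity but still has a value when utilization exceeds 1.
    return sum(math.exp(u_l) for u_l in u)


# Three links, two flows. Flow 0 uses only link 0.
# Flow 1 is split 50-50 over links 1 and 2 (fractional routing entries).
R = [
    [1.0, 0.0],
    [0.0, 0.5],
    [0.0, 0.5],
]
x = [12.0, 8.0]         # offered rates, treated as given
c = [10.0, 10.0, 10.0]  # link capacities

u = link_utilization(R, x, c)
print(u, congestion_penalty(u))
```

Link 0 ends up above 100 percent utilization here, which is the mismatched-routing case mentioned in the answer above.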
So that's the problem that the operator is solving. And this essentially avoids
bottlenecks in the network. Yeah.
>>: They are also optimizing for latency and a couple other factors, right, or is
that cost?
>> Jennifer Rexford: So that's true. We don't capture that here, but you're right it
kind of implicitly gets captured in the sense that if you take really circuitous paths
it imposes load on so many links that you tend to be biased towards paths that
don't have high delay. But you're right, we're not explicitly capturing that here.
But you could imagine extending the problem to have it.
>>: [inaudible] first order [inaudible] and the second order is latency, then the
third order might be cost or [inaudible].
>> Jennifer Rexford: Right. So here we're just capturing the congestion
piece but you're right the delay will matter to you. Yeah.
>>: [inaudible] essentially can [inaudible] in terms of looking at the average --
>> Jennifer Rexford: Right. So the limitation of the theory is that we're thinking
really of everything as a fluid so that there isn't burstiness. But you could view
indirectly that we're capturing the burstiness by caring about this, because if -- in
theory there's no reason we couldn't run the link almost at 100 percent utilization but our
reluctance to do so is based on the fact that that's going to be bad when traffic is
bursty. So the theory didn't capture it but we're including this piece in the
problem because we're worried about it.
And these are exactly the issues that, you know, we're going to end up with a
protocol that we think we could say should go about. But in practice we don't really
know how it performs when the traffic is bursty, we just have a hunch that we've
designed it with those constraints in mind and when we simulate we'll capture the
actual burstiness of the traffic. Yeah. Good question.
So then what do we do? So essentially what we do is we combine these two
objectives together. We say we want to maximize user utility minus the
congestion penalty that we're composing -- imposing on the network with some
weighted -- weight between the two.
So one way to think about that is we're trying to balance the goals of the end host
to maximize throughput which tends to push the network towards being
bottlenecked while being aware of the network operator's desire to minimize
queuing delay and avoid congestion in the network.
And if W is zero we're essentially doing very much the same objective function
that TCP has. If W is very large, we're essentially just doing traffic engineering
and being incredibly conservative about what traffic we even let in the network in
the first place. And we'll see that having W equal to 0 is a fairly fragile place to
be, but having W just a tiny bit larger than 0 is actually the sweet spot for us. Because
we almost maximize throughput without making the network quite so fragile as it
would be if that was all we cared about.
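As a sketch, the combined objective described above can be written as follows, reusing the assumed symbols U_i, f, R_{li}, x_i, and c_l, with w as the penalty weight.

```latex
\text{maximize}\quad \sum_i U_i(x_i) \;-\; w \sum_l f\!\left(\frac{\sum_i R_{li}\, x_i}{c_l}\right)
```

Setting w = 0 leaves only the utility term, and a very large w makes the congestion penalty dominate, which matches the two extremes just described.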
So I'll come back to this penalty weight again a little later when we do the
numerical experiments.
Okay. So what do we do? Now we've got this optimization problem that we
formulated, and now we need to find a distributed protocol that solves that
problem. So we're going to use the mathematical techniques that Mung Chiang
and his brethren have come up with. But one problem we're going to run into is
there's a sort of watershed in the optimization community about problems that
are convex and problems that aren't. And we don't really know how to deal with
the ones that aren't terribly well.
And so the first thing I'm going to do is I'm going to force the network to make the
problem that Mung solves be convex. So this is the dynamic that Mung and I have: I
tell him a problem, he says that problem's really hard, I'm going to work on it,
it's cool. I say no, it's hard, that's my problem, I'm going to make it easy.
And so my job is to make Mung's life easy and his is to make mine hard.
So convex problems are great because there's a local minimum that's also the
global minimum, you can use gradient techniques to find it, and those are
amenable to distributed implementation. Non convex problems don't have that
property; you can easily get stuck in a local minimum, and so you have a hard time
making a distributed algorithm that will find the global minimum like that. Okay.
So we want to have convex problems. We know how to handle them. And we
know how to derive distributed solutions in fact that are computationally simple
that provably converge.
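The difference between the two cases can be seen in a toy one-dimensional sketch. The two functions and the step size are assumptions picked purely for illustration; they are not part of the talk.

```python
def gradient_descent(grad, x0, step=0.01, iters=5000):
    """Plain gradient descent on a one-dimensional function."""
    x = x0
    for _ in range(iters):
        x -= step * grad(x)
    return x


def convex_grad(x):
    # Gradient of f(x) = (x - 3)^2, which has a single minimum at x = 3.
    return 2 * (x - 3)


def nonconvex_grad(x):
    # Gradient of g(x) = x^4 - 4x^2 + x, which has two separate local minima.
    return 4 * x**3 - 8 * x + 1


# Convex case: both starting points end up at the same minimum.
print(gradient_descent(convex_grad, -10.0), gradient_descent(convex_grad, 10.0))

# Non-convex case: the answer depends on where the descent starts.
print(gradient_descent(nonconvex_grad, 2.0), gradient_descent(nonconvex_grad, -2.0))
```

In the non-convex case the run that starts at 2.0 settles in the shallower of the two minima, which is the "stuck in a local minimum" behavior described above.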
Okay. So why is our problem not convex the way I've talked about it so far?
Well, the main reason is I'm assuming single path routing, or at least I haven't
stated what those routes are. Getting back to the question that you asked a
moment ago, what is this R? It's the set of links that flow I traverses in the
network. And if I restrict it in some way, like a hundred percent of the traffic
has to go on one path or a hundred percent on the other as would be the case
with today's single path routing, or even if I use ECMP and require only even
divisions of 1 over N across the paths, I don't have a convex problem because I
have these kind of weird constraints imposed by the way I can split over multiple
paths.
So I'm going to rephrase my problem in terms of multipath routing where I'm
going to assume now that I have a set of paths between every ingress egress
pair. And this gets back I think to a question you asked about -- somebody asked
about delay. Yeah, it was you. So I'm going to pick these paths smartly so that I
pick paths that don't have high delay. Okay. So I'm going to introduce a little
practical stuff under the hood here to pick not all possible exponential set of
paths but just a few that are reasonably disjoint and don't have real huge delay.
And then I'm going to now say the problem I'm actually solving is to maximize
utility, summed over the utility I get from the aggregate rate that I'm
sending, where that rate is split over multiple paths, such that the
link loads individually can't be more than their capacity.
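A sketch of this multipath restatement, in assumed notation: z^i_j is the rate that source i sends on its j-th path, and H^i_{lj} is 1 when that path uses link l and 0 otherwise.

```latex
\begin{aligned}
\text{maximize}\quad   & \sum_i U_i\!\Big(\sum_j z^i_j\Big) \\
\text{subject to}\quad & \sum_i \sum_j H^i_{lj}\, z^i_j \le c_l \quad \text{for every link } l, \\
\text{variables}\quad  & z^i_j \ge 0 .
\end{aligned}
```

The constraint is linear in the path rates z^i_j, so any mix of two feasible splits is also feasible.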
So essentially I've just converted what was originally a single sending rate
question to now a multi rate sending problem of how much I'm going to send on
each of multiple paths and I'm going to assume I can do this at any granularity,
completely arbitrarily fine. It could be 50-50, it could be 49-51 if I need it to be.
And that's possible today using a variety of techniques that help multiple packets
of the same flow stay on the same path, using hashing techniques programmed to do
pseudo random splitting rather than truly arbitrary weighted random splitting.
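One way such hash-based splitting might look is sketched below. The function name, the flow identifier format, and the use of SHA-256 are assumptions for illustration and do not describe any particular router implementation.

```python
import hashlib


def pick_path(flow_id: str, weights: list[float]) -> int:
    """Map a flow to a path index so all of its packets take the same path.

    The hash of the flow identifier is turned into a point in [0, 1), and the
    path whose cumulative weight range contains that point is chosen.
    """
    digest = hashlib.sha256(flow_id.encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64
    total = sum(weights)
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight / total
        if point < cumulative:
            return index
    return len(weights) - 1  # guard against floating point rounding


# A 49-51 split over two paths, applied to many hypothetical flows.
weights = [0.49, 0.51]
counts = [0, 0]
for n in range(10000):
    counts[pick_path(f"10.0.0.1:{n}->10.0.0.2:80", weights)] += 1
print(counts)
```

Because the choice depends only on the flow identifier, packets of one flow are never reordered across paths, while the aggregate split across many flows approximates the target weights.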
>>: So because of the convexity do you have to pick the path carefully or --
>> Jennifer Rexford: No. No. The main thing that I need is the splitting. Yeah,
really good question. I don't actually really care. The main thing I'm getting out
of how I pick the paths is because I'm not choosing to represent every possible
path on the graph, I'm stepping away from optimality, because I originally was
allowing any path potentially to be used. But now I'm essentially not allowing the
path 987321546, and it could have been that the optimal solution would have
used it and I'm not. So it's more that I need to pick a reasonable set of paths that
the optimal solution doesn't lie far away from what I'm letting get expressed here.
But the good thing is like picking the K shortest paths or the K shortest paths that
are at least somewhat disjoint from one another seems to work pretty well. Most
of the graphs I've looked at -- I know data centers probably don't look like this --
don't have so much path diversity that that's hard to do. In the data center that
might not actually be the case.
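As a rough illustration of picking K short paths, the sketch below enumerates simple paths by brute force and keeps the K with the fewest hops. The graph, the function names, and the hop-count metric are assumptions; brute-force enumeration is only workable on tiny graphs, and the sketch does not enforce the disjointness preference mentioned above.

```python
def simple_paths(graph, src, dst, path=None):
    """Yield every loop-free path from src to dst via depth-first search."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in graph.get(src, []):
        if nxt not in path:
            yield from simple_paths(graph, nxt, dst, path)


def k_shortest_paths(graph, src, dst, k):
    # Sort by hop count and keep the first k (ties keep discovery order).
    return sorted(simple_paths(graph, src, dst), key=len)[:k]


# A small four-node example topology (hypothetical).
graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["b", "c"],
}
paths = k_shortest_paths(graph, "a", "d", 2)
print(paths)
```

On this topology the two selected paths happen to share no links, while the two longer paths through both b and c are left out.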
>>: So you're assuming you're varying the rate along each of the paths so one
path might [inaudible] traffic at a greater rate than another?
>> Jennifer Rexford: Exactly. So you could think of that two ways. You could
think that the component that's sitting here, which could be an end host or an end
switch either is computing those rates and sending at them or is computing an
aggregate rate so that the host is told send at whatever the sum turns out to be,
or a network is just doing proportional splitting based on the relative ratio
between those weights. Or you could think of a single component doing both those
functions. Yeah. But the user utility is defined just in terms of the aggregate
because it's sort of oblivious to which one it's using.
>>: The [inaudible] because I didn't get that. Because in one case you have
[inaudible] just choosing the path and now you're saying [inaudible] being integer
I'm going to use some point to find a split, right?
>> Jennifer Rexford: Right. So one way to think about it is the way I've
formulated the problem here before I changed the problem I was requiring this to
be 100 or 010 or 001. And so I actually have discontinuities that are introduced
by that constraint which essentially leave me with situations that are a bit like -- a
bit like this picture on the right. Now I got continuity so I can make progress by
just imperceptibly shifting a portion of traffic from one path to another. And all the
solutions in between are feasible ones in my model, whereas in the earlier model
I was forcing things to be discontinuous. I could only have integer solutions in a
>>: Right.
>> Jennifer Rexford: In the -- before I made this change.
>>: So just the fact that you move from an integer to just a linear program.
>> Jennifer Rexford: Exactly.
>>: Solution is what makes it --
>> Jennifer Rexford: Exactly. Yeah. Exactly. It's not the set of paths, it's
purely the fact that I now can have a continuous solution rather than a discrete
>>: It's basically going from [inaudible] linear program to standard [inaudible].
>> Jennifer Rexford: Exactly. So essentially I'm in the realm of multi-commodity
flow now. Exactly. It's the same reason MPLS can achieve optimal traffic
engineering but OSPF can't. Because the ability to do arbitrary splitting is the
key that's making the optimization problem tractable and optimal.
>>: [inaudible].
>> Jennifer Rexford: I'm only restricting myself to that for computational
simplicity and in the sense that I don't really want this node in practice to have to
deal with exponential number of paths because I'm going to keep state in the end
host and router in proportion to that. So it's purely -- the theory didn't really
require me to be so restrictive. It's more that I think in practice it's not realistic
to enumerate all the paths. And I also don't think it helps much beyond a certain
point. So there's no reason I couldn't conceivably do that, it just didn't make
practical sense. Yeah?
>>: If you are -- if you were to fix paths up front, even fixing one path [inaudible] I
mean what you mean by single path routing is letting the problem choose the
route. That's what's not --
>> Jennifer Rexford: Even in this setting if I picked three routes and I said pick
one of the three, I'm still going to have a problem.
>>: Yeah. But if you just pick, you say, you know, you have to go on this path
what [inaudible].
>> Jennifer Rexford: Yeah. That's right. Exactly. If there was only one path
here I would be fine, as long as I didn't allow -- it's more that if I have multiple and
I restrict you to using only one of them. You're totally right. And in fact, we'll see
in practice most of the solutions put all the traffic on one path. There are just a
few node pairs that often in practice take advantage of the need to split. So part
of it is a computational problem I have to get through to be able to get the math
to work.
And part of it is that occasionally there really is a part and a place in the network
where you need that flexibility for efficiency reasons. Yeah?
>>: Well, [inaudible] if you are -- if there is just one edge, just two nodes.
>> Jennifer Rexford: Right.
>>: If your [inaudible] complicated, you can still of a number [inaudible] right?
[inaudible] if.
>> Jennifer Rexford: In terms of the number of paths you could.
>>: No.
>> Jennifer Rexford: Sorry, I didn't --
>>: Just if one path, one edge. But your functions [inaudible] sophisticated.
>> Jennifer Rexford: Yeah. Yeah. Yeah, that's fair.
>>: Then it's still --
>> Jennifer Rexford: Well, the nice thing here is my U function is concave and
my F function is convex. Because essentially I've got this diminishing return
function, utility increases but diminishes, and if you think of it like an M/M/1 queue, it's
essentially increasing and convex. And that's what's letting us make that
In fact what this is fixing is the constraint. I had a non convex constraint that I'm
fixing by moving to this flexible splitting over multiple paths. I already had a
convex objective function before I started because the U function and the F function
have the property that I need to make that true. Now, as you can imagine if I try to
handle other classes of [inaudible] because utility functions are not as friendly as
aggregate throughput is. That might not be true anymore. But for throughput it
is. And for certain delay formulations like minimizing aggregate delay, that's also
true. It's not true for, you know, delay bounds where you actually have a sharp
discontinuity in the curve. So our delay work that was asked about earlier, we
actually assume that we're trying to minimize aggregate delay rather than
maximizing the percent of traffic that satisfies a delay bound which actually
wouldn't fit in this kind of framework.
Okay. So here is our -- here is our new problem now. So what do we do. So
what are -- the distributed solutions that pop out of the math are going to have
this basic flavor to it. The links are computing -- the routers have multiple paths.
They may be computed dynamically, or they
may be set up in advance by the network administrator. They're going to monitor
the load on their incident links and compute a so called price that's going to be
reflective of the congestion on the link.
They're going to either feed that back to the ingress node which could be a router
or a host or they'll be picked up as the packets flow through the network similar
for folks that are familiar with ATM network, similar to the RM cell that ATM
networks had. And that source based on all of these prices is going to update its
path rate so it's going to both change the aggregate rate at which it's sending and
the percentage of that traffic it sends on each of the K paths.
The network administrator is tuning some parameters that get tuned on the very,
very coarse time scale and otherwise can really take the rest of the day off.
Okay. The network is doing most of the work here.
These other parameters which I cryptically refer to here are going to be some
tuning parameters that pop out of the math, and that are going to be the
bane of our existence and it's going to be why the so-called optimal protocols are
not going to be the final answer to the question.
Okay. So what do we do? So I'm not going to go into a lot of detail about exactly
the theoretical techniques, I'll just say that the decomposition techniques we use
are pretty standard. You know, there's a little bit of work to turn the crank and
make them pop out, but we didn't really do any significant innovation there.
We're just going to try to give you a flavor for what that process looks like. So
essentially what we have are prices that are penalties for getting close to
violating a constraint like the capacity constraint on each link. We're going to
punish a link that's very, very close to overloading.
The path rates are going to be updated based on those penalties to shy away
from paths that are congested. And so you can get an example of this in a single
path setting. Think about what TCP congestion control does, which can be
expressed in a similar kind of way, just it's a little simpler than the problem that
we have. The link prices are things like packet loss, or packet end to end delay.
And the sources are updating additively increasing or multiplicatively decreasing
based on the prices they get back from the network.
In the case of TCP those prices are implicit in the observations the sources
themselves make about loss and delay. In our case, they're going to be explicit
and provided by the network by marking the packets or by providing feedback to
the end host.
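As a minimal sketch of the additive-increase, multiplicative-decrease behavior mentioned here, assuming a single binary congestion signal per round trip (the function name `aimd_step` and the constants are illustrative, not taken from the talk or from any TCP implementation):

```python
def aimd_step(rate, congested, increase=1.0, decrease=0.5):
    """One AIMD update: back off multiplicatively on congestion, else probe additively."""
    if congested:
        return rate * decrease
    return rate + increase

rate = 10.0
signals = [False, False, True, False]  # hypothetical per-RTT congestion signals
history = []
for congested in signals:
    rate = aimd_step(rate, congested)
    history.append(rate)
# history is [11.0, 12.0, 6.0, 7.0]
```

Here the "price" is only a one-bit loss signal inferred by the source, which is the implicit feedback the talk contrasts with explicit prices.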
So our problem is going to be very similar in spirit, it's just more complicated
because our objective has two terms in it rather than just aggregate utility, and
we're using multiple paths instead of one.
But the basic technique is pretty much the same. And so just to give you an
illustration of what kind of transformation we'll make to the problem in order to
make it distributed. So one constraint we have, and it's really the main one, is
we have to stay below link capacity. Okay? So we're going to do things like this.
We're going to say okay, let's suppose the link load is YL. I want to make sure
YL stays below C. And one way I can do that would be to do some sort of
subgradient feedback update where I essentially look at the link load and see
how I'm doing. And if I'm in fact at a very high load, I'll actually increase the
price because the gap, if link load is higher than YL, the target load I'm trying to
get at, I'll actually increase the price and that will make the path that this link
traverses -- I'm sorry, the link, the path that's traversing this link look less
attractive, which will ultimately lead the source to send less traffic on it.
So here we've already introduced one of these so called other parameters that
I'm going to come back to later, so the step size is going to be something the
network operator has to tune. So that's unfortunate. And we'll see later it
actually is -- it is problematic in practice as well.
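The link update described above can be sketched as a projected subgradient step. This is a minimal illustration, assuming a scalar price per link that is kept nonnegative; the names `update_link_price`, `target`, and `step` are hypothetical:

```python
def update_link_price(price, load, target, step):
    """Raise the price when load exceeds the target, lower it otherwise, never below zero."""
    return max(0.0, price + step * (load - target))

price = 0.0
for load in [12.0, 11.0, 9.0]:  # hypothetical measured loads on one link
    price = update_link_price(price, load, target=10.0, step=0.1)
# The price rises while the link is over target and relaxes once load drops below it.
```

The `step` argument is the step-size parameter the operator would have to tune; the link needs only its own load and target, which is what makes the update local.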
But that's the kind of thing we're doing. We're taking each of the constraints and
now you can see this is decomposed in the sense that a link alone can do this
computation by itself. So each of the links in the network is doing a simple
kind of update like this. And other parts of the problem, other constraints and
other parts of our objective function get decomposed in similar ways into things
either the edge node is able to do on its own or the links are able to do on their
own. Yeah?
>>: [inaudible].
>> Jennifer Rexford: So in fact I didn't go through this here, but we have -- you
can end up now decomposing, making link load match YL into its own update.
It's a made up variable. So it's made up. It's made up. So another constraint I
now have is link load equals YL. That I can solve by saying YL minus link load
has to be zero. And I can measure how well I'm doing.
>>: [inaudible].
>> Jennifer Rexford: Has no real meaning, it's artificial construction just to allow
me to decompose the problem. And it's just one way of decomposing the
problem. And in fact, we use four different decomposition techniques that vary in
terms of which constraint they go after first and in what way they decompose it.
But they're all pretty standard ways of taking linear constraints and decomposing
them into updates. Okay. So just to give -- so essentially another example, this
we call effective capacity, essentially gives us an early warning of impending
congestion. We're doing something to get a sense of how much load the link
should have. We're noticing that we're doing worse than that, and so we're going
to panic a bit by penalizing traffic from using this link.
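The effective-capacity step can be written out as follows. This is a sketch under assumed notation, not a reproduction of the slide: $c_l$ is the capacity of link $l$, $y_l$ is the made-up target load, $\mathrm{load}_l$ is the measured load, $p_l$ is the link price, and $\beta$ is the step size. Relaxing the equality between load and $y_l$ to an inequality, so that the price stays nonnegative, is also an assumption.

```latex
\begin{align*}
\text{original constraint:} \quad & \mathrm{load}_l \le c_l \\
\text{with auxiliary variable } y_l\text{:} \quad & \mathrm{load}_l \le y_l, \qquad y_l \le c_l \\
\text{price update at link } l\text{:} \quad & p_l(t+1) = \bigl[\, p_l(t) + \beta \bigl( \mathrm{load}_l(t) - y_l(t) \bigr) \bigr]^{+}
\end{align*}
```

Each line involves quantities a single link can observe, which is why the update can run at the link without global coordination.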
And there's another parameter that allows us to overshoot the actual load on
the link to go a little faster, to send more traffic than the network can handle in
order to converge more quickly. And I'm not going to go into that. But essentially
each of the decompositions we do has a very specific physical meaning in helping
the links or the sources understand the congestion state that they're in based on
local computations they can do.
So skipping some of the details because again, as I mentioned, my goal here is
to convey the journey more than the outcome, essentially we end up with four
different decompositions. We take all the techniques the optimization community
has come up with for slicing and dicing these kind of complex optimization
problems, and they all have the same flavor. The links update a price, the
sources update their splitting rates based on that. And they all just differ in which
order they try to address each of the constraints in the problem.
And so we end up with four different algorithms with different number of tuning
parameters and different degrees of aggressiveness in responding to congestion.
So some of them are essentially going to have different dynamics than others.
They're all provably optimal. They're all provably stable. They are by definition
by the way they were derived distributed algorithms for solving the optimization
problem where the components doing the computation are the links and the edge
nodes. Okay. So that's all well and good. The theory tells us that we're stable
and optimal. But we're really not done. Because we don't actually know how they
perform in practice. And in fact, practice is going to deviate quite a bit
from the theory in some of the cases. So again we know they're going to
converge. But they're only going to converge for diminishing step size. They're
only going to converge on some time scale that we can have extremely,
extremely weak bounds on, not strong enough bounds for us to be happy that
they're going to perform well in practice.
And these tuneable parameters, beyond telling us that they have to be
diminishing with time which isn't really practical, we have no guidance from the
theory on how to tune. Okay? So we have something that is provably optimal
and stable, but still not very useful.
So what we did is because at this point our so called protocols are really
numerical computations, right, they're not even algorithms -- they're not even
distributed algorithms in any really meaningful sense, they're really just function
evaluations that we're doing. So essentially we went into MATLAB and we did a
huge multi-factor experiment where we varied every parameter in all these
protocols and made a ton of graphs to study the rate of convergence and the
sensitivity of each protocol to its tuning parameters. So the first thing we looked
at is how much do we care about the relative value of the users' utility function,
where W equal to zero means that's all we care about, versus being robust by
being a little bit conservative in what we put on the network.
So Y axis equal 100 means we've maximized aggregate utility. As it starts to
drop off we're being so conservative that conservatism is costing the users
performance problems.
What we find in a lot of our experiments is if we operate at W equals zero, the
system is extremely fragile, hard to make stable, but fortunately we've got a
relatively wide band of values of W where we get pretty close to optimal behavior
despite not being exactly optimal. So we can make W equal to a half or a sixth or
something and still be pretty much in a regime where we're going to be able to
nearly maximize user utility. And you'll find you can be much, much more
robust by making that tradeoff. And these are three different network topologies
that we studied that on. So that's W. So we have a sense of how to set that. It's
topology dependent.
It relates a bit to the number of diverse paths in the network. But for the most
part, as long as W is a half or so, we're not in bad shape.
So these are the kind of graphs we looked at for a whole bunch of the different
protocols. So a protocol might have a particular tuneable parameter like a step
size. We'll sweep its value and we'll look at how long it takes to converge to within
an epsilon of optimal. And the Y axis is the number of iterations. Think of them as
round trip times. How many round trip times do we have to keep iterating and
updating before we converge?
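A sweep of this kind can be sketched on a toy problem. This is an illustration only, in Python rather than MATLAB, assuming one source with log utility sharing one link; the function `iterations_to_converge` and all constants are hypothetical and not the experiment from the talk:

```python
def iterations_to_converge(step, capacity=10.0, eps=0.01, max_iters=1000):
    """Count iterations until the source rate is within eps of capacity, capped at max_iters."""
    price = 1.0
    for k in range(1, max_iters + 1):
        rate = 1.0 / price  # log utility: U'(x) = 1/x, so the source picks x = 1/price
        if abs(rate - capacity) <= eps * capacity:
            return k
        # Link-side subgradient update; the tiny floor avoids division by zero.
        price = max(1e-6, price + step * (rate - capacity))
    return max_iters

# Sweep the step size and record the iteration count for each value.
sweep = {step: iterations_to_converge(step) for step in (0.0001, 0.001, 0.01, 0.05)}
```

In this toy setting a very small step never gets there within the iteration cap, a moderate step converges quickly, and a step that is too large overshoots and fails to settle, which mirrors the sensitivity curve described above.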
And so in a graph like this we would look at it and say, well, we converge to near
optimal within a few dozen iterations. That's not bad. And we look at the width of
these curves to get a sense of how sensitive it is to our parameters, so if we end
up setting the step size over here, we're in big trouble because it's going to take
300 round trip times to converge. So we're reasonably happy with this. It could be
worse, it could be narrower, but it's still not great: if we don't get the value right
we're not going to converge all that quickly. And so we compared all these
protocols based on graphs like this. And what we ended up with was seeing that
small values of W below sort of a sixth to a half are pretty dangerous but once
you had a large value of W, things are pretty good. And the schemes that had
larger number of tuning parameters really didn't do better, even if you could find
the best tuning parameter to use for them, they didn't really converge much faster
than the simpler schemes.
As you might expect, schemes that allow a little bit of more aggressive
transmission overshooting occasionally converge more quickly, and in some
sense doing direct updates from current observations of network conditions was
more effective than tracking past history. Okay. And so we took away from
this that all of the protocols, all four of the -- well, distributed algorithms is
probably more appropriate at this point, had some interesting flavor to them, but
they each were somewhat unsatisfying in one respect or another.
And so we took away these very basic observations and we cherry picked
aspects of the computations that each of these algorithms are doing to try to take
the direct update rules and the techniques that had fewer tuneable parameters,
and we constructed a new protocol that's simpler.
So just to show you what that looks like, essentially what we do on every link is
we keep track of the capacity of the link, and how much the load is either in excess
or underneath that, and we update the loss price, if you will, based on that.
So think of the loss price as saying, well, if link load is higher than link capacity,
I'm going to lose the traffic that's in excess of the link capacity. So logically you
could think of this P price as the fraction of traffic that's going to get lost. And the
other price as essentially the increasing queuing delay, this is an F prime, so it's
the derivative of that F function that looks like queuing delay. So this is a little bit
like the link measuring how much queuing delay has been increasing in the
recent past. And each of these are equations that appear in various of the other
protocols we derived theoretically, so that's where we got them from. And
essentially we accumulate a price along the path by summing those up. And so
the source looks at each path in terms of the sum of these loss- and delay-based
prices.
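The two link prices and their accumulation along a path can be sketched like this. It is a rough illustration, not the protocol's exact equations: the delay-like function is assumed to be f(load) = exp(load / capacity), and the names `loss_price`, `delay_price`, `path_price`, `step`, and `w` are hypothetical:

```python
import math

def loss_price(prev, load, capacity, step):
    """Loss-like price: grows while load exceeds capacity, decays otherwise, never negative."""
    return max(0.0, prev + step * (load - capacity))

def delay_price(load, capacity, w):
    """Delay-like price: w * f'(load) for the assumed f(load) = exp(load / capacity)."""
    return w * math.exp(load / capacity) / capacity

def path_price(links):
    """Sum the loss and delay prices of every link on the path."""
    return sum(link["p"] + link["q"] for link in links)

# Two hypothetical links on one path: the first is overloaded, the second is not.
loads = [(12.0, 10.0), (5.0, 10.0)]
links = [
    {"p": loss_price(0.0, load, cap, step=0.05), "q": delay_price(load, cap, w=0.5)}
    for load, cap in loads
]
total = path_price(links)
```

Only the overloaded link contributes a loss price here, while both links contribute a delay price that grows as load approaches capacity.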
Note that versions of TCP do something kind of similar to this, they just do it
implicitly, right. They look at loss and they look at delay. We're not so different in
the end in doing that as well, although the details are a little different. Yeah?
>>: [inaudible].
>> Jennifer Rexford: So we do the computation roughly on sort of the round trip
time time scale. You could do it -- so you imagine feeding back stuff in the
packets themselves as they're flowing.
>>: But isn't that [inaudible].
>> Jennifer Rexford: It could be. I mean, you could choose to do it more slowly.
It's just if you want to converge more quickly when things change you might
choose to do it more quickly.
>>: [inaudible] to the close capacity and TCP tile to react and [inaudible] right.
>> Jennifer Rexford: Right. So what we have going for us that TCP doesn't is
feedback here is explicit. And so we know immediately what the congestion
state of a link is, rather than waiting for packets to get dropped before we know.
So we're getting in some sense a much earlier warning. In exchange, in fairness,
for the overhead of making that information explicitly available. Yeah?
>>: But are you multiplexing many, many TCPs into this network?
>> Jennifer Rexford: Yeah. Yes, you could think of this sort of flow I as really all
flows between a particular ingress-egress pair.
>>: So you are reacting, like Victor said you have little TCPs [inaudible] reacting --
>> Jennifer Rexford: There is no TCP here.
>>: Yeah, but I thought you were multiplexing many flows.
>> Jennifer Rexford: Yeah, but they're not doing TCP.
>>: Okay.
>> Jennifer Rexford: Yeah. They're being -- they're being told to send at the rate
that we're telling them to send at. Yeah. So, yeah, that would interact in ways
that might be hard to predict. But, yeah, we're essentially telling -- you could
think of us as telling the ingress router how to shape the flows that come in, or
you could think of it as the ingress router telling -- or the ingress router being the
host who is being explicitly told the rate at which he should send. Yeah. And
that's actually critical. We get our faster adaptation than TCP because we're
being -- we're being so heavy-handed. Yeah?
>>: How do you compare this to MPLS's auto bandwidth. How would that
compare to the [inaudible].
>> Jennifer Rexford: So my sense of MPLS auto bandwidth, for folks who don't
know, the basic idea in auto bandwidth is if you encounter congestion you can
essentially dynamically set up a new path.
>>: [inaudible] dynamically grow the bandwidth of an LSP and I will set up
impact for itself, right?
>> Jennifer Rexford: Yeah, yeah. Exactly. The main difference here is we're
continuously doing updates. You could think of auto bandwidth as a, you know,
pull the parachute because things have gotten bad. Here, we're going to be
continuously adapting the sending rates and splitting ratios over the multiple
paths. And we don't have a target capacity for that path at the beginning, we're
just essentially always letting the network tell us what to do. So auto bandwidth
is conceptually similar but it's a bit more of a pull the parachute when the plane's
blowing up kind of thing. Yeah?
>>: Price is [inaudible] or just a --
>> Jennifer Rexford: Price is what?
>>: Is that a [inaudible] feedback or is it --
>> Jennifer Rexford: It is. It's a number. It's a number corresponding to the
current price. So if you were to think of it in ECN terms, it would be a multi-bit
variant of ECN, if you wanted to discretize that and carry it in the packet. Yeah?
>>: On the average how many source-destination paths do you have?
>> Jennifer Rexford: In most of our experiments we had about three to five.
And that was primarily because the topologies we're working with were
backbone network topologies where you really between city pairs would have
something in that ballpark. And we thought that was sort of reasonable overhead
to put on the ingress switch. I'm curious in the data center context what would be
the right thing to do. I assume that number would be too small to make more
effective use of a data center network that has more paths of comparable latency
available to them.
Okay. So that's roughly what TRUMP does. And then it does a little local
optimization at the ingress node which could either be an end host or a switch to
figure out essentially the best way to decide how much traffic to send along those
paths subject to the price of each of those paths in order to maximize utility
subject to some aggregate price you're willing to pay.
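One way to picture the ingress-side step is the sketch below. It is not the update rule from the talk: it assumes a log utility, so that the rate a source would want at a given price is 1/price, and it nudges each path rate toward that value. The names `update_path_rates` and `gamma` are hypothetical:

```python
def update_path_rates(rates, prices, gamma):
    """Nudge each path rate up if the path is cheap relative to the current total, down if costly."""
    total = sum(rates)
    return [
        max(0.0, z + gamma * (1.0 / price - total))  # 1/price is the desired rate under log utility
        for z, price in zip(rates, prices)
    ]

# Three hypothetical paths with equal starting rates and increasing prices.
rates = update_path_rates([2.0, 2.0, 2.0], [0.1, 0.2, 0.5], gamma=0.5)
# rates is approximately [4.0, 1.5, 0.0]
```

The cheapest path gains traffic, the most expensive one is driven to zero, and the aggregate rate changes at the same time, so a single update adjusts both the sending rate and the split.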
Anyway, I'm kind of glossing over details here in the interest of getting through
everything. But that's the basic gist. And I should stress again, TRUMP is not a
decomposition. It is not yet another decomposition. It does not pop out of the
theory, it popped out of our intuition applied to the experiments applied to the
protocols we got from the theory. We can prove a few things about it. We know
under certain conditions that it's stable and optimal, but we can't prove as strong
of results as we're able to prove for other protocols we started with.
When we did this, you know, massive multiparameter sweep in MATLAB, we
found it converged substantially faster and it has only one parameter that's pretty
easy to tune. So we're happy for the tradeoff here, even though we don't have
as rigorous of results from the theory.
Okay. So it's still as you can probably tell from the fact we're evaluating it in
MATLAB, it's still mostly a numerical experiment, it's not really a simulation. It's
just a bunch of equations that we're iterating through with assumptions about
traffic being a fluid and about feedback delay being constant, all of which are
really not true in practice.
So the next thing we do is we translate that into a packet-based protocol. And
nothing here will surprise you. This is pretty easy stuff. We're now going to have
time intervals T, we're going to count load as the amount of bytes that go across a
link during that time period, and we're going to update the link price for every time
period. And we're going to define that time period to be the max of the RTTs of
the different paths that are being used. We could do it less often than this. This
is something that seemed reasonable to us because you reasonably can't get the
feedback from all the paths any quicker than that. We haven't yet understood
deeply how much larger we can make it than this and still have the stability
properties that we see in the experiments, but with this update, we do -- we do
see stability.
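The packet-level bookkeeping described here can be sketched as a per-link byte counter that updates its price once per interval. This is an illustration under assumptions: the class `LinkMonitor`, the normalization of the excess by capacity, and the numbers are all hypothetical:

```python
class LinkMonitor:
    """Counts bytes crossing a link and updates a price once per measurement interval."""

    def __init__(self, capacity_bps, interval_s, step):
        self.capacity_bytes = capacity_bps / 8 * interval_s  # bytes the link can carry per interval
        self.step = step
        self.bytes_seen = 0
        self.price = 0.0

    def on_packet(self, size_bytes):
        self.bytes_seen += size_bytes

    def end_interval(self):
        # Relative excess load over the interval drives the price update.
        excess = (self.bytes_seen - self.capacity_bytes) / self.capacity_bytes
        self.price = max(0.0, self.price + self.step * excess)
        self.bytes_seen = 0
        return self.price

rtts = [0.020, 0.035, 0.050]   # hypothetical RTTs of the paths in use, in seconds
interval = max(rtts)           # update period T is the largest RTT
monitor = LinkMonitor(capacity_bps=8_000_000, interval_s=interval, step=0.5)
for _ in range(60):
    monitor.on_packet(1000)    # 60 KB offered against a 50 KB budget
price = monitor.end_interval()
```

Choosing the interval as the maximum RTT matches the reasoning above that feedback from all paths cannot arrive any sooner than that.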
Okay. So what does it look like in the end? Oh, I should say also when a new
flow arrives or departs, it knows from the ingress node what the prices of the
paths are, and so its appropriate rate for sending can also be determined directly
because we know the current -- our current view of the state of the network.
Okay. So unlike TCP that does, you know, sort of a slow start, we know the
appropriate rate that this flow can send out amongst the group of flows it's part
of, because we already know what the conditions on the network are. Okay.
>>: I'm sorry a new [inaudible] doesn't really talk about the [inaudible] flows,
>> Jennifer Rexford: So that's an interesting question. If we were trying to just
did the -- so TCP actually does have a notion of fairness when it does that. We
don't necessarily inherit that notion of fairness. So we looked empirically at
that, and we do see that we tend to do a pretty good job evenly, you know,
essentially not starving. But the property that TCP has where it actually does
give a certain fair share doesn't inherit -- we don't inherit that property because
our objective function is different.
>>: I know for -- what kind of fairness do you end up with?
>> Jennifer Rexford: It's not actually obvious what we end up with.
>>: I mean using the log function automatically implicitly gets you some kind of
fairness, right, because if you separate out people too much, the penalty you get
for giving somebody more but the benefit you get from giving somebody more is
far less the penalty you get for --
>> Jennifer Rexford: Definitely. So we expect we're doing something
reasonable, but because we're also subtracting this F term, we can't be 100
percent sure what we're getting. But the intuition especially if we keep W that
sort of weight between user utility and network utility to be pretty small. The
hope is we inherit similar fairness properties to TCP. We can't prove that in the
general case.
Other questions? Okay. So that's essentially what the protocol looks like. So
then we went to NS-2 and did more realistic experiments with real delays, with real
paths and with an on-off heavy-tailed traffic model mimicking what we kind of
expect web traffic to look like.
And essentially what we want to know is all these multi factor experiments we did
in MATLAB to be able to stumble upon the right answer to our problem. Were
those things misguided because they were such an abstract view of what the
network looks like, or did they give us a reasonably good view. And in fact we
did NS-2 experiments for the other protocols to get a sense also of whether our
MATLAB results were accurate. And actually they were pretty accurate in
practice. So we feel like the MATLAB results that we used as sort of a sign post
along the way gave us at least some insight of which way to turn at different
stages in the work. Yeah?
>>: [inaudible] in terms of [inaudible].
>> Jennifer Rexford: Relatively. We tended to pick the K shortest paths where
we did -- if the K shortest paths shared so many links in common, we would
essentially try to have paths that were a little bit more diverse. So you could
think of it as a tradeoff between K shortest paths and K link disjoint paths,
something a little in between that.
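A greedy heuristic in this spirit can be sketched as follows. It is one possible way to trade off shortness against disjointness, not the procedure used in the experiments; `pick_paths`, `links_of`, and `overlap_penalty` are hypothetical names:

```python
def links_of(path):
    """Return the set of directed links (node pairs) along a path given as a node list."""
    return set(zip(path, path[1:]))

def pick_paths(candidates, k, overlap_penalty=1.0):
    """Greedily pick k paths, scoring each by hop count plus a penalty per already-used link."""
    chosen, used = [], set()
    remaining = list(candidates)
    while remaining and len(chosen) < k:
        best = min(
            remaining,
            key=lambda p: len(links_of(p)) + overlap_penalty * len(links_of(p) & used),
        )
        chosen.append(best)
        used |= links_of(best)
        remaining.remove(best)
    return chosen

candidates = [
    ["A", "B", "D"],        # shortest path
    ["A", "B", "C", "D"],   # shares link A-B with the shortest path
    ["A", "E", "F", "D"],   # same length as the previous one but fully disjoint
]
diverse = pick_paths(candidates, k=2, overlap_penalty=1.0)
shortest_only = pick_paths(candidates, k=2, overlap_penalty=0.0)
```

With a positive penalty the second pick is the disjoint path; with the penalty set to zero the choice falls back to plain shortest-first order.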
>>: If you had [inaudible] would you still include them in your --
>> Jennifer Rexford: No. I mean, you could. But they're not going to help you
much because in the end you're probably not going to use them very much. So
we chose to limit the number of paths to be sort of in the three to six range.
Again, the protocol would still work and in fact would provably work at least
as well, if not better, if we included more. It's just that we chose not to include
them because we assume they would end up carrying an infinitesimal amount of
traffic anyway. And in fact we found that as we did experiments where we
increased the number of paths, once we got above four, we didn't really see
additional benefits from having more.
>>: What was the [inaudible] increasing the number of paths but actually we -diversity in the [inaudible].
>> Jennifer Rexford: Yeah. There's a trickery to that in the sense that if your K
shortest paths have too many links in common you don't get enough of a benefit
from the diversity. And yet if you pick completely disjoint paths, the K paths you
pick might have widely varying lengths. And so I think there's some art to picking
the paths. And we did some experimentation with that. And certainly if we just
pick only the K shortest paths we don't do as well as if we pick kind of a mix of
disjointedness and shortness. I don't know what the right answer is there but I think
picking a handful of paths that are not much longer than the shortest path, that
have sort of maximal disjointedness subject to that, something in that ballpark we
think will work well. But that's certainly in the realm of human engineering, not
something we can say something [inaudible] about.
So how do we? So I'm running low on time. Yes?
>>: Just some common way of [inaudible] similar [inaudible]. So what we did
was [inaudible] where of course that's an [inaudible] algorithm. So [inaudible]
matrix. And I couldn't make a problem convex by using the [inaudible] the
information about [inaudible] right. So therefore what you can do is you can use
some kind of maximum coverage.
>> Jennifer Rexford: That's interesting.
>>: And you essentially can flow model and [inaudible] for example the
>> Jennifer Rexford: Okay.
>>: I can reach some kind of cover like 90 percent, then of course then you can
do this for [inaudible] and go back and now [inaudible].
>> Jennifer Rexford: Okay. Very cute. That's nice. Yeah, that's exactly what
we don't have. I mean, we're essentially doing things like multicommodity flow
and then picking the paths that tended to carry traffic, but we weren't really
formalizing it quite that much. That's really cool. Yeah.
So now I'm just going to really briefly, given I'm low on time, run through just a
few graphs. So what we're seeing on the left here is a time series plot of TRUMP
running and the Y axis is the aggregate throughput. Keep in mind here
aggregate throughput includes stuff that's getting lost, so these little spikes you're
seeing on the upper left are traffic that in fact -- it looks like it's a good thing
but it's actually bad because traffic is getting dropped on the floor. So what we
see at steady state is what load the network is carrying in aggregate. So you can
see if we make W, this parameter where we are conservative, too high, we end up
being so conservative that we hurt the aggregate throughput.
But in fact, we can actually do pretty good even having protocols that are much
more stable by having W just be kind of in the one-third to one-half range. So
that's just sort of a quick confirmation of that.
If you look at the graph on the right, that was one of the earlier protocols that we
evaluated, optimized for the step size at the best value we could possibly get for
it, after sweeping in MATLAB like crazy. And as you can see, the way it behaves
under two different settings of its step size, it's pretty -- well, it's pretty sensitive.
Those two step sizes, a little hard to see, are pretty close to one another. The black
curve is doing as good as we could possibly do if we let MATLAB search for a
long time to find exactly the best value. The blue curve is not much different in
step size, and yet you can see things just going nuts.
So again, we have a protocol that's provably stable and optimal, but we have to
have the right step size to get it to work, whereas we found with TRUMP actually
it was pretty insensitive to the tuning parameters.
We did some experiments with failures. So here we look at three different paths
that are carrying traffic between a particular ingress, egress pair. The blue, the
red, and the green are the loads on the three different paths. Actually what you
can see from this actually first off is that the vast majority of the traffic is going on
one path, and in fact this is a recurring theme. In other words, somebody
asked earlier, do we really need multiple paths or do we just need flexible
splitting on whatever paths we have. And the latter is the point. And in fact, for
the vast majority of source-destination pairs we do see 100 percent or nearly 100
percent of the traffic going on just one of the paths. When we fail a link on the green
path, we see the system adjust pretty quickly, and it puts more of the traffic
now on the third path, the red path. Although less because now the network has
less capacity and is not as able to handle the load. But it responds pretty quickly.
And here the time is in seconds based on a -- I think it's the Sprint topology
looking at the link between New Jersey and, gosh, Indiana, I think it's Indiana.
And finally we also did some experiments where we varied the average file size.
So we looked at the web objects. They have a Pareto or exponential distribution
of file sizes and we changed the mean file size. And we're looking here at a Y
axis where 100 percent is good, where we actually maximize the aggregate
throughput. And as you can see, if the file sizes are really tiny, we're not going to
do that great. But if the average file size is reasonably big, where we have at
least a few round trip times for the flow, we actually get pretty quickly up to a
reasonable use of the network.
And it's actually fairly robust across a range of different distributions. So the
mean seems to be more important here than how heavy the tail actually is.
Now, if you were doing TCP as you might expect, small flows are also extremely
problematic for TCP, and in fact they're even more problematic than they are for
us, largely because we essentially have information about the state of the
network when the flow starts. So we're able to jump in right away but it's still
going to take us a little bit of time. The flow size has to be -- flow has to be
around at least for a little while for us to not have such bursty traffic that we don't
get to the right place before the flow actually ends.
And finally this gets at a question that's come up in a few guises here already is
how many paths do we need, so this is looking at time on the X axis looking at
the aggregate throughput, depending on how many paths we let the network
have. The black curve is where we force everybody to use one path and the blue
is two, yellow is three, green is four. So essentially having three or four paths,
we're starting to get extremely diminishing returns at this point by adding more.
This is for the -- I think this is the Sprint topology or it's an ISP-level topology.
Now, again, if you were to look at a topology with greater natural path diversity,
you might need more than this. But what we found for the backbone networks is
that there were so few paths that were substantially close in latency and number
of hops that really the fourth, fifth, and sixth paths were so long that you
wouldn't really have wanted to put traffic on them anyway. So for the most part
the additional paths don't buy you much in this context.
>>: In this context you have the [inaudible] why you getting integers of
>> Jennifer Rexford: That's a great question. That could very well be. I don't
>>: I mean just -- just as an experiment if you have like a fully [inaudible] graph,
do you [inaudible] going to see if getting --
>> Jennifer Rexford: Yeah, that's a really good point. Yeah, even when we have
four paths, we really tended to see 100 percent rather than splitting.
>>: No, but I [inaudible].
>> Jennifer Rexford: They're largely disjoint. Not completely.
>>: Because if the -- if the two [inaudible] flip flows are interacting at any given
point, then you might -- you might just say that all you know, I'm just better off just
sending everything on one path. And what we're going to [inaudible].
>> Jennifer Rexford: Yeah. That's a really good point. Yes. So we didn't do
that. But that's a really good point. So anyway, to conclude, TRUMP in the end
has one easy-to-tune parameter. It only needs to be tuned when this W value
is set really small, which we tend not to want to do anyway, because it makes the
system not terribly robust. It's pretty quick in recovering from link failures and
recoveries. And it seems to perform in a way that although it depends on file size
it's not terribly dependent on the variance of file sizes. And we call it TRUMP
because it trumps the other algorithms.
We had an earlier protocol I didn't subject you to called DUMP. And anyway we
dumped DUMP because TRUMP trumped. Anyway, TRUMP is better than DUMP. And
we think, although we don't know for sure, it's something we're still working on,
that it might be possible to design a variant of TRUMP that works with implicit
feedback. Note here we are passing back prices on the links, with
information that kind of maps to loss and delay variation. We think that that
intuition could be applied to have a variant of TRUMP that works with the end
host or edge router implicitly inferring the loss and delay variation. But we
haven't yet gotten traction on that problem.
So another question you might have is okay, so so far I've still been pretty
abstract. Yeah?
>>: How many [inaudible].
>> Jennifer Rexford: I think we're just representing them as floating point
numbers. That's a good question though, what level of granularity do we really
need. I suspect a handful of bits is probably enough. But we definitely need
more than one.
>>: Because I mean sort of the [inaudible] like I said, with TCP you can sort of
reverse engineer it and get the same thing, but usually there's like a differential
equation model of TCP, and they assume like many flows, each TCP source is
[inaudible], and something I was always wondering is, is there any sense as to how
much mixing you need to actually be able to, you know, have that model
represent anything.
>> Jennifer Rexford: Yeah, that's a really good question. Yeah, I don't know if
anything we do here sheds light on that, but it's a good question. And we
definitely are taking advantage of some of the dynamic range of the price
information here, too, so it would be interesting to know if we represented it
logarithmically, how many bits would we really, really need to do it. I suspect that
we'd need a handful, but hopefully not, not a huge number.
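[Editor's sketch: the logarithmic representation of prices mentioned above could look roughly like the following. This is a minimal illustration and not something from the talk: the function names, the 4-bit width, and the price range p_min..p_max are all assumptions chosen for the example.]

```python
import math

def quantize_price(price, bits=4, p_min=1e-3, p_max=1e3):
    """Map a non-negative price to an integer code on a logarithmic scale.

    bits, p_min and p_max are illustrative assumptions, not values from the talk.
    """
    levels = (1 << bits) - 1                     # highest code value
    clamped = min(max(price, p_min), p_max)      # keep the price inside the range
    frac = math.log(clamped / p_min) / math.log(p_max / p_min)
    return round(frac * levels)

def dequantize_price(code, bits=4, p_min=1e-3, p_max=1e3):
    """Recover an approximate price from its code."""
    levels = (1 << bits) - 1
    return p_min * (p_max / p_min) ** (code / levels)
```

With these assumed settings, 4 bits cover six orders of magnitude, and each code step corresponds to a constant multiplicative factor rather than a constant additive one.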
So one thing you might be wondering is what are these components? I've talked
pretty abstractly about the edge and the link and so on. And so what is the new
architecture, if you will. So if we look at the picture I had at the beginning of the
talk, the operators are tuning link weights and setting penalty functions and such.
Under TRUMP, they're at most on a very coarse time scale computing the paths
on behalf of the routers, if we don't want the routers to do that themselves. And
they're tuning some offline tunable parameters. So they're not doing very much.
The sources in today's TCP are adapting their sending rate. Here we're adapting
the actual rate on each path. And today's routers are doing some form of
shortest path routing. Here they're not doing anything except computing prices
on the links and feeding them back to the sources so the sources can
appropriately split the traffic over multiple paths. And it's something I've talked
about with a number of people in this room is whether in fact this would allow the
routers to operate without a control plane. Right? Particularly you see the things
that are in parentheses here, who is doing this? The multiple paths can be set
up by the routers or by the management system. The prices could be computed
by SNMP polling of the link nodes or by the routers themselves. And so really
what's nice here is you could actually vary quite a bit what is actually done by the
routers and in fact could even imagine an implementation of TRUMP where the
routers do almost nothing except collect the measurement data about link load
on a, you know, some sort of round trip time scale.
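[Editor's sketch: the division of labor described above (links compute prices from load, sources adjust the rate on each path) can be illustrated with a generic price-feedback loop. This is not TRUMP's actual update rule, which this passage does not give. The log utility, the step sizes, and all function names are assumptions made for the example.]

```python
def link_price(price, load, capacity, step=0.1):
    """Link side: raise the price when load exceeds capacity, lower it otherwise.

    The price never goes below zero. step is an assumed tuning constant.
    """
    return max(0.0, price + step * (load - capacity))

def path_price(path, prices):
    """A path's price is the sum of the prices of the links it crosses."""
    return sum(prices[link] for link in path)

def update_path_rates(rates, paths, prices, step=0.1):
    """Source side: one gradient step on log(total rate) minus the path price.

    Paths that are more expensive than the marginal utility lose rate,
    and cheaper paths gain rate. The log utility is an assumption.
    """
    total = max(sum(rates), 1e-9)                # avoid division by zero
    new_rates = []
    for rate, path in zip(rates, paths):
        gradient = 1.0 / total - path_price(path, prices)
        new_rates.append(max(0.0, rate + step * gradient))
    return new_rates
```

For example, with two single-link paths where only the first link carries a positive price, one call to update_path_rates shifts rate away from the first path and onto the second. Whether link_price runs on the router or in a management system that polls link load is exactly the placement question left open in the talk.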
Okay. So I started off by saying the math was going to give us an architecture at
the end, and that was a little bit optimistic. So what we end up with is a division
of functionality but exactly what is the source and what is a router is still left a bit
open. So the sources here could be the end host, could be the edge router, or
even could be a mix of the two where maybe the end host is computing or
enforcing an aggregate throughput and the edge routers are doing the splitting
over the multiple paths. So there you could imagine an end host variant of this
which might be appropriate in a data center network. You could imagine an edge
router variant of this might be more appropriate in an ISP.
The feedback in everything we've assumed so far is explicit about the link loads
and the prices that are computed based on them, but we suspect but we're not
100 percent sure yet that an implicit version might be feasible. And of course
that would be real nice if we wanted an interdomain version of TRUMP, right,
because right now we're relying on quite a lot of cooperation from the network
elements that would be extremely problematic in an interdomain setting.
And finally the computation of the paths, of the prices, of the path rates could be
done at the links and the sources or could be done by the management system
itself and pushed down into the network elements and collected up from the
network elements if one wanted to. So in fact, the network management
system could implement almost this entire scheme just by monitoring link load
and pushing the sending rates down to the end nodes, or be
not involved at all if you put all that function in the routers.
And we view that as a plus that in fact those questions are left open because the
actual choice between these things might be driven by completely different
issues like trust and security issues about whether you trust the host or you trust
the routers. Maybe if you're Microsoft you trust the host and if you're AT&T you
might trust the routers and so you both have a variant of TRUMP that might
make sense, depending on who you are.
>>: [inaudible] when we ask the question you actually said not TCP so you were
thinking edge routers [inaudible] explicit so [inaudible] in your own mind it's not
real [inaudible].
>> Jennifer Rexford: Yes and no. You're right at some level but you could
imagine what the edge router is doing is shaping the sending rate, in which case
the host could send at whatever rate it wants but it's going to get rate limited as
soon as it tries to exceed the shaper. In which case you could either say well it
should rewrite its code so it doesn't hurt itself by exceeding it only to have its
packets dropped or you could also look at how well would somebody
implementing TCP do subject to a rate limit that's been computed by the network.
We haven't studied that, but one could do that. It would just be clumsy, I think, to
do it that way, but it's certainly feasible to do it. Yeah?
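[Editor's sketch: the edge-router rate limit described in this answer could be enforced with something like a token bucket whose rate is set by the network-computed sending rate. This is a policing-style illustration (excess packets are dropped, as in the answer) and not an implementation from the talk. The class name, method names, and units are assumptions.]

```python
class TokenBucket:
    """Drop-on-excess rate limiter; the rate would come from the computed path rate."""

    def __init__(self, rate, burst):
        self.rate = rate        # bytes of credit added per second
        self.burst = burst      # maximum stored credit in bytes
        self.tokens = burst     # start with a full bucket
        self.last = 0.0         # time of the previous call, in seconds

    def set_rate(self, rate):
        """Called whenever the network recomputes the allowed sending rate."""
        self.rate = rate

    def allow(self, size, now):
        """Return True if a packet of `size` bytes may pass at time `now`."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False            # packet exceeds the limit and would be dropped
```

A host that ignores the limit sees its excess packets refused, which is the "hurting itself" case in the answer. A host that adapts to the computed rate would not.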
>>: I think it would be really helpful to compare this to RSVP-TE and auto-bandwidth
to [inaudible] I'm wondering if the utility of going out and having to poll
the device [inaudible] pretty much [inaudible].
>> Jennifer Rexford: Oh, that's a really good idea.
>>: So it might be a very interesting adjunct --
>> Jennifer Rexford: Yeah, that's a great idea. Certainly MPLS could be used
here for the multipath establishment. And we've looked at how the splitting ratio
business can also be done using existing MPLS features as well. But we haven't
done the rest of the comparison you mentioned.
So I'll just conclude by saying we've been looking at implicit feedback based
extensions to TRUMP, and in particular we're interested in that because if we
wanted to do this in an interdomain setting we view sort of the reliance on explicit
feedback as a non-starter there, so we want to try to resolve that.
And I'll just end by touching on a question that came up at the very beginning of
the talk. And excuse me for running over. I talked purely about throughput here.
But you could imagine applications that would prefer a utility function that captured
delay. And so we've gone through this exact same exercise for delay and have a
different protocol that is structurally very similar -- multipath, feedback of prices, and
adaptation of splitting ratios -- for inelastic traffic that's
delay sensitive, and in a similar kind of way we can actually find the provably
optimal, stable, blah, blah, blah, and a variant of it that's practical. So pretty much
the same story, just with a different utility function.
Now, what if you have both in your network? You have both delay sensitive and
throughput sensitive traffic? Well, then essentially your objective is a weighted
sum of the two objectives in these two earlier studies. And what's actually really
an interesting gift from optimization theory is that that optimization problem can be
decomposed into two optimization problems that are largely independent, that
correspond to the two protocols that we just arrived at here. So essentially that
optimization problem tells us that the right thing to do is to run the optimal
protocol for each of the two classes and have an adaptive splitting ratio for each
link to use in allocating bandwidth to those two different classes of traffic.
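[Editor's sketch: one way to write down the decomposition described above. The notation is assumed for illustration and does not come from the talk: x and z are the path rates of the throughput-sensitive and delay-sensitive classes, U_T and U_D are their utilities, w is the weight, R and H map path rates to link loads, c_l is the capacity of link l, and y_l is the share of link l given to the throughput-sensitive class.]

```latex
\begin{aligned}
\max_{x,\,z,\,y}\quad & w\,U_T(x) + (1-w)\,U_D(z) \\
\text{s.t.}\quad & (Rx)_l \le y_l, \qquad (Hz)_l \le c_l - y_l, \qquad 0 \le y_l \le c_l \quad \forall l
\end{aligned}
```

[For a fixed y, the objective and the constraints separate into one problem in x alone and one problem in z alone, each of which can be handled by its own protocol. An outer adaptation of y_l on each link then plays the role of the adaptive splitting ratio between the two classes.]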
And so this we think was an interesting framework for what we call adaptive
network virtualization where the network is running two customized protocols in
parallel, but it's dynamically adapting what share of the bandwidth each link gives
to those two classes of traffic to maximize their aggregate utility at the end. And
so we think in general, this body of techniques that Mung and Steven and Frank
have been doing over the years is really quite exciting. It provides a way for us to
think much more methodically about doing network protocol design and although
there's certainly a place for engineering judgment and human intuition, the theory
at least points us in the right way so that we hopefully make better decisions.
And I'm sorry for running over. I'd be happy to take questions if I'm not already
over too long. Thanks.
