>> Ming Zhang: Hello, everyone. It's my pleasure to introduce Harry Liu from Yale
University. He has done three internships at MSR, and he has done a
lot of great work on software-defined networking. And I'll let him start. Harry.
>> Harry Liu: Okay. Thanks, Ming, for the introduction, and thank you guys for attending
this talk. Today I will present my research from the recent two years. Actually it was done in
this building, and it's about how we avoid congestion proactively under network
dynamics. So the story starts from the trend that people are using more and more
interactive applications. And it is a big challenge to support this kind of application
because people are very sensitive to the delays in their interactions. We know that
communication delay is one major part of the delay in interactive applications,
and if we take a closer look at the communication delay we can see it includes the
propagation delay, the in-network queuing delay and the re-transmission delay due to
packet drops.
Well, propagation delay typically is small and stable. But queuing delay and packet
drops can be significant when congestion happens. Therefore, for a network provider, if
they want to provide better support for interactive applications, the essential question
they want to answer is how to deliver the traffic without congestion. Well, this sounds
like a very classical problem and it's not very hard to answer when the network is stable.
However, we know that the network in reality keeps evolving due to various kinds of
faults like link failures and maintenance activities such as device upgrades. [Inaudible]
these dynamics, how to answer this question becomes tricky. Well, nowadays in the state
of the art, people propose a lot of intuitive ways to try to prevent congestion under
dynamics. For example, for the unexpected faults they [inaudible] capacity overprovisioning and they try to leave a lot of room in the network. Hopefully this spare room
can accommodate all the traffic spikes when the faults happen. And some of them put
interactive delay-sensitive traffic into the high-priority queue in their switches.
For another example, for the maintenance activities, they typically prefer to work in the off-peak hours. And before the maintenance they also make a lot of sophisticated
maintenance plans, and hopefully they can reduce the impact to the applications. Well,
these intuitive ways sound reasonable; however, if you ask me whether they work well, I
will say they do not. The big reason is that most of them require a lot of overhead and
they still cannot offer any guarantee. And this prevents us from offering a high-quality
SLA for the applications. And we think: why is the answer to this question still in the
dark? Fundamentally it's that we don't have much foundational understanding of this
question.
So this is the major motivation of my Ph.D. research and my research at MSR. In the
science space, what I am trying to do is to provide some systematic understanding of
how we avoid congestion under network dynamics. And based on this understanding I try to
design, implement and evaluate some tools for the operators, for them to
manage their network traffic in a safe way. And in this talk I will show all of them to you.
This is an overview of this talk, and it includes two projects. Each project centers on a single
concept. The first concept I will show you is Forward Fault Correction, and it's
designed for handling the dynamics caused by faults. And the second concept is called
Smooth Traffic Distribution Transition. It's designed for handling the dynamics caused by
maintenance activities.
First: Traffic Engineering with Forward Fault Correction. Let's first take a brief glance at
traffic engineering with [inaudible]. So given a network such as this small network with
link capacity of 10 and when one group of hosts want to send traffic to another group of
hosts in another location, it actually creates a traffic demand. And the network can
choose whether to support it and whether to fully or partially satisfy this demand. For
example, G1 wants to send traffic at rate 10 to G2. The network decides to fully satisfy
this demand and it also decides how to supply this, so it decides which path it'll use and
how much traffic each path carries. Of course sometimes the network can also partially
satisfy a traffic demand. For example, even if G1 wants to send to G2 at a rate of 100,
the network can also just give it 10 and decide how to deliver this 10.
Overall, a quick summary: traffic engineering makes two decisions. First, the bandwidth
each flow can send at, which we call the granted flow size. Second, how much traffic each
path carries. But traffic engineering is very critical if we want to use the network
efficiently. For example, suppose initially we only have two flows and we deliver them in
this way. And later here come some new flows, and to accommodate more traffic with
the existing network resources, we need to update and adjust the TE to handle
more traffic demands. And because the traffic demands change frequently, the TE
also needs to be updated frequently. According to a report last year, on average the TE update
interval at Google is around 2 minutes and at Microsoft around 5 minutes.
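As a minimal sketch of these two decisions (the names and data layout here are hypothetical, not the actual TE system's data model), a TE can be represented as a granted size per flow plus a split of that size over the flow's paths:

    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class FlowAllocation:
        granted_size: float                  # bandwidth the flow may send at
        path_split: Dict[str, float] = field(default_factory=dict)  # path id -> traffic carried

    # Example: a flow demands 100 but is granted 10, split 7/3 over two tunnels.
    te = {"G1->G2": FlowAllocation(granted_size=10.0,
                                   path_split={"tunnel_a": 7.0, "tunnel_b": 3.0})}
    assert abs(sum(te["G1->G2"].path_split.values()) - te["G1->G2"].granted_size) < 1e-9
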
All the TE systems have a very similar architecture. First they have a centralized TE
controller, and it relies on the monitoring system to give it the current traffic demands and
the network topology. And after the TE decision, it relies on the controlling system to
translate the TE decision into all kinds of rules and configurations and push them into the
network devices. Well, this kind of architecture is clean and simple;
however, it has a very strong assumption. It depends on the reliability of the monitoring
and the controlling. However, they are not 100 percent reliable. For example, we all
know that link failures or router failures are common, and due to these failures the
network topology is changing all the time. So that means the TE controller may always
have a stale version of the network topology. No one can guarantee [inaudible] second
that the network topology is still valid.
And another problem is that no one can guarantee that the controlling system can push
all the rules in one batch and successfully configure all the devices every time, because
sometimes configuration failures happen, which means the controlling system might fail
to configure some devices. And both kinds of faults can cause
severe congestion. Let's take a closer look at them. First let's take a look at the
configuration failure. [Inaudible] when the TE controller configures a router, it needs to go
through some steps. First it needs to go through the control network and send
the configuration to the router. The router's control plane accepts it, translates it into
flow rules and inserts the rules into the data plane. During this process, a lot of factors
can trigger a configuration failure. For example, sometimes the TE controller loses
connectivity to the router, and sometimes, due to overloaded memory or CPU or
software bugs, the control plane of the router stops working or stops translating the
configuration. And sometimes the data plane is short of memory and rejects the newly
inserted rules.
Last year Google reported that the rate of their configuration failure was around 0.1
percent to 1 percent which is not a small number. And these kinds of configuration
failures can cause congestion. For example, we've already seen this. To accommodate
some new flows, we need to make this TE update and this shows what happens. And to
finish this update we need to configure three switches, and this just shows what happens
when one switch, s2, fails to update and is still using the old configuration. And this link
will be congested. After the configuration failures, we consider hardware failures. We
know that a lot of factors can trigger link failures or switch failures, such as
unstable power, human misconfiguration or even a loose connection of a cable.
We did a large-scale [inaudible] in Net8075, and we just wanted to find the frequency of
link failures. The results show that, for example, in a 2-minute interval the probability
that there is a single link failure is above 10 percent. And in a 10-minute interval the
probability that there are three link failures is about 1.4 percent, which is also not a small
number. Well, hardware failures such as link failures and switch failures can also cause
congestion. Suppose this is the initial state of the traffic engineering and the link from s2
to s4 fails. What happens next is that, first, s2 will quickly detect this failure and it will
perform traffic rescaling, which means that it will disable all the tunnels going through this
link and send all the traffic to the residual tunnels. As a result it sends all the traffic
through this tunnel and congestion happens. Let's make a summary. What we
care about are the common faults in the networks, and they have two major categories.
The first is configuration failures, which we call control plane faults, which means that a
router takes a long time or even fails to implement some new configuration. And the
second type is hardware failures, or what we call data plane faults, which means a link or
router goes down, which shuts down all the tunnels going through it.
Well, we want to handle these kinds of faults, but we are facing a very big challenge; that
is, we only know that at a given moment we are almost assured that some fault will
happen, but we don't know exactly where the fault is. A straightforward way to handle
this challenge is that we don't even predict the faults but just react to faults. However, this
reacting approach has its own problems. It's not efficient. The first reason is that reaction
always happens after the congestion, so it cannot provide any guarantee against
congestion. And how much we suffer depends on how quickly we can fix it. But
unfortunately sometimes it takes a long time to finish the reaction, because a reaction
includes detecting the failure, recomputing the TE and updating the configurations all
around the network. Sometimes this takes a long time. And what's more, sometimes
the control plane failures also couple with data plane failures. For example –
Please.
>>: Oh, I'm sorry. Go ahead and finish your [inaudible].
>> Harry Liu: For example, some link goes down and we want to fix the congestion;
however, a control plane fault happens and we cannot finish this reaction, so we will
suffer congestion for a longer time. Since the reacting approach does not work well, we think
about the other side. So we think about a proactive approach. By proactive approach we
mean that we want to design a TE that is not only congestion-free in the current situation but
also robust to fault cases that can happen in the future. And that's why we introduce the
concept of Forward Fault Correction in traffic engineering. And it says that....
>>: That statement is [inaudible] incomplete, right? You can do that if you have no resource
limitations, right? You create – So the statement that you made has to be under some
constraint of how many resources you may use with that.
>> Harry Liu: Yes, yes, yes. I will make sure and [inaudible]. So all the resources I use
to do that will translate into the overhead we pay in network throughput. So we sacrifice
some network throughput. But we will show this kind of tradeoff later and we will
find something more interesting. And that's why we defined this concept, and it says that
we want to find a TE which spreads out the traffic over the network so that no
congestion can happen as long as the number of faults is under K. Okay?
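A toy brute-force check of this definition for control-plane faults might look like the sketch below (illustrative only; the data layout and function name are assumptions, and the actual system formulates this as constraints rather than enumerating fault cases): no link is overloaded even if any K switches keep their old configuration.

    from itertools import combinations

    def is_control_plane_ffc(old_load, new_load, capacity, K):
        """old_load/new_load: switch -> {link: load placed by that switch's old/new config};
        capacity: link -> capacity. True if no link overloads when any K switches are stuck."""
        switches = list(new_load)
        for stuck in combinations(switches, K):      # switches that fail to apply the update
            for link, cap in capacity.items():
                load = sum((old_load[s] if s in stuck else new_load[s]).get(link, 0.0)
                           for s in switches)
                if load > cap:
                    return False
        return True
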
>>: Do you characterize these router faults such that did the failures happen all at once
or do they gradually happen? Like in disks, disks don't just fail right away; they usually
send out some signal over time that they're failing [inaudible] and they're getting worse
and worse. So there's an intelligence in the disks, smart data collection, where you can
learn that the disk is going to go bad and take the [inaudible]. Is there something similar
to that with networking as well? Is that where you're going?
>> Harry Liu: Currently I think the failure means that the router goes down or stops
working and we shut it down.
>>: Does the router degrade or does it just go bad? I mean, is there a way for you to
detect earlier that it's degrading forwarding traffic such that you can be smarter in
probing and evaluating instead of waiting for a complete failure?
>> Harry Liu: Currently we didn't observe that. From the log what I observed is that it
shuts down, from the log. And if it gradually goes down, I think it can be translated into
the link failure first. So...
>>: I don't understand the last – When it gradually goes down it can be translated into
the link failures. I think the question is that you could proactively do something if you're
sort of querying these routers to see if they're getting into a bad state or not. So when
you looked at the logs, did you actually analyze what led up to the failure or did you just
say, "Okay, a failure is happening."
>> Harry Liu: It only says the failure happened, because the log is from SNMP.
>>: So there's no performance even really when all you have is – Okay, fine. But what
Ronnie's asking that might actually be true if we had better measurement techniques, we
had more [inaudible].
>> Harry Liu: Right, right.
>>: But in reality most of these failures are not caused by [inaudible]; they're mostly
caused by software bugs. So if it happens, it happens. It was triggered by something.
>>: I see.
>>: It's a valuable [inaudible] we have not acted on that. We are treating them as
[inaudible].
>>: Even if – You know, you might also notice that it may not be – there might not be
any failures, but most management tools work on sort of thresholding things, right? So
you sort of see that queues are starting to build up or it's not acting fast
enough. That is a symptom of why that is happening. I don't know if you've looked at that
or not, but a lot of the tools actually provide that information. Like smart CMCs provide
[inaudible].
>> Harry Liu: Okay. Currently in this model what's important is whether the router is
problematic or not; if it's problematic, all the [inaudible] switches will try to move the traffic
out of that switch. So it's kind of --.
>>: Not even with like the [inaudible].
>> Harry Liu: Of course we have different faults, so we have different K. And maybe it
reminds you of some similar concepts in other fields, and that's right. We got this
inspiration from FEC, Forward Error Correction, in data encoding, which says that the
receiver can recover all the information as long as the number of lost packets is under K.
Well, this similarity is not [inaudible]. Actually we're sharing the same insight, which says
that, yeah, faults are quite random and we don't know exactly where the individual faults
happen. However, statistically the number of faults can be very small and stable. So if
we achieve this concept in traffic engineering with a reasonable K, that means our TE
can be robust to the majority of the possible fault cases in the future.
>>: So Ming just made a comment that these failures happen because of software bugs.
So if you're using the router or the switches from the same vendor and they've all been
updated, shouldn’t they not be that random? Right? If there's a new software update that
[inaudible]...
>> Harry Liu: You know, to me...
[Multiple inaudible comments]
>>: ...[inaudible] one, right?
>>: [Inaudible].
>>: Well, either one wins.
>> Harry Liu: Yeah, I mean I think from my observation the failure of the router usually comes
from the power. Sometimes it just keeps rebooting, or unstable power [inaudible]
shutting down. So for [inaudible] sometimes it's triggered by some cases, so sometimes
only some individual failures happen maybe. Okay...
>>: I [inaudible] asking about correlated failures or like...
>> Harry Liu: Yeah, yeah, yeah.
>>: You haven't said what we do about [inaudible].
>> Harry Liu: And even if some failures are correlated with each other, as long as the
number is not that large, so it's covered by K, it's still okay. Okay, FFC: this is an
ambitious concept, and before we show how to realize it, let's take a look at some simple
examples to show it does exist. In this example we show the FFC for the control plane
faults, and we've already seen this update before. We know that when s2 fails to update,
it will have some congestion. However, if we are smarter we can have an FFC TE –
and this FFC is for K equal to 1. And the only difference
between FFC and non-FFC is that on this link we only send 7 units of new traffic. And because
initially the blue flow and the green flow have 3 units of traffic on this link, if we send 7 here,
then as long as the two do not fail to update together at the same time, we are okay. Of course
we can increase the protection level. For example, K equals 2, and in this TE we only send 4
units of traffic on this link, and because initially it has 6, that means that even if s2 and s3 both
fail to update, we're still okay.
And from this we can also see the trend that when we increase the protection level, the
network throughput goes down. And it shows the fundamental tradeoff between
robustness and utilization. And later we will show much more interesting results;
for now we just see this tradeoff. And this case shows the data plane FFC. For
example, this traffic engineering is non-FFC because even if just one link fails, it will
trigger congestion after the [inaudible]. However, if we just reorganize the traffic
smartly, such as in this [inaudible] TE with K equal to 1, no
single link failure case will cause any congestion. For example, this. After the rescaling
there's no congestion, and because the topology and the [inaudible] are symmetric, you
can easily verify that for any single link failure it has no congestion. Well, now let's
consider how we realize the FFC concept. We are facing some large challenges.
The first is that we are handling a huge number of fault cases. For example, the network
has [inaudible] and we want to design a TE that's robust to any arbitrary combination of K
link failures; that means we are handling [inaudible] K cases. And if we consider
various types of faults together, this number will be even larger. And of course we need
to consider the overhead in the network throughput because, otherwise, there is a very
simple solution: we send nothing into the network so we reach [inaudible] robust
[inaudible]. But of course that is not what we want.
Here is the roadmap. We have already introduced the background, the problem, the definition
and the challenge of FFC. Now we come to the practical realization of FFC. First, I want you
to compare two kinds of constraints. The first, of course, is the FFC constraint. It says that
arbitrary K faults cannot cause congestion. And the second type of constraint says that
the sum of arbitrary K variables out of n ones is within an upper bound. Okay, they
may seem to be [inaudible] with each other, but you can feel there are some similarities
between these two kinds of constraints. And if you can feel this similarity, you can guess
our key idea for solving this problem. Our key idea is that first we formulate the FFC
constraint for each individual type of failure, and they have different forms. However,
what we find is that we can transform them all into a single format, which is the K-sum
constraint. And the benefit of this transformation is that this kind of constraint can be
efficiently solved with [inaudible] sorting networks. We can uniformly and efficiently solve
all the types of faults for FFC. And in the next few slides I will first introduce this part
because this is the core.
We can equivalently transform the K-sum constraint into another form. What it's actually
saying is that the largest sum of K variables out of n is upper bounded.
Well, because when we have the variables we actually don't know which is larger and
which is smaller, there is a straightforward way to express this kind of constraint:
we try all the combinations of K variables, and then we will have [inaudible] constraints.
And with a reasonable [inaudible] the computing time of the TE will be very large.
However, what if I tell you that we have a way to equivalently compress this
number of constraints into O(Kn) constraints? And as a result the computing time is within
1 second. So how can we do that? First, let's be crazy enough to actually sort the
variables. Actually we can sort the variables, and first of all we can sort two
variables. Given any two variables, we can introduce two new variables, Xmax and
Xmin, and add these two linear constraints. So no matter what X1 or X2 is, Xmax is
always the larger one and Xmin is always the smaller one. And what we want to do is
extend the two-variable case to n variables. It says that given n variables, we want to introduce a
collection of new variables Y and a collection of constraints between X and Y so that Yj
is always the jth largest element in X.
And to achieve this we borrow the idea of the sorting network. This is an example of
a sorting network that is widely used in circuit design. Let's first look at the legend. It
shows a gate with two inputs, and you always put the larger one of the
inputs up and the smaller one down. And we can use this gate to organize a network
that takes an arbitrary number of inputs, and the outputs are always the sorted version of
the inputs. You can use the blue numbers to make a check, and it shows which
[inaudible]. And we've already shown that for two variables we can achieve this kind of
compare-and-swap gate. And if we organize them just as in the sorting network, we can finally
get outputs representing the sorted version of the inputs. Of
course what we care about are only the largest K elements, so we don't have to sort
all of them; we use bubble sort and each time we bubble the largest one out. And after K
passes we already have the K largest variables and it's done. And we know that the
number of comparisons is under K times n. That's why I say the complexity is O(Kn).
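The following sketch shows how this bubbling idea can be written down as linear constraints (assuming the PuLP LP library; the names are illustrative and this is only one reading of the construction, not the actual implementation): each compare-and-swap gate introduces a pair of variables whose larger output dominates both inputs while the pair's sum is preserved, and K bubble passes expose upper bounds on the K largest values, whose sum is then bounded.

    import pulp

    def compare_swap(prob, a, b, tag):
        # Relaxed comparator gate: hi >= max(a, b) and hi + lo == a + b,
        # so lo implicitly plays the role of the smaller value. Tags must be
        # unique within one problem so variable names do not collide.
        hi = pulp.LpVariable(f"hi_{tag}")
        lo = pulp.LpVariable(f"lo_{tag}")
        prob += hi >= a
        prob += hi >= b
        prob += hi + lo == a + b
        return hi, lo

    def k_sum_upper_bound(prob, xs, K, B):
        # Enforce: the sum of the largest K of the expressions xs is at most B,
        # using K bubble passes, i.e. O(K * n) comparators and constraints.
        wires = list(xs)
        top = []
        for k in range(K):
            for i in range(len(wires) - 1):                  # one bubble pass
                hi, lo = compare_swap(prob, wires[i], wires[i + 1], f"{k}_{i}")
                wires[i], wires[i + 1] = lo, hi              # larger value moves toward the end
            top.append(wires.pop())                          # (an upper bound on) the k-th largest
        prob += pulp.lpSum(top) <= B
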
And here I want to mention Mohit; from discussions with him I got a lot of inspiration
and finally reached this point. Okay, I have already finished this part, and it is the core part. I
will [inaudible] transformation [inaudible] because it introduces a lot of notation, but I
want to mention that the key observation and the key contribution of this work is that we
show that for all the faults we care about, we can always transform them into this
single format, and we show that it can be solved uniformly and efficiently. Okay,
previously we considered the faults one by one, or one type by one type. Finally we
want to find a TE that can be robust to all of the faults at the same time. The solution is
simple: we just add all the FFC constraints together and get a TE that is robust to all
kinds of faults. This conclusion seems simple and intuitive, but how to prove it
[inaudible] is not that straightforward; today I will [inaudible] this part.
So we come to the evaluation. We use both a testbed evaluation and a large-scale trace-driven simulation. We used the testbed to emulate a global-scale network with
eight switches. We mapped each switch to a location in the world, and we used
geographical distance to simulate the delays along the links.
>>: So I haven't gotten a sense yet – You were here at Microsoft with us, right?
>> Harry Liu: Mmm-hmm.
>>: How much of these links are actually under-provisioned for you to be able to do –
So if all the links are completely saturated or close to saturation, your technique doesn't
really matter?
>> Harry Liu: Yeah, I will show it later in the evaluation, but what I want to say is that
there are two observations. One, if the network is under-utilized, we will show that we
prevent congestion and we don't actually have any loss in network throughput
because the overload [inaudible]. And even in the case where we want to fully
utilize the network, because there are different types of traffic, if we only protect the
high-priority traffic with a high protection level versus a low protection level, and even if
we don't protect any [inaudible] traffic, we will show that we still gain
a lot. So at least for the high-priority traffic, we don't lose any throughput but we almost
reduce the packet loss to zero.
>>: Right. But that comment is not – [Inaudible] you decide that you lose the lower-priority traffic and not the high-priority. I mean, you just reconfigured flows or whatever.
But my question is actually – maybe where I'm going with this is that potentially you're
saying that in order to get a certain amount of reliability from the network, you actually do
have to – the links have to be – or in a sense you need to create more links even if they
are not --. So you have a baseline at which you [inaudible] the entire network that makes
sure that all the servers are going in line, right?
>> Harry Liu: Right.
>>: And [inaudible] not bottleneck. But beyond that you still probably need to wire more
to provide this level of reliability.
>> Harry Liu: Right, right.
>>: Is that where you would like to go?
>> Harry Liu: Sometimes we don't waste anything because, for example, for the high...
>>: [Inaudible]. But anyway, go ahead.
>> Harry Liu: For the high-priority traffic, we first lay out the high-priority traffic, and it
needs some spare capacity to prevent congestion. However, this kind of spare capacity
can be used by the lower-priority traffic. And finally we still reach a high level of
throughput, and when a link gets congested, what we drop is the packets of the lower-priority traffic. So what do I pay? [Inaudible] what do I pay? In the sense of the network
throughput, we don't pay too much.
>>: Okay, [inaudible] my point. So maybe in particular data centers you have clarity of
high-priority and low-priority. But in other places if you want to kind of use the same
technique, for example in enterprise networks, it's not clear what is high priority and what
is low priority. So in that case I don't know what you [inaudible].
>>: Right. So in that case how much does the network need to be over-provisioned for
these techniques to work?
>> Harry Liu: Let me show...
>>: Or another way is how much do these links have to be under-utilized in order for this to
work?
>> Harry Liu: So if the network load, if the traffic demands, are large enough, in our
evaluation we pay about 10 percent of throughput to gain this [inaudible] property. If the
network load is not that high, for example if we scale it down by half, then under this load
we don't lose any throughput. We can always fully deliver all the demands. It's only a
matter of how we lay out the traffic so that it can be robust enough.
>>: And going back to Victor's question: if there is no traffic prioritization and you've
already highly utilized links...
>> Harry Liu: [Inaudible] yeah.
>>: Is there nothing you can do? I mean, what do people do in that case?
>> Harry Liu: I think it's a very good question. And our argument is that without any
priority, no one is running the network [inaudible] at that high utilization. And if we only
have one priority, people typically use over-provisioning to protect it. And our [inaudible]
is to offer a better way to lay out the traffic to reach a better result. Okay. And with this
testbed we want to show two things. The first is to show what exactly
happens in real time after a failure happens. And we show why FFC-TE can save data loss.
So let's see a specific case that includes 5 switches. The link bandwidth is 1
gigabit per second and there are 2 flows, blue and green. First we show the FFC-TE with
Ke equal to 1, and it's like this. The blue flow uses two tunnels and equally splits
the traffic, and the green flow also uses two tunnels and equally splits the traffic. I
omitted a number here. And one link fails. What's going on? It's like this. Let's set
the moment of the link failure as time 0. First, s6 will detect this failure and then it will report it
to s3. And then, when s3 knows about this failure it will perform the rescaling. And...
>>: Why did you choose this configuration? Is this a typical configuration?
>> Harry Liu: This?
>>: Yeah.
>> Harry Liu: This is FFC – This clearly shows why it's FFC because...
>>: Well, obviously. That's why I'm asking. If you can come up with an example which
really highlights your – So I'm asking why did you choose this?
>> Harry Liu: Oh, because this is [inaudible]...
>>: Is there any backing for this? Is there any reason to say that this is actually very
typical? Or are you just doing it to show FFC?
>> Harry Liu: Oh, I'm showing a very intuitive case of what happens when a link
failure happens, so I can give you a sense of how FFC works,
what FFC can help with and what it cannot help with. So, for example, before the rescaling
s3 does not know about this failure. As a result it will keep sending traffic into the tunnel,
and all that traffic will get lost. So this packet loss cannot be reduced by FFC. However,
what FFC guarantees is that after the rescaling there is no congestion. However, suppose we use
a non-FFC TE in which the only difference is that the blue flow uses this tunnel rather than
this tunnel. When this link goes down, what happens? At first it's exactly the same as
with FFC, and they also suffer from this loss. However, after the rescaling it stops the
loss on that tunnel but triggers another congestion on this link. That's the key difference
between FFC and non-FFC.
And if we take a closer look you can see the difference clearly. For example, for FFC
it only takes some time to do the rescaling, and after the rescaling it is done. There is
no need to recompute the TE or update the network. No need. And if we use non-FFC,
okay, after the rescaling there is congestion. If we want to fix it, we're going to spend a
lot of time to do it. And if somehow we're lucky, we can fix the congestion quickly, but
sometimes we're not. And it clearly shows why we would want to use the proactive
approach rather than the reactive approach. For the larger-scale evaluation, we chose the
topology of Net8075. And the traffic demands – we rescaled the traffic demands a
little bit to show three different kinds of traffic loads. The first is the scaling factor 1.0,
which means it's a well-utilized network. When the scaling factor is 0.5, it means it's
an under-utilized network, or what you call a well-provisioned network. And when the
scaling factor is 2, it's an overloaded network.
And for the failures we used a 1 percent failure rate for configuration failures, and link
and router failures according to the real trace. And we compare two TE
algorithms: non-FFC -- the basic TE without FFC constraints -- and FFC. First we show
the result when we protect all the traffic with a single protection level. And here
it shows the results. We use two kinds of metrics; the first is the throughput ratio, which means
the FFC throughput normalized by the non-FFC throughput, so the larger the better.
And the second is the data loss ratio, so FFC's lost data normalized by non-FFC's lost
data, so with this, smaller is better. And we can see the results here.
For a scaling factor of 0.5, which means the network is well-provisioned, we can see
that we reduce the data loss by many times and at the same time we are very close to
the optimal throughput. And of course when we enlarge the traffic scale, we suffer some
loss of throughput. For example, for the scale equal to 1, we lose 10 percent of
throughput. Okay, we think this 10 percent is too much. We want to improve it. And to get
the intuition for how to improve it, we first studied the trade-off between the
throughput ratio and the data loss ratio. On this curve, each point is the solution at a
specific protection level. For example, this is the non-FFC, and FFC with Ke equal to 1, 2
and 3. And we can also see that this curve is related to the traffic scales; when the
traffic scale is small we can see that we only pay a very small fraction of
throughput but avoid a lot of data loss. And it occurred to me that, okay, in our network
not all the traffic needs high-level protection. We need to differentiate. So we followed
[inaudible] definition. We defined three types of traffic and we used three different
protection levels to protect them.
For the interactive traffic with high priority, we used a very high protection level. And for
the elastic traffic we used medium protection level. And for background traffic we don't
even protect it. And as a result we get this. We can see that if we use the different
protection levels: say, at a very high protection level the high priority traffic has almost no
loss. And at the same time it has the optimal throughput. And of course we can also
observe that for the medium type of traffic, it has much lower data loss and only suffers 2
to 3 percent throughput loss. For the throughput, it's similar in the low priority but what
we are talking about is the tradeoff so there's no free lunch. And we can see what we
pay is that we get higher low-priority traffic loss. However, this might be the cheapest
thing we can pay. And it also shows that FFC actually gave [inaudible] that it can twist
the tradeoff and that we reach the point that it will benefit us a lot.
And here we also show that if we only use a priority queue, it sometimes does not work,
because if we use non-FFC, even if the high-priority traffic is in the high-priority queue, it
still suffers from loss. And if we use the priority queue and FFC together, we
achieve the goal of protecting the high-priority traffic.
>>: [Inaudible] such a big gap between the medium and the low. You go from under
20 percent to almost 100 percent?
>> Harry Liu: This?
>>: The one you have circled there.
>> Harry Liu: Here?
>>: And then the point before it or the bar is only about 16 percent and then you jump
up to nearly 100 percent. It seems like a huge jump. Is there any intuition why it's so big?
>> Harry Liu: You mean, between this and this?
>>: Yeah.
>> Harry Liu: Oh, this is protected. So the medium priority is protected so when some
single link failure happens it won't have any congestion. However, the lowest level of
traffic is not protected.
>>: Ah, I see.
>> Harry Liu: Okay, here we come to the conclusion. First we propose FFC, and we
believe this is necessary in large-scale traffic engineering because of all kinds of
faults. We designed an efficient algorithm to realize this concept. And then, we also
performed some evaluations. Especially we point out [inaudible] if we differentiate the
different types of traffic with different levels of protection, we can gain a lot. But network
faults are only one type of cause of network dynamics. The second source
of dynamics is maintenance activities, especially in the data center network.
From time to time people perform a lot of maintenance, so we also want to protect the
network from congestion under this kind of maintenance. Sure.
>>: Before we go on, I was just wondering: you made this comment earlier on that
Microsoft's update interval is 5 minutes and Google's is 2 minutes. Can you say something about
why we're not as good, or we're twice as bad as Google?
>> Harry Liu: You know, Google – the number I reported for Google is an average; actually they
update the TE whenever something happens. So when some link goes down and
when some link comes up, they always react to it. That's why on average they are
faster. But that does not mean that we cannot do that. We just – I think we show the
curve that 5 minutes can be enough to reach a high utilization.
>>: So I think Google and Microsoft [inaudible]. The Google number is from their
production network. They react to all the failures versus [inaudible] right now it hasn't
been deployed; therefore, we don't have the failure rate. Once you consider failure, it
might be lower than 5 minutes.
>> Harry Liu: I know. [Inaudible] you've already seen this a few times. Sorry for the
redundancy. Let's begin. In the data center network, applications generate a lot of traffic and
rely on the network to deliver it. So when the network changes, it also impacts the
traffic distribution on the network. For example, sometimes we want to upgrade a
switch, which requires rebooting, and before that we want to move all the traffic out of
this switch, do the rebooting, and after it finishes we want to move the traffic back.
Sometimes we introduce a new switch into the network, and we want to redirect some
existing flows to go through this new switch to make use of the new capacity.
And when the network topology is changing, the applications are also changing their
traffic demands. For example, if an application has virtual machines and these virtual
machines have traffic flows with each other, when the application decides to migrate
some virtual machines it also reshapes the traffic flows. And all the activities I mentioned
here are network updates. And we can see that network updates are performed every
day inside data centers; however, it's still a great pain for the operators. Let us
study the story of an operator whose name is Bob. One day he wants to upgrade all the
devices in his network. He knows that some applications might have some concerns
when he performs it, so he first spends a lot of time negotiating with the applications.
He makes a very detailed plan and even asks his colleagues to revise this plan. And
then he chooses an off-peak hour to perform it; however, even though he exactly follows his
plan, he still triggers some unexpected application alerts. And what's even worse is that
some switch failures force him to backpedal several times.
Eight hours later Bob is still struggling because he has made little progress but received a lot
of complaints from the applications. He has to stop now and becomes very upset. From
this story we can see the problems in the state of the art when we plan a network
update. First, the planning stage is too complex and we still face some unexpected
performance issues. And it requires too much human effort to perform. Let's
take a closer look, because we care about the applications when we perform updates.
We first think about what the application really wants. What the application wants is,
first, reachability and, second, low latency. In a data center we have multiple paths, so
reachability is not a big issue. For low latency, as we mentioned, the key is that we
don't have any congestion. However, making a congestion-free update is very hard
because, as we will see later, even a small change will involve many switches. To
avoid congestion we need to make a multi-step plan. We have different update scenarios
to plan for, and one plan you make for one scenario cannot be reused in another.
And sometimes when you change the network the applications are also changing their
traffic demands, so you've got to consider these interactions. Now I'll introduce the high-level challenges when we want to achieve congestion-free updates. We take a case
study which shows a very small-scale data center topology; all the switches are
using ECMP and the link capacity is 1000. And we highlight this link because later we'll
see this is the bottleneck link in the network, and there are three flows going through
this network. First, a top-down flow with a flow size of 620. This type of flow [inaudible] goes
through the network to the destination, so it puts a 620 load onto the bottleneck link.
The second flow is a ToR-to-ToR flow. This kind of flow has multiple paths, and because
ToR1 is using ECMP, AGG2 receives part of it, and AGG2 is using ECMP so Core 3 gets 150.
And after the traffic reaches the core, it directly uses the single path towards the
destination. We can see the green flow puts a 150 load on the bottleneck link.
The third flow is similar to the green flow; it goes through the network and finally it
puts a 150 load on the bottleneck link. In summary we have a 920 load, so initially we
don't have any congestion. And now we want to perform some updates; for example, we
want to update AGG1. Before that we want to drain it. What we need to do first is
reconfigure ToR1 to stop it from sending any traffic to AGG1. However, this simple
reconfiguration will trigger congestion because it will redistribute the green flow over the
network, and as a result the bottleneck link gets congested. This means that this simple
solution does not work; we need a smarter solution. One smarter solution is that when
we reconfigure ToR1 we also reconfigure ToR5, changing it from ECMP to weighted ECMP.
And if it splits the traffic with 500 and 100, it will redistribute the blue flow over the
network. And as a result, we don't have any congestion. So let's make a summary.
Initially we have one traffic distribution, and finally, to perform this upgrade, we want to
give the network another traffic distribution. And either of them is
congestion-free. The only question is whether the transition period is congestion-free.
And we know that to achieve this transition we need to configure two switches. It
sounds very simple, but actually it's not, because the [inaudible] reason is that we cannot
change the two switches at exactly the same time. In the face of this asynchronization,
one switch must be changed first. Let's study what happens if ToR1 is
changed first. During the period when ToR1 is changed but ToR5 is not yet, we have
congestion here. And one might think, "What if we change ToR5 first?" It will congest
another link. So we omit this part, but you can already jump to the conclusion that
although initially and finally we don't have any congestion, there is no way to make the
[inaudible] transition congestion-free.
So what do we think about this? Can we introduce an intermediate traffic
distribution that has the property that from the initial to the intermediate and from the
intermediate to the final, the transition is congestion-free regardless of asynchronization? But
does this kind of intermediate state exist? Yes, and this is the example. We will show
later how we find it, but now I want to point out that even with such a small-scale network
where we only manipulate two flows, it's already not easy for us to come up with a
congestion-free update plan. So think about the operators who are operating a thousand
or more switches and millions of flows. Obviously they need a very powerful tool to save
them from the complexity of making plans for network updates. This is the major
motivation for why we introduced zUpdate. zUpdate stands between the operator and
the network and it keeps monitoring the current traffic distribution. And when the
operator wants to perform some update scenario, it translates the update requirements
into constraints on the target traffic distribution.
And zUpdate will find such a target traffic distribution and then some necessary
intermediate traffic distributions to make the whole process smooth and lossless. And
based on this it will configure the network automatically. And I think the one major
contribution of this project, if we pick one thing, is that despite the fact that we have
different kinds of scenarios, they actually share a common functionality. It's like a
system call in the network, which we call Smooth Traffic Distribution Transition. And
zUpdate just [inaudible] this.
>>: Is this dependent on the fact that while you're trying to figure out what these
intermediate plans are the traffic distribution isn't changing? It has to essentially be the
same from the time you start to when you finish? Is that an assumption that has to be
made here?
>> Harry Liu: Yes, it can change, but if it changes we use some upper bound to estimate
the worst case of the flow size so that we can still guarantee this is congestion-free.
>>: But there certainly could be cases where it changes enough that you, what, you
have to bail out completely? Or are you guaranteeing that there will always be
intermediate states that you can move to?
>> Harry Liu: I think – So if the traffic prediction is not accurate, yes, we will suffer from
[inaudible] congestion. However, if we do the transition fast enough and if
we make the upper bound conservative enough, we can have a congestion-free plan. The trick is that if you are always conservative when you predict the flow, you
will compute more steps, so it will take a longer time to do the transition. Otherwise...
>>: I mean, the longer it takes, the greater the probability that that distribution
underneath you is changing.
>> Harry Liu: Right. Right. But because you are picking a very high upper bound on the
traffic, so when it changes [inaudible].
>>: The question [inaudible].
>>: What?
>>: [Inaudible] take an upper bound but take a long time to do it, so [inaudible].
>> Harry Liu: Yeah, yeah.
>>: [Inaudible] whether the upper-bound is [inaudible].
>> Harry Liu: If the upper bound is very large, I have to use many steps to do it, and I
have a maximum number of steps. So if the number is too large, that means
this time is not suitable for the update. So I just tell the operator, "Okay, hold on. I cannot
do it now." And typically what we suggest is that we do it during the off-peak hours when
the network has enough capacity and the traffic flows are not that large, so it typically
takes a shorter time to do it.
>>: So you have [inaudible].
>> Harry Liu: Sometimes, yes. We don't have to satisfy all the requirements. Sometimes
you find out that the network's [inaudible] is too [inaudible] and we cannot do it. I just tell
the operator, "We cannot do it." And to realize zUpdate, we are still facing some
technical issues. In the following part of the talk I will first show you how we describe a traffic
distribution. With this formulation we represent all the update requirements. We
also define the conditions for a congestion-free transition. And with these conditions
we show how we compute and implement an update plan.
First we introduce the notation L^f_{v,u}, which means flow f's load on the link from switch v
to u. For example, in this network, for the flow f, if s1 is using ECMP, the load value from
s1 to s2 is 300. And if s2 is using ECMP, the load value from s2 to s4 is 150. And a
traffic distribution is just the set of these values, enumerating all the flows and all the links. And with
this formulation it is easy for us to represent the update requirements. For example, if we
want to drain s2, we just simply require that the load from s1 to s2 is 0.
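As a tiny illustration (the numbers follow the slide example; the dictionary layout is just for exposition), a traffic distribution can be kept as per-flow, per-link loads, and the drain requirement becomes a zero-load condition on the drained switch's incoming links:

    # flow -> {(v, u): load}, i.e. L^f_{v,u}
    l = {"f": {("s1", "s2"): 300.0,    # s1 splits flow f by ECMP, so 300 goes toward s2
               ("s2", "s4"): 150.0}}   # s2 splits again by ECMP

    def drain_requirement_met(l, drained="s2"):
        # Drain requirement: every flow puts zero load on every link entering the drained switch.
        return all(load == 0.0
                   for flow in l.values()
                   for (v, u), load in flow.items()
                   if u == drained)
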
>>: Something [inaudible]: are you assuming a Clos topology? Or is the technique
more general [inaudible]?
>> Harry Liu: The topology is generally for – actually for [inaudible] this kind of
data center network topology. And the constraint is not from here; the constraint comes in
if the network is not symmetric and some paths are longer, some paths are shorter. So
during the transition – and later we'll see – there's no guarantee that some link will be
congestion-free. So that's the major concern. That's why we don't claim it works in a
general topology, but it does work in the typical data center topology.
>>: I mean, even in the topologies that you mention, there's a little bit of [inaudible]. You
can have different [inaudible]. Say you have [inaudible] or something. Some of your
network is [inaudible]. Requiring the topology to be completely symmetric is...
>> Harry Liu: Oh, if they're using ECMP – So you mean even in the data center the path
lengths are different?
>>: They could be. I'm not saying it is.
>> Harry Liu: If that's the case, rigorously we cannot prove that our [inaudible] works in
that kind of topology. And our model works well in the typical [inaudible] Clos, with
[inaudible] paths from a ToR to a non-ToR.
>>: In both the presentations – although this one is not finished, I saw this in the
previous one too – I think the way you have gone about it is you first said, "This is
what we have and we're going to try to prove and try to get to this point." Another way
[inaudible] is to flip the whole problem around and say, "In order to give you certain
guarantees, we expect these properties from the topology." That way you can
say, for example – going back to the previous problem – how over-provisioned or under-provisioned, whichever way you define it, the links have to be. Similarly for this sort of thing: what
sort of redundancies do you need to ensure that you can always, 100 percent of the time,
upgrade your switches whenever you want [inaudible]? From a network
design perspective that is a very interesting question, because then that sort of says, "I want
to get 100 percent reliability; how much money do I have to pay for that?" Instead what I
get here is: some of the time you can make it work when there is not a lot of traffic. Other
times we maybe can make it work. You know what I mean?
>> Harry Liu: Okay, I see.
>>: I think the elements of the problems, you've already solved some of the elements of
the problem but just switch the way you look at the problem.
>>: So in reality, because most maintenance work is done in off-peak hours, you can
easily find a solution. People are unlikely to perform maintenance tasks during peak
hours, or the operator just won't do it. So regarding the question...
>>: But even in that situation when the assumption is that there is actually a [inaudible].
That means 24-7 there are a certain number of hours. But my contention is if you can
claim global, there will not be a [inaudible] because you're all the time using [inaudible].
>>: So this is per data center. And also even when there is no [inaudible], today when
you perform an upgrade you know there will be some performance disruption but you
don't know how bad it could be. And with this framework it will actually not only tell you
whether it will be congestion-free but it can also tell you how much loss you will incur if
you perform it now. And you decide whether you want to do it based on how severe the
congestion would be.
>>: I know. I get that part. I'm just saying, if I was willing to throw money at the problem
to 100 percent reliability – But of course I wouldn't throw [inaudible]. I want to spend the
bare minimum money but I still want to 100 percent reliability [inaudible]. And what would
that topology be?
>>: So based on today's traffic...
>>: [Inaudible].
>>: Yeah. Based on today's traffic characteristics you don't need to pay anything in
terms of actual capacity to perform this because naturally there are just traffic variations.
In terms of FFC, as you said, if you look at this one type of iDFX-style network [inaudible]
there are different priorities. You can trade off the loss of low-priority traffic to gain
reliability for high-priority traffic. This is the actual benefit you can get; unless you think
they are equally important, in which case you use the basic TE scheme.
>>: So if the network is completely [inaudible], right, then updates are easy. I mean there
is no problem because there won't be any loss.
>>: So even if the average network utilization is low, there will still be some [inaudible]
links.
>>: Okay.
>> Harry Liu: Regarding the first question, we do have a definition of what kind of
network topology zUpdate fits. It's in the paper, but I didn't bring it into the
presentation just because it's too detailed. But that's a good question [inaudible] network
topology. Shall we go on?
>>: Yeah, keep going.
>> Harry Liu: Okay.
>>: You need to speed up. You may have less than 15 minutes.
>> Harry Liu: It's enough, I think. Yeah, if we want to drain s2, we simply require that the
load from s1 to s2 is 0, and generally we require that, you know, all the flows on all
the switches put nothing onto s2's incoming links. For another example, when s2
recovers, we want to reinstall ECMP over the network, and we just require that s1
equally splits the traffic. And generally we require that, for every flow, each switch
equally splits its traffic among all of its outgoing links. I have only shown you two
simple examples; in the paper we formulate all the common update scenarios that
we have mentioned. And now we consider the transition. To achieve a traffic distribution
we have mentioned. And now we consider the transition. To achieve traffic distribution
transition what we do is we need to install new rules into the network, and because of
the asynchronization it takes a period to finish the installation. And during the period
there will be some cases that some switches are updated and some switches are not.
And when we look at the link load from switch 7 through switch 8 we can see that there
are 5 switches which do not impact the load on this link. And generally we know that
asynchronization in a switch creates too many potential link loads that prevents us from
analyzing whether congestion will happen during the transition.
And to handle this problem we use two-phase commit, which means that when we
install the new rules, we still keep the old rules in the switches. After we ensure the
installation is finished, we instruct the ingress switch to make a version flip, which
means that it will tag the incoming packets with a new version tag, and this kind of packet
will be treated by the new rules. I think this part just answers the question of why we
require this kind of topology and why it does not work on a general topology.
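A minimal sketch of the two-phase commit idea (illustrative only; the rule layout and the switch s9 are made up): both rule versions coexist in the switches, and the packet's version tag, flipped at the ingress only after installation finishes, decides which version is used.

    # switch -> version -> flow -> {next hop: split weight}
    rules = {"s7": {1: {"f": {"s8": 1.0}},               # old version: all of f toward s8
                    2: {"f": {"s8": 0.5, "s9": 0.5}}}}   # new version: split f over s8 and s9

    ingress_version = 1            # flipped to 2 only after all switches hold version 2

    def forward(switch, flow, version):
        # A packet tagged with `version` is matched only against that version's rules,
        # so mid-installation it never mixes old and new splits along its path.
        return rules[switch][version][flow]
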
>>: What happens when the coordinator of the two-phase commit fails?
>> Harry Liu: What?
>>: What happens when the coordinator of that two-phase commit action fails?
>> Harry Liu: You mean here? Then the flow will be forwarded by the old rules.
>>: So you'll continue to – But the switches will continue to maintain at least the state of
the old and the new but they'll be routing using the old?
>> Harry Liu: Yes. Yes. Yes.
>>: How does it clean itself up? Is there – With the new state that's never going to be
used?
>> Harry Liu: Yes, yes. If the two-phase commit fails in some part, the plan still
guarantees that no congestion happens, but we cannot proceed because we have to make
it step by step. For one step, if we cannot finish the update on some switches, we just
wait there. And the solution is that if you find that this switch has a problem and cannot be
updated, we've got to calculate another plan that does not touch this switch. So this kind
of configuration failure can happen, and the result is that we get stuck in the middle of the
transition period. But we already guarantee that even if we're stuck there, there won't be
any congestion. The only problem is that we cannot reach the final target distribution and
we cannot finish the update. Yeah?
>>: I've gone blank. Okay, thank you.
>> Harry Liu: Where were we? We proved that if we are using two-phase commit, then
for each link and for each flow there are only two possible values during the transition:
either the old value or the new value. After considering the single-flow case, we consider the
multiple-flow case. Suppose f1 and f2 are both using two-phase commit to make the
update, and we still look at the link load from switch 7 to switch 8; now it's the sum
of the two flows' link loads.
And we know that it has four potential values during the transition, and generally we also
know that this kind of flow asynchronization can result in exponentially many
potential loads. To solve this problem we start from the observation that
despite the fact that this load value has four potential values, it should be no more than
the value of adding f1's maximum potential load and f2's maximum potential load. And if
we generalize this observation, we get the congestion-free transition constraint, which
says that for each link during the transition, the worst-case load is obtained by enumerating
all the flows and adding up all of their maximum potential loads. And if this worst case
[inaudible] the link capacity, it is congestion-free.
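In code form, a minimal check of this condition (the data layout is assumed for illustration) sums, per link, each flow's larger load between the old and the new distribution and compares it with the link capacity:

    def transition_is_congestion_free(old, new, capacity):
        # old/new: link -> {flow: load}; capacity: link -> capacity.
        for link, cap in capacity.items():
            flows = set(old.get(link, {})) | set(new.get(link, {}))
            worst = sum(max(old.get(link, {}).get(f, 0.0),
                            new.get(link, {}).get(f, 0.0)) for f in flows)
            if worst > cap:
                return False
        return True
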
Given this congestion-free condition, it is straightforward to show how to compute a
congestion-free transition plan. First of all we have a constant, which is the current
traffic distribution, and we also have some variables, which are the intermediate traffic
distributions and the final traffic distribution. And we have some constraints on our
variables. The first, of course, is that the final distribution meets the update
requirements of the target. The second is that along this [inaudible] all the adjacent
pairs of traffic distributions should satisfy the congestion-free requirement. And for
each individual traffic distribution, we require that it delivers all the traffic and
obeys flow conservation. Fortunately, all the constraints are linear, so we can use linear
programming to efficiently find the update plan.
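As a rough illustration of that linear program (a toy instance of my own, using the PuLP library; the real system's formulation and scale are different), two 6 Gbps flows share two 10 Gbps links and we want to swap which link each flow uses. The per-flow maximum in the congestion-free constraint is linearized with auxiliary variables, and we solve for one intermediate traffic distribution.

    from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpStatus, value

    DEMAND = {"f1": 6.0, "f2": 6.0}           # Gbps per flow
    CAP = 10.0                                # Gbps per link (two parallel links A, B)

    # Split ratio on link A per step (the rest of the flow goes on link B).
    # Step 0 is the current distribution, step 2 is the target; step 1 is the
    # intermediate distribution we solve for.
    splits = {
        0: {"f1": 1.0, "f2": 0.0},            # constants: current distribution
        2: {"f1": 0.0, "f2": 1.0},            # constants: target distribution
    }

    prob = LpProblem("congestion_free_transition", LpMinimize)
    splits[1] = {f: LpVariable(f"mid_{f}", 0, 1) for f in DEMAND}   # variables
    # No objective is needed; we only check feasibility of the constraints.

    # Congestion-free constraint between every pair of adjacent distributions:
    # the sum over flows of max(load before, load after) must fit on each link.
    for step in (0, 1):
        before, after = splits[step], splits[step + 1]
        for link in ("A", "B"):
            worst = []
            for f in DEMAND:
                m = LpVariable(f"max_s{step}_{link}_{f}", 0)        # linearized max
                for dist in (before, after):
                    frac = dist[f] if link == "A" else 1 - dist[f]
                    prob += m >= DEMAND[f] * frac
                worst.append(m)
            prob += lpSum(worst) <= CAP

    prob.solve()
    print(LpStatus[prob.status], {f: value(v) for f, v in splits[1].items()})
    # Any intermediate split between 1/3 and 2/3 per flow is feasible, whereas a
    # direct one-step move allows a worst-case load of 12 Gbps on a 10 Gbps link.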
After we compute the plan, we still face some challenges in implementing the update plan.
The biggest problem is that in data centers there are too many flows. If we try to
manipulate all of the flows, it typically means we have too many variables in our
computation and it takes too long. Second, it means we would install too many flow rules
into the switches, which cannot be accommodated by the limited table size. And third, we
would incur too much overhead. The solution is that we don't manipulate all of the flows;
we only manipulate some critical flows and leave the other flows on ECMP. This is based on
the observation that in data centers most of the flows work fine with ECMP, and only some
critical flows need to be tuned to realize the target requirement and to avoid congestion.
And we treat all the flows that traverse the bottleneck links as critical flows. We know
that in a short time period there are only a small number of bottleneck links in a data
center, so the number of critical flows is only a small fraction of the total number of
flows. Of course, we also consider what happens if some failure occurs during the
transition: we can either hold back or compute a new plan based on the current situation
and the [inaudible]. And we also consider how we handle traffic demand variation, because
the traffic demand prediction will have some errors, and we just pick the upper bound to
handle the variation. We also – Hi. Yes?
>>: How do you identify critical flows?
>> Harry Liu: First I pick all the bottleneck links. And all the flows that are traversing
these links will be the critical flows.
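A rough sketch of that heuristic (the link and flow objects and the headroom threshold here are my own hypothetical illustration, not the system's actual code): mark links whose worst-case load during the update could approach capacity as bottlenecks, and treat every flow traversing any bottleneck link as a critical flow to be explicitly re-weighted, leaving all other flows on plain ECMP.

    def pick_critical_flows(links, flows, headroom=0.9):
        # Links that could be congested during the update are the bottlenecks.
        bottlenecks = {link for link in links
                       if link.worst_case_load > headroom * link.capacity}
        # Any flow crossing a bottleneck link is critical; the rest stay on ECMP.
        return [flow for flow in flows
                if any(link in bottlenecks for link in flow.path)]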
>>: What if the new link becomes the bottleneck while you're doing the transition?
>> Harry Liu: By bottleneck we mean a link that has the potential to be congested. I treat
all of these kinds of links as bottleneck links: not only the current bottlenecks but also
some potential bottlenecks.
>>: Traffic is changing all the time. This is back to [inaudible]'s question. How do you
know for sure only these links will be the bottlenecks during your transition?
>> Harry Liu: Oh, I don't know for sure. I use some heuristics to find them. And I also
use a testbed and large-scale trace-driven simulations to evaluate. For the testbed we
used 22 switches to build a small topology. The switches are [inaudible] OpenFlow 1.0. The
link capacity is 10 gigabits per second. And we used a commercial traffic generator to
inject some stable traffic flows into the testbed. The scenario we play is that we want to
drain AGG1. This figure shows the real-time link load on the two bottleneck links in the
testbed, the blue one and the orange one. We can see that initially it is stable and there
is no congestion. During the transition from the initial to the intermediate distribution,
the link load changes; however, there is no congestion. And it is similar during the
transition from the intermediate to the final. However, if we do the transition directly
from the initial to the final, we see severe congestion.
And for the simulation we used a large-scale production-level topology. We used traffic
traces for the traffic demands. The scenario we played is that we introduced a new core
switch into the network, connected it, and directed some new flows to go through it.
>>: This is all within a data center, right? And presumably within a data center NTP
exists, so time synchronization exists. So why is it so hard to synchronize updates? Or
even if you allow them to proceed asynchronously, how much – I mean, does the condition
last for a long time? [Inaudible] I can understand that NTP is hard; you cannot
synchronize them. They can be off by [inaudible] hundreds of milliseconds. But within a
data center, do you not see that problem?
>> Harry Liu: I think another problem is that sometimes some switch becomes a straggler
when you reconfigure it, right? So NTP cannot help in this case. We did some straggler
research during last summer...
>>: This is different, right? You're only using ECMP and [inaudible] ECMP.
>> Harry Liu: No, no. I used [inaudible] ECMP. I always try to compute a weight, so it's a
weighted ECMP case. As a result – And here I will show you the link...
>>: Is this the last one?
>> Harry Liu: Almost, yeah. I'll show you the link loss rate when we use four different
approaches to perform the traffic transition: zUpdate; zUpdate-OneStep, which has the same
[inaudible] as zUpdate but omits all the intermediate steps; ECMP-OneStep, which always
uses ECMP and jumps in one step; and ECMP-Planned, which also always uses ECMP but tries
to find a particular switch change order to minimize the loss during the transition. The
first measure is the post-transition loss rate, which is the loss rate in the final TE
state. We can see that because zUpdate and zUpdate-OneStep use [inaudible] ECMP, they
don't have the kind of loss that the ECMP approaches have. The second thing I want to show
is the transition loss rate, the loss rate during the transition. You can see that
zUpdate-OneStep and ECMP-OneStep are very high because they omit all the intermediate
steps. We can also observe that ECMP-Planned is lower, but compared with zUpdate it still
has significant loss. And we show the time spent in transition as the number of steps. You
can see that zUpdate has two steps, which means it has only one intermediate step.
ECMP-Planned has many more steps because it needs to enforce the change order.
And here is the conclusion. I think the primary contribution is that we recognized that
various network updates share a common functionality, that is, SDT, and we designed
zUpdate to achieve that. And that's it.
[Applause]
>>: So any other questions?
>>: No, I think we asked enough.
>>: What are your next steps?
>> Harry Liu: I have a next step, yeah. For future work, I still believe that people
[inaudible] operating system for the network. And zUpdate might be – not zUpdate, but
maybe the traffic transition might be an important subroutine in this operating system,
and there might be some other subroutines. Actually, last year when I worked here, what we
were doing was essentially trying to realize some subroutines such as the updater and a
network state service. And we also talked about deep packet inspection and load balancing.
Another direction I want to go in is what challenges come from SDN itself. I treat the FFC
effort as how we tried to solve a problem that is brought by SDN, because Google and
Microsoft are trying to use SDN to highly utilize their networks. But the dark side is
that when the utilization is higher, the network is more vulnerable to all kinds of
faults. So that's why we studied how to handle the faults; that's where FFC comes from.
And I think what challenges are caused by SDN is another direction. But of course, for
Microsoft and other official work, I believe in FFC, and if there's a chance I would
[inaudible] effort to try and help SWAN's success.
[Applause]