>> Ming Zhang: Hello, everyone. It's my pleasure to introduce Harry Liu from Yale University. He has done internships at MSR three times, and he has done a lot of great work on software-defined networking. And I'll let him start. Harry. >> Harry Liu: Okay. Thanks, Ming, for the introduction and thank you guys for attending this talk. Today I will present my research from the recent two years. Actually it was done in this building, and it's about how we avoid congestion proactively under network dynamics. So the story starts from the trend that people are using more and more interactive applications. And it is a big challenge to support such kinds of applications because people are very sensitive to the delays in their interactions. We know that communication delay is one major part of the delay in interactive applications, and if we take a closer look at the communication delay we can see it includes the propagation delay, in-network queuing delay and re-transmission delay due to packet drops. Well, propagation delay typically is small and stable. But queuing delay and packet drops can be significant when congestion happens. Therefore, for a network provider that wants to provide better support for interactive applications, the essential question to answer is how to deliver the traffic without congestion. Well, this sounds like a very classical problem and it's not very hard to answer when the network is stable. However, we know that in reality the network keeps evolving due to various kinds of faults like link failures and maintenance activities such as device upgrades. [Inaudible] these dynamics, how to answer this question becomes tricky. Well, nowadays in the state of the art people propose a lot of intuitive ways to try to prevent congestion under dynamics. For example, for the unexpected faults they [inaudible] capacity overprovisioning and they try to leave a lot of room in the network. Hopefully this spare room can accommodate all the traffic spikes when the faults happen. And some of them put interactive delay-sensitive traffic into the high-priority queue in their switches. For another example, for the maintenance activities, they typically prefer to work in the off-peak hours. And before the maintenance they also make a lot of sophisticated maintenance plans, and hopefully they can reduce the impact to the applications. Well, these intuitive ways sound reasonable; however, if you ask me whether they work well, I will say they do not. The big reason is that most of them require a lot of overhead and they cannot even offer any guarantee. And this prevents us from offering a high-quality SLA for the applications. And why do we think the answer to this question is still in the dark? Fundamentally, it's that we don't have much foundational understanding of this question. So this is the major motivation of my Ph.D. research and my research at MSR. In the science space what I am trying to do is to provide some systematic understanding of how we avoid congestion under network dynamics. And based on this understanding I try to design, implement and evaluate some tools for the operators so they can manage their network traffic in a safe way. And in this talk I will show all of them to you. This is an overview of this talk, and it includes two projects. Each project centers on a single concept. The first concept I will show you is Forward Fault Correction, and it's designed for handling the dynamics caused by faults.
And the second concept is called Smooth Traffic Distribution Transition. It's designed for handling the dynamics caused by maintenance activities. First: Traffic Engineering with Forward Fault Correction. Let's first take a brief glance at traffic engineering with [inaudible]. So given a network such as this small network with link capacity of 10, when one group of hosts wants to send traffic to another group of hosts in another location, it actually creates a traffic demand. And the network can choose whether to support it and whether to fully or partially satisfy this demand. For example, G1 wants to send traffic at rate 10 to G2. The network decides to fully satisfy this demand and it also decides how to supply this, so it decides which paths it'll use and how much traffic each path carries. Of course sometimes the network can also partially satisfy a traffic demand. For example, even if G1 wanted to send to G2 at a rate of 100, the network can still give it 10 and decide how to deliver this 10. Overall, a quick summary: traffic engineering makes two decisions. First, the bandwidth each flow can send at, which we call the granted flow size. Second, how much traffic each path carries. Traffic engineering is very critical if we want to use the network efficiently. For example, suppose initially we only have two flows and we deliver them in this way. And later here come some new flows, and to accommodate more traffic with the existing network resources we need to update and adjust the TE to handle more traffic demands. And because the traffic demands are changing frequently, the TE also needs to be changed frequently. According to a report last year, on average the TE update interval at Google is around 2 minutes and at Microsoft around 5 minutes. All the TE systems have a very similar architecture. First they have a centralized TE controller, and it relies on the monitoring system to give it the current traffic demands and the network topology. And after the TE decision, it relies on the controlling system to translate the TE decision into all kinds of rules and configurations and push them into the network devices. Well, this kind of architecture is clean and simple; however, it has a very strong assumption. It depends on the reliability of the monitoring and the controlling. However, they are not 100 percent reliable. For example, we all know that link failures or router failures are common, and due to these failures the network topology is changing all the time. So that means that the TE controller always has an old version of the network topology. No one can guarantee [inaudible] second that the network topology is still valid. And another problem is that no one can guarantee that the controlling system can push all the rules in one batch and successfully configure all the devices every time, because sometimes configuration failures happen, which means the controlling system might fail to configure some devices. And both kinds of faults can cause severe congestion. Let's take a closer look at them. First let's take a look at the configuration failure. [Inaudible] when the TE controller wants to configure a router, it needs to go through some steps. Of course first it needs to go through the control network and send the configurations to the router. And the router's control plane accepts it, translates it into flow rules and inserts the rules into the data plane. During this process, a lot of factors can trigger a configuration failure.
For example, sometimes the TE controller loses connectivity to the router, and sometimes due to overloaded memory or CPU or software bugs, the control plane of the router stops working or stops translating the configuration. And sometimes the data plane is short of memory and rejects all the newly inserted rules. Last year Google reported that the rate of their configuration failures was around 0.1 percent to 1 percent, which is not a small number. And these kinds of configuration failures can cause congestion. For example, we've already seen this. To accommodate some new flows, we need to make this TE update and this shows what happens. And to finish this update we need to configure three switches, and this just shows what happens when one switch, s2, fails to update and is still using the old configuration. And this link will be congested. After the configuration failures, we consider hardware failures. We know that a lot of factors will trigger link failures or switch failures, such as unstable power, human misconfiguration or even a loose connection of a cable. We did a large-scale [inaudible] on Net8075 and we just wanted to find the frequency of the link failures. The results show that, for example, in a 2-minute interval the probability of a single link failure is above 10 percent. And in a 10-minute interval the probability of three link failures is about 1.4 percent, which is also not a small number. Well, hardware failures such as link failures and switch failures can also cause congestion. Suppose this is the initial state of the traffic engineering and the link from s2 to s4 fails. What happens next is that s2 will first quickly detect this failure and it will perform traffic rescaling, which means that it will disable all the tunnels going through this link and send all the traffic to the residual tunnels. As a result it sends all the traffic through this tunnel and congestion happens. Let's make a summary. So what we care about are the common faults in the networks, and they have two major categories. The first is configuration failures, which we call control plane faults, which mean that a router will take a long time or even fail to implement some new configurations. And the second type is hardware failures, or what we call data plane faults, which mean the link or router goes down, which shuts down all the tunnels going through it. Well, we want to handle these kinds of faults but we are facing a very big challenge; that is, at any given moment we are almost assured that some fault will happen, but we don't know exactly where the fault is. A straightforward way to handle this challenge is that we don't even predict the faults but just react to faults. However, this reacting approach has its own problems. It's not efficient. The first reason is that reaction always happens after the congestion and it cannot provide any guarantee against the congestion. And how much we suffer depends on how quickly we can fix it. But unfortunately sometimes it takes a long time to finish the reaction, because a reaction includes detecting the failure and recomputing the TE and updating the configurations all around the network. Sometimes this takes a long time. And what's more, sometimes the control plane failures also couple with data plane failures. For example – Please. >>: Oh, I'm sorry. Go ahead and finish your [inaudible].
>> Harry Liu: For example, some link goes down and we want to fix the congestion; however, a control plane fault happens and we cannot finish this reaction, so we will suffer congestion for a longer time. Since the reacting approach does not work, we think about the other side. So we think about a proactive approach. By proactive approach we mean that we want to design a TE that is not only congestion-free in the current situation but also robust to fault cases that can happen in the future. And that's why we introduced the concept of Forward Fault Correction in traffic engineering. And it says that.... >>: That statement is [inaudible] incomplete, right? You can do that if you have no resource limitations, right? You create – So the statement that you made has to be under some constraint of how many resources you may use with that. >> Harry Liu: Yes, yes, yes. I will make sure and [inaudible]. So all the resources I use to do that will translate into the overhead we pay in network throughput. So we sacrifice some network throughput. But we will show this kind of tradeoff later and we will find something more interesting. And that's why we defined this concept, and it says that we want to find a TE which spreads out the traffic over the network so that no congestion can happen as long as the number of faults is under K. Okay? >>: Do you characterize these router faults such that the failures happen all at once, or do they gradually happen? Like in disks, disks don't just fail right away; they usually send out some signal over time that they're failing [inaudible] and they're getting worse and worse. So there's an intelligence in the disks, smart data collection, where you can learn that the disk is going to go bad and take the [inaudible]. Is there something similar to that with networking as well? Is that where you're going? >> Harry Liu: Currently I think the failure means that the router goes down or stops working and we shut it down. >>: Does the router degrade or does it just go bad? I mean, is there a way for you to detect earlier that it's degrading forwarding traffic such that you can be smarter in probing and evaluating instead of waiting for a complete failure? >> Harry Liu: Currently we didn't observe that. From the logs what I observed is that it shuts down. And if it gradually goes down, I think it can be translated into link failures first. So... >>: I don't understand the last – When it gradually goes down it can be translated into the link failures. I think the question is that you could proactively do something if you're sort of querying these routers to see if they're getting into a bad state or not. So when you looked at the logs, did you actually analyze what led up to the failure or did you just say, "Okay, a failure is happening." >> Harry Liu: It only says the failure happened because the log is from SNMP. >>: So there's no performance even really when all you have is – Okay, fine. But what Ronnie's asking might actually be true if we had better measurement techniques, if we had more [inaudible]. >> Harry Liu: Right, right. >>: But in reality most of these failures are not caused by [inaudible]; they are mostly caused by software bugs. So if it happens, it happens. It was triggered by something. >>: I see. >>: It's a valuable [inaudible] we have not acted on that. We are treating them as [inaudible].
>>: Even if – You know, you might also notice that it may not be – there might not be any failures, but most management tools work on sort of thresholding things, right? So you sort of see that queues are starting to build up or it's not being – it's not acting fast enough. That is a symptom of why that is happening. I don't know if you've looked at that or not, but a lot of the tools actually provide that information. Like smart CMCs provide [inaudible]. >> Harry Liu: Okay. Currently in this model what's important is whether the router is problematic or not; if it's problematic, all the [inaudible] switches will try to move the traffic off that switch. So it's kind of --. >>: Not even with like the [inaudible]. >> Harry Liu: Of course we have different faults so we have different K. And maybe this echoes some similar concepts in other fields, and that's right. And we got this inspiration from FEC, Forward Error Correction, in data encoding, which says that the receiver can recover all the information as long as the number of lost packets is under K. Well, this similarity is not [inaudible]. Actually we're sharing the same insight, which says that, yeah, it's quite random and we don't know exactly where the individual faults happen. However, statistically the number of faults can be very small and stable. So if we achieve this concept in traffic engineering with a reasonable K, that means our TE can be robust to the majority of the possible cases that we can have in the future. >>: So Ming just made a comment that these failures happen because of software bugs. So if you're using the routers or the switches from the same vendor and they've all been updated, shouldn't they not be that random? Right? If there's a new software update that [inaudible]... >> Harry Liu: You know, to me... [Multiple inaudible comments] >>: ...[inaudible] one, right? >>: [Inaudible]. >>: Well, either one wins. >> Harry Liu: Yeah, I mean I think from my observation the failure of a router usually comes from the power. Sometimes it just keeps rebooting, or unstable power [inaudible] shutting down. So for [inaudible] sometimes it's triggered by some cases, so sometimes only some individual failure will happen, maybe. Okay... >>: I [inaudible] asking about correlated failures or like... >> Harry Liu: Yeah, yeah, yeah. >>: You haven't said what we do about [inaudible]. >> Harry Liu: And even if some failures are correlated with each other, as long as the number is not too large, so it's covered by K, it's still okay. Okay, FFC: this is an ambitious concept, and before we show how to realize it, let's take a look at some simple examples to show it does exist. In this example we show the FFC for control plane faults, and we've already seen this update before. We know that when s2 fails to update, it will cause some congestion. However, if we can be smarter we can have an FFC TE, and we can see that only – and this FFC is for K equal to 1. And the only difference between FFC and non-FFC is that on this link we only send 7 units of new traffic. And because initially the blue flow and the green flow each have 3 units of traffic on this link, if we send 7 here, then as long as the two switches do not fail together at the same time, we are okay. Of course we can increase the protection level. For example, with K equal to 2, in this TE we only send 4 units of traffic on this link, and because initially it has 6, that means that even if s2 and s3 both fail to update, we're still okay.
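To make the arithmetic behind this example concrete, here is a minimal brute-force check of the control-plane FFC property. The numbers are a reconstruction of the spoken example (link capacity 10, two old flows contributing 3 each through the switches s2 and s3 that must be reconfigured, and 7 or 4 units of new traffic entering at a third switch labeled s4 purely for illustration); the per-switch attribution is an assumption, not something stated explicitly in the talk.

```python
from itertools import combinations

def worst_case_load(new_contrib, old_contrib, k):
    """Worst-case load on one link when up to k switches keep their OLD config.

    new_contrib / old_contrib: dict switch -> traffic that switch places on the
    link under the new / old TE.  A switch that fails to update contributes its
    old amount instead of its new one; every other switch contributes the new one.
    """
    switches = sorted(set(new_contrib) | set(old_contrib))
    base = sum(new_contrib.get(s, 0) for s in switches)
    worst = base
    for r in range(1, min(k, len(switches)) + 1):
        for failed in combinations(switches, r):
            load = base + sum(old_contrib.get(s, 0) - new_contrib.get(s, 0) for s in failed)
            worst = max(worst, load)
    return worst

capacity = 10
old    = {"s2": 3, "s3": 3, "s4": 0}   # old TE: blue and green each put 3 on the link
new_k1 = {"s2": 0, "s3": 0, "s4": 7}   # FFC TE for K = 1: 7 units of new traffic
new_k2 = {"s2": 0, "s3": 0, "s4": 4}   # FFC TE for K = 2: only 4 units of new traffic

print(worst_case_load(new_k1, old, 1) <= capacity)  # True:  7 + 3 = 10
print(worst_case_load(new_k1, old, 2) <= capacity)  # False: 7 + 3 + 3 = 13 > 10
print(worst_case_load(new_k2, old, 2) <= capacity)  # True:  4 + 3 + 3 = 10
```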
And from this we can also see the trend that when we increase the protection level, the network throughput goes down. And it shows the fundamental tradeoff between robustness and utilization. And later we will show much more interesting results; for now we just see this tradeoff. And this case shows the data plane FFC. For example, this traffic engineering is non-FFC because if even one link fails, it will trigger congestion after the [inaudible]. However, if we just reorganize the traffic smartly, such as in this [inaudible] TE with K equal to 1, then no single link failure case will cause any congestion. For example, this one. After the rescaling there's no congestion, and because the topology and the [inaudible] are symmetric, you can easily verify that for any single link failure it has no congestion. Well, now let's consider how we realize the FFC concept. We are facing some large challenges. The first is that we are handling a huge number of fault cases. For example, the network has [inaudible] and we want to design a TE that's robust to any arbitrary combination of K link failures; that means we are handling [inaudible] K cases. And if we consider various types of faults together, this number will be even larger. And of course we need to consider the overhead in network throughput because, otherwise, there is a very simple solution: we send nothing into the network, so we reach [inaudible] robust [inaudible]. But of course that is not what we want. Here is the roadmap. We already introduced the background, the problem definition, and the challenges of FFC. Now we come to the practical realization of FFC. First, I want you to compare two kinds of constraints. The first, of course, is the FFC constraint. It says that arbitrary K faults cannot cause congestion. And the second type of constraint says that the sum of arbitrary K variables out of n is within an upper bound. Okay, they may seem to be [inaudible] with each other, but you can feel there are some similarities between these two kinds of constraints. And if you can feel this similarity, you can guess our key idea for solving this problem. Our key idea is that first we formulate the FFC constraint for each individual type of failure. And they have different forms. However, what we find is that we can transform them into a single format, which is the K-sum constraint. And the benefit of this transformation is that this kind of constraint can be efficiently solved by [inaudible] sorting networks. We can uniformly and efficiently solve all the types of faults for FFC. And in the next few slides I will first introduce this part because this is the core. We can equivalently transform the K-sum constraint into another form. What it's actually saying is that the largest sum of K variables out of n must be upper bounded. Well, when we have variables we actually don't know which is larger and which is smaller, so there is a straightforward way to express this kind of constraint. We try all the combinations of K variables, and then we will have [inaudible] constraints. And with a reasonable [inaudible], the computing time of the TE will be very large. However, what if I tell you that we have a way to equivalently compress this number of constraints into O(Kn) constraints? And as a result the computing time is within 1 second. So how can we do that? First let's be crazy enough to actually sort the variables.
Actually we can sort the variables, and first of all we can sort two variables. Given any two variables, we can introduce two new variables, Xmax and Xmin, and add two linear constraints. So no matter what X1 or X2 is, Xmax is always the larger one and Xmin is always the smaller one. And what we want to do is extend the two-variable case to n variables. It says that given n variables, we want to introduce a collection of new variables Y and a collection of constraints between X and Y so that Yj is always the jth largest element in X. And to achieve this we borrow ideas from sorting networks. This is an example of a sorting network, which is widely used in circuit design. Let's first look at the legend. It shows the gate, where there are two inputs and you always put the larger one of the inputs up and the smaller one down. And we can use this gate to organize a network that takes an arbitrary number of inputs, and the outputs are always the sorted version of the inputs. You can use the blue numbers to make a check, and it shows which [inaudible]. And we've already shown that for two variables we can achieve this kind of compare-and-swap gate. And if we organize the gates just as in the sorting network, we finally get outputs representing the sorted version of the inputs. Of course what we care about are only the largest K elements. So we don't have to sort all of them; we use bubble sort, and each time we bubble the largest out. And after K passes we already have the K largest variables, and it's done. And we know that the number of comparisons is under K times n. That's why I say the complexity is O(Kn). And here I want to mention Mohit; from discussions with him I got a lot of inspiration and finally reached this point. Okay, I have already finished this part and it is the core part. I will [inaudible] transformation [inaudible] because it introduces a lot of notation, but I want to mention that the key observation and the key contribution of this work is that we show that, for all the faults that we care about, we can always transform them into this single format, and we show that it can be solved uniformly and efficiently. Okay, previously we considered the faults one by one, or one type by one type. Finally we want to find a TE that can be robust to all of the faults at the same time. The solution is simple. We just add all the FFC constraints together and get a TE that is robust to all kinds of faults. This conclusion seems to be simple and intuitive, but how to prove it [inaudible] is not that straightforward, but today I will [inaudible] this part.
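As an illustration of the bubble-pass encoding just described, here is a minimal sketch in Python using the PuLP LP library. The compare-and-swap gate is the relaxed pair of variables from the explanation (a "hi" variable that is at least the larger input, with "hi" plus "lo" preserving the sum), and K passes of it bound the sum of the K largest variables with roughly K times n constraints. The helper name, variable names and the toy numbers are illustrative assumptions, not the paper's actual formulation.

```python
# pip install pulp
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, value

def add_k_sum_bound(prob, xs, k, bound, tag):
    """Add O(k*n) linear constraints forcing: sum of the k largest of xs <= bound.

    Each relaxed compare-and-swap gate introduces hi, lo with
        hi >= a,  hi >= b,  hi + lo == a + b,
    so hi is at least the larger input.  k bubble passes extract (upper bounds
    on) the k largest values, whose sum is then constrained.
    """
    rest = list(xs)
    extracted = []
    for p in range(min(k, len(rest))):
        carry, new_rest = rest[0], []
        for i, x in enumerate(rest[1:]):
            hi = LpVariable(f"hi_{tag}_{p}_{i}")
            lo = LpVariable(f"lo_{tag}_{p}_{i}")
            prob += hi >= carry
            prob += hi >= x
            prob += hi + lo == carry + x
            carry, new_rest = hi, new_rest + [lo]
        extracted.append(carry)
        rest = new_rest
    prob += lpSum(extracted) <= bound

# Toy use: allocate 4 flows on a resource of size 10 while any 2 of them
# together must stay under 6 (a stand-in for an FFC-style worst-case constraint).
prob = LpProblem("k_sum_demo", LpMaximize)
flows = [LpVariable(f"f{i}", lowBound=0) for i in range(4)]
prob += lpSum(flows) <= 10
add_k_sum_bound(prob, flows, k=2, bound=6, tag="link0")
prob += lpSum(flows)                      # objective: total granted traffic
prob.solve()
print([value(f) for f in flows], value(prob.objective))
```

The brute-force alternative would enumerate every pair of flows here, which is harmless at this size but grows as n choose K for a real network; the bubble-pass version stays at roughly K times n gates, which is the point of the O(Kn) claim.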
So we come to the evaluation. We use both a testbed evaluation and a large-scale trace-driven simulation. We used the testbed to emulate a global-scale network, and we used eight switches. And we mapped each switch to a location in the world, and we used geographical distance to simulate the delays along the links. >>: So I haven't gotten a sense yet – You were here at Microsoft with us, right? >> Harry Liu: Mmm-hmm. >>: How much of these links are actually under-provisioned for you to be able to do – So if all the links are completely saturated or close to saturation, your technique doesn't really matter? >> Harry Liu: Yeah, I will show it later in the evaluation, but what I want to say is that there are two observations. One, if the network is under-utilized, we will show that we prevent congestion and we don't actually have any loss in network throughput because the overload [inaudible]. And even in the case where we want to fully utilize the network, because there are different types of traffic, if we protect only the high-priority traffic with a high protection level, use a low protection level for the rest, and don't protect the [inaudible] traffic at all, we will show that we still gain a lot. So at least for the high-priority traffic, we don't lose any throughput but we reduce the packet loss to almost zero. >>: Right. But that comment is not – [Inaudible] you decide that you lose the lower-priority traffic and not the high-priority. I mean, you just reconfigured flows or whatever. But my question is actually – maybe where I'm going with this is that potentially you're saying that in order to get a certain amount of reliability from the network, you actually do have to – the links have to be – or in a sense you need to create more links even if they are not --. So you have a baseline at which you [inaudible] the entire network that makes sure that all the servers are going in line, right? >> Harry Liu: Right. >>: And [inaudible] not bottleneck. But beyond that you still probably need to wire more to provide this level of reliability. >> Harry Liu: Right, right. >>: Is that where you would like to go? >> Harry Liu: Sometimes we don't waste anything because, for example, for the high... >>: [Inaudible]. But anyway, go ahead. >> Harry Liu: For the high-priority traffic, we first lay out the high-priority traffic and it needs some spare capacity to prevent congestion. However, this kind of spare capacity can be used by the lower-priority traffic. And finally we still reach a high level of throughput, and when a link gets congested, what we drop is the lower-priority traffic. So what do I pay? [Inaudible] what do I pay? In the sense of network throughput, we don't pay too much. >>: Okay, [inaudible] my point. So maybe in particular data centers you have clarity of high-priority and low-priority. But in other places if you want to kind of use the same technique, for example in enterprise networks, it's not clear what is high priority and what is low priority. So in that case I don't know what you [inaudible]. >>: Right. So in that case how much does the network need to be over-provisioned for these techniques to work? >> Harry Liu: Let me show... >>: Or another way is how much do these links have to be under-utilized in order for this to work? >> Harry Liu: So if the network load, if the traffic demands are large enough, in our evaluation we pay about 10 percent throughput to gain this [inaudible] property. If the network load is not that high, for example, we scale it down by half. And under this load we don't lose any throughput. We can always fully deliver all the demands. It's only a matter of how we lay out the traffic so that it can be robust enough. >>: And going back to Victor's question: if there is no traffic prioritization and you've already highly utilized links... >> Harry Liu: [Inaudible] yeah. >>: Is there nothing you can do? I mean, what do people do in that case? >> Harry Liu: I think it's a very good question. And our argument is that without any priority, no one is running the network [inaudible] that high utilization. And if we only have one priority, people typically use over-provisioning to protect it. And our [inaudible] is to offer a better way to lay out the traffic to reach a better result. Okay. And with this testbed we want to show two things.
The first is that we want to show what exactly happens in real time after a failure happens. And we show why the FFC-TE can reduce data loss. So let's see a specific case that includes 5 switches. The link bandwidth is 1 gigabit per second and there are 2 flows, blue and green. First we show the FFC-TE with Ke equal to 1, and it's like this. So the blue flow uses two tunnels and equally splits the traffic. And the green flow also uses two tunnels and equally splits the traffic. I omitted a number here. And one link fails. What's going on? It's like this. So let's set the moment of the link failure as time 0. At first s6 will detect this failure and then it will report it to s3. And then, when s3 knows about this failure it will perform the rescaling. And... >>: Why did you choose this configuration? Is this a typical configuration? >> Harry Liu: This? >>: Yeah. >> Harry Liu: This is FFC – This is clearly showing why it's FFC because... >>: Well, obviously. That's why I'm asking. If you can come up with an example which really highlights your – So I'm asking why did you choose this? >> Harry Liu: Oh, because this is [inaudible]... >>: Is there any backing for this? Is there any reason to say that this is actually very typical? Or are you just doing it to show FFC? >> Harry Liu: Oh, I'm showing a very intuitive case to show what happens when a link failure happens, so I can give you a sense of how FFC works: what FFC can help with and what FFC cannot help with. So, for example, before the rescaling s3 does not know about this failure. As a result it will keep sending traffic into the tunnel, and all that traffic will get lost. So this packet loss cannot be reduced by FFC. However, what FFC guarantees is that after the rescaling there is no congestion. However, suppose we use a non-FFC TE where the only difference is that the blue flow uses this tunnel rather than this tunnel. When this link goes down, what happens? At first it's exactly the same as with FFC, and it also suffers from this loss. However, after the rescaling it stops the loss on the tunnel but it triggers another congestion on this link. That's the key difference between FFC and non-FFC. And if we take a closer look you can see the difference clearly. For example, for FFC it only takes us some time to do the rescaling, and after the rescaling it is done. There is no need to recompute the TE or update the network. No need. And if we use non-FFC, okay, after the rescaling there is congestion. If we want to fix it, we're going to spend a lot of time to do it. And if somehow we're lucky, we can fix the congestion quicker, but sometimes we're not. And it clearly shows why we would want to use the proactive approach rather than the reactive approach. For the larger-scale evaluation, we chose the topology of Net8075. And the traffic demands – we rescaled the traffic demands a little bit to show three different kinds of traffic loads. The first is the scaling factor 1.0, which means it's a well-utilized network. When the scaling factor is 0.5 it means it's an under-utilized network, or what you would call a well-provisioned network. And when the scaling factor is 2, it's an overloaded network. And for the failures we used a 1 percent failure rate for configuration failures, and link and router failures according to the real trace. And we compare only two TE algorithms: non-FFC -- the basic TE without FFC constraints -- and FFC.
First we show the result when we protect all the traffic with a single protection level. And here it shows the results. We use two kinds of metrics; the first is the throughput ratio, which means the FFC throughput normalized by the non-FFC throughput, so larger is better. And the second is the data loss ratio, so FFC's lost data normalized by non-FFC's lost data, so smaller is better. And we can see the results here. For a scaling factor of 0.5, which means the network is well-provisioned, we can see that we reduce the data loss by many times and at the same time we are very close to the optimal throughput. And of course when we enlarge the traffic scale, we suffer from loss of throughput. For example, for the scale equal to 1, we lose 10 percent throughput. Okay, we think this 10 percent is too much. We want to improve it. And to get the intuition for how we improve it, we first studied the trade-off between the throughput ratio and the data loss ratio. And here each point is just the solution at a specific protection level. For example, this is non-FFC, and FFC with Ke equal to 1, 2 and 3. And we can also see that this curve is also related to the traffic scale, and when the traffic scale is small we can see that we only pay a very small fraction of throughput but save a lot of data loss. And it occurred to me that, okay, in our network not all the traffic needs high-level protection. We need to differentiate. So we followed [inaudible] definition. We defined three types of traffic and we used three different protection levels to protect them. For the interactive traffic with high priority, we used a very high protection level. For the elastic traffic we used a medium protection level. And the background traffic we don't even protect. And as a result we get this. We can see that if we use the different protection levels: say, at a very high protection level the high-priority traffic has almost no loss. And at the same time it has the optimal throughput. And of course we can also observe that the medium type of traffic has much lower data loss and only suffers 2 to 3 percent throughput loss. For the throughput, it's similar for the low priority, but what we are talking about is the tradeoff, so there's no free lunch. And we can see what we pay is that we get higher low-priority traffic loss. However, this might be the cheapest thing we can pay. And it also shows that FFC actually gives [inaudible] that it can twist the tradeoff so that we reach the point where it benefits us a lot. And here we also show that if we only use a priority queue, it sometimes does not work, because if we use non-FFC, even if the high-priority traffic has the high-priority queue, it also suffers from loss. And if we use the priority queue and FFC together, we achieve the goal of protecting the high-priority traffic. >>: [Inaudible] such a big gap between the medium and the low. You go from under 20 percent to almost 100 percent? >> Harry Liu: This? >>: The one you have circled there. >> Harry Liu: Here? >>: And then the point before it or the bar is only about 16 percent and then you jump up to nearly 100 percent. It seems like a huge jump. Is there any intuition why it's so big? >> Harry Liu: You mean, between this and this? >>: Yeah. >> Harry Liu: Oh, this one is protected. So the medium priority is protected, so when some single link failure happens it won't have any congestion. However, the lowest level of traffic is not protected. >>: Ah, I see.
>> Harry Liu: Okay, here we come to the conclusion. First we propose FFC, and we believe this is necessary in large-scale traffic engineering because of all kinds of faults. And we designed an efficient algorithm to realize this concept. And then we also performed some evaluations. Especially, we point out [inaudible] if we differentiate the different types of traffic with different levels of protection, we can gain a lot. But network faults are only one source of network dynamics. The second source of dynamics is maintenance activities, especially in the data center network. From time to time people perform a lot of maintenance, so we also want to protect the network from congestion under this kind of maintenance. Sure. >>: Before we go on, I was just wondering: you made this comment earlier on that Microsoft's update interval is 5 minutes and Google's is 2 minutes. Can you say something about why we're not as good, or we're twice as bad as Google? >> Harry Liu: You know, Google – the number I reported is an average; Google actually updates the TE whenever something happens. So when some link goes down and when some link comes up, it's always reacting to it. That's why on average they are faster. But that does not mean that we cannot do that. I think we showed the curve that 5 minutes can be enough to reach a high utilization. >>: So I think Google and Microsoft [inaudible]. The Google number is from their production network. They react to all the failures versus [inaudible] right now it hasn't been deployed; therefore, we don't have the failure rate. Once you consider failures, it might be lower than 5 minutes. >> Harry Liu: I know. [Inaudible] you've already seen this a few times. Sorry for the redundancy. Let's begin. In a data center network, applications generate a lot of traffic and rely on the network to deliver it. So when the network changes, it also impacts the traffic distribution on the network. For example, sometimes we want to upgrade a switch, which requires rebooting; before that, we want to move all the traffic off this switch, do the rebooting, and after it finishes we want to move the traffic back. Sometimes we introduce a new switch into the network, and we want to redirect some existing flows to go through this new switch to make use of the new capacity. And while the network topology is changing, the applications are also changing their traffic demands. For example, if an application has virtual machines and these virtual machines have traffic flows with each other, when the application decides to migrate some virtual machines it also reshapes the traffic flows. And all the activities I mentioned here are network updates. And we can see that network updates are performed every day inside data centers; however, they are still a great pain for the operators. Let us study the story of an operator whose name is Bob. One day he wants to upgrade all the devices in his network. He knows that some applications might have some concerns when he performs it, so he first spends a lot of time negotiating with the applications. He makes a very detailed plan and even asks his colleagues to revise this plan. And then he chooses an off-peak hour to perform it; however, even though he exactly follows his plan, he still triggers some unexpected application alerts. And what's even worse is that some switch failures force him to backpedal several times.
Eight hours later Bob is still struggling because he has made little progress but received a lot of complaints from the applications. He has to stop now and becomes very upset. From this story we can see the problems in the state of the art when we plan a network update. First, the planning stage is too complex and we still face some unexpected performance issues. And it requires too much effort from human beings to perform it. Let's take a closer look, because we care about the applications when we perform updates. We first think about what the applications really want. What an application wants is, first, reachability and, second, low latency. In a data center we have multiple paths, so reachability is not a big issue. For low latency, as we mentioned, the key is that we don't have any congestion. However, making a congestion-free update is very hard because, as we will see later, even a small change will involve many switches. To avoid congestion we need to make a multi-step plan. We have different scenarios of updates to plan, and a plan you make for one scenario cannot be reused in another. And sometimes when you change the network the applications are also changing their traffic demands, so you've got to consider these interactions. Now I'll illustrate the high-level challenges when we want to achieve congestion-free updates. We take a case study which shows a very small-scale data center topology; all the switches are using ECMP and the link capacity is 1000. And we highlight this link because later we'll see this is the bottleneck link of the network, and there are three flows going through this network. First, a top-down flow with a flow size of 620. This type of flow [inaudible] goes through the network to the destination, so it puts a 620 load onto the bottleneck link. The second flow is a ToR-to-ToR flow. This kind of flow has multiple paths: because ToR1 is using ECMP, AGG2 receives part of it, and because AGG2 is also using ECMP, Core 3 gets 150. And after the traffic reaches the core, it will directly use the single path towards the destination. We can see the green flow puts a 150 load on the bottleneck link. The third flow is similar to the green flow; it goes through the network and finally it puts a 150 load on the bottleneck link. In summary we have a 920 load, so initially we don't have any congestion. And now we want to perform some updates; for example, we want to update AGG1. Before that we want to drain it. What we need to do first is reconfigure ToR1 to stop it from sending any traffic to AGG1. However, this simple reconfiguration will trigger congestion because it will redistribute the green flow over the network, and as a result the bottleneck link gets congested. This means that this simple solution does not work; we need a smarter solution. One smarter solution is that when we reconfigure ToR1 we also reconfigure ToR5, changing it from ECMP to weighted ECMP. And if it splits the traffic 500 and 100, it will redistribute the blue flow over the network. And as a result, we don't have any congestion. So let's make a summary. Initially we have one traffic distribution, and finally, to perform this upgrade, we want to give the network another traffic distribution. And both of them are congestion-free. The only question is whether the transition period is congestion-free. And we all know that to achieve this transition we need to configure two switches.
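To make the kind of load accounting in this case study concrete, here is a minimal sketch of how a flow's per-link loads follow from per-switch split weights; plain ECMP is just equal weights, and a weighted split like 500/100 is the sort of re-split that frees up a path being drained. The switch names, rates and the two-path fragment are illustrative assumptions, not the slide's exact topology or numbers.

```python
from collections import defaultdict

def link_loads(ingress, rate, weights):
    """Propagate one flow's rate through per-switch next-hop weights.

    weights: dict switch -> dict(next_hop -> weight).  ECMP is equal weights;
    weighted ECMP is anything else.  Returns dict (u, v) -> load.  Assumes a
    loop-free (DAG) forwarding graph, as in a Clos data center.
    """
    loads = defaultdict(float)
    at = {ingress: rate}                  # traffic currently sitting at each switch
    while at:
        nxt = defaultdict(float)
        for u, r in at.items():
            hops = weights.get(u)
            if not hops:                   # reached an egress switch
                continue
            total = sum(hops.values())
            for v, w in hops.items():
                share = r * w / total
                loads[(u, v)] += share
                nxt[v] += share
        at = nxt
    return dict(loads)

# Hypothetical two-path fragment: plain ECMP at the ingress ToR splits 600 units
# 300/300; the weighted variant pushes 500/100 instead.
ecmp     = {"ToR1": {"AGG1": 1, "AGG2": 1}, "AGG1": {"Core": 1}, "AGG2": {"Core": 1}}
weighted = {"ToR1": {"AGG1": 5, "AGG2": 1}, "AGG1": {"Core": 1}, "AGG2": {"Core": 1}}
print(link_loads("ToR1", 600, ecmp))
print(link_loads("ToR1", 600, weighted))
```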
It sounds very simple, but actually it's not, because the [inaudible] reason is that we cannot change the two switches at exactly the same time. In the face of asynchronization, one switch must be changed first. Let's study what happens if ToR1 is changed first. During the period when ToR1 is changed but ToR5 is not yet, we have congestion here. And one might think, "What if we change ToR5 first?" It will congest another link. So we omit this part. But you can already jump to the conclusion that even though initially and finally we don't have any congestion, there is no way to make the [inaudible] transition congestion-free. So what do we think about this? Can we introduce an intermediate traffic distribution that has the property that from the initial to the intermediate, and from the intermediate to the final, the transition is congestion-free regardless of asynchronization? But does this kind of intermediate state exist? Yes, and this is the example. We will show later how we find it, but now I want to point out that even with such a small-scale network where we only manipulate two flows, it's already not easy for us to come up with a congestion-free update plan. So think about the operators who are operating a thousand or more switches and millions of flows. Obviously they need a very powerful tool to save them from the complexity of making plans for network updates. This is the major motivation for why we introduced zUpdate. zUpdate stands between the operator and the network and it keeps monitoring the current traffic distribution. And when the operator wants to perform some update scenario, it translates the update requirements into constraints on the target traffic distribution. And zUpdate will find such a target traffic distribution and then some necessary intermediate traffic distributions to make the whole process smooth and lossless. And based on this it will configure the network automatically. And I think the one major contribution of this project, if we pick one thing, is that despite the fact that we have different kinds of scenarios, they are actually sharing a common functionality that's like a system call in the network, which we call Smooth Traffic Distribution Transition. And zUpdate just [inaudible] this. >>: Is this dependent on the fact that while you're trying to figure out what these intermediate plans are, the traffic distribution isn't changing? It has to essentially be the same from the time you start to when you finish? Is that an assumption that has to be made here? >> Harry Liu: Yes, it can change, but if it changes we use some upper bound to estimate the worst case of the flow size so that we can still guarantee this is congestion-free. >>: But there certainly could be cases where it changes enough that you, what, you have to bail out completely? Or are you guaranteeing that there will always be intermediate states that you can move to? >> Harry Liu: I think – So if the traffic prediction is not accurate, yes, we will suffer from [inaudible] congestion. However, if we do the transition fast enough, and if we make the upper bound conservative enough, we can have a congestion-free plan. The trick is that if you are very conservative when you predict the flows, you will compute more steps. So it will take a longer time to do the transition. Otherwise... >>: I mean, the longer it takes, the greater the probability that that distribution underneath you is changing. >> Harry Liu: Right. Right.
But because you are picking a very high upper bound for the traffic, when it changes [inaudible]. >>: The question [inaudible]. >>: What? >>: [Inaudible] take an upper bound but take a long time to do it so [inaudible]. >> Harry Liu: Yeah, yeah. >>: [Inaudible] whether the upper bound is [inaudible]. >> Harry Liu: If the upper bound is very large I have to use many steps to do it, and I have a maximum number of steps. So if the number of steps is too large, that means this time is not suitable for the update. So I just tell the operator, "Okay, hold on. I cannot do it now." And typically what we suggest is that we do it during the off-peak hours when the network has enough capacity and the traffic flows are not that large, so it typically takes a shorter time to do it. >>: So you have [inaudible]. >> Harry Liu: Sometimes, yes. We don't have to satisfy all the requirements. Sometimes you find out that the network's [inaudible] is too [inaudible] and we cannot do it. I just tell the operator, "We cannot do it." And to realize zUpdate, we are still facing some technical issues. In the rest of the talk I will first show you how we describe a traffic distribution. With this formulation we represent all the update requirements. We also define the conditions for a congestion-free transition. And with these conditions we show how we compute and implement an update plan. First we introduce the notation L_{v,u}^f, which means flow f's load on the link from switch v to u. For example, in this network and for the flow f, if s1 is using ECMP, the load value from s1 to s2 is 300. And if s2 is using ECMP, the load value from s2 to s4 is 150. And a traffic distribution is just the set of these values enumerated over all the flows and all the links. And with this formulation it is easy for us to represent the update requirements. For example, if we want to drain s2, we just simply require that the load from s1 to s2 is 0. >>: Something [inaudible]: are you assuming a Clos topology? Or is the technique more general [inaudible]? >> Harry Liu: The topology is generally for – actually for [inaudible] across this kind of data center network topology. And the constraint is not from here; the constraint comes in if the network is not symmetric and some paths are longer, some paths are shorter. So during the transition – and later we'll see – there's no guarantee that some link will be congestion-free. So that's the major concern. That's why we don't claim it works in a general topology, but it does work in the typical data center topology. >>: I mean, even in the topologies that you mention, there's a little bit of [inaudible]. You can have different [inaudible]. Say you have [inaudible] or something. Some of your network is [inaudible]. Requiring the topology to be completely symmetric is... >> Harry Liu: Oh, if they're using ECMP – So you mean even in the data center the path lengths are different? >>: They could be. I'm not saying it is. >> Harry Liu: If that's the case, rigorously we cannot prove that our [inaudible] works in that kind of topology. And our model works well in the typical [inaudible] Clos, with paths from a ToR to a non-ToR [inaudible]. >>: In both the presentations, although this is not finished, I'm seeing this in the previous one too, I think the way you have gone about it is you first said that "This is what we have and we're going to try to prove and try to get to this point."
Another way [inaudible] is to flip the whole problem around and say, "In order to give you certain guarantees, we expect these properties from the topology." So that way you can say, for example, how – going back to the previous problem – over-provisioned or under-provisioned, whichever way you define it, the link has to be. Similarly, this sort of thing is: what sort of redundancy do you need to ensure that you can always, 100 percent of the time, upgrade your switches whenever you want [inaudible]? So that, from a network design perspective, is a very interesting question, because then that sort of says, "I want to get 100 percent reliability, how much money do I have to pay for that?" Instead what I get here is: some of the time you can make it work when there is not a lot of traffic. Other times we maybe can make it work. You know what I mean? >> Harry Liu: Okay, I see. >>: I think the elements of the problems, you've already solved some of the elements of the problem, but just switch the way you look at the problem. >>: So in reality, because most maintenance work is done in off-peak hours, you can easily find a solution. People are unlikely to perform maintenance tasks during peak hours, or the operator just won't do it. So regarding the question... >>: But even in that situation the assumption is that there is actually a [inaudible]. That means 24-7 there are a certain number of hours. But my contention is, if you are global, there will not be a [inaudible] because you're all the time using [inaudible]. >>: So this is per data center. And also even when there is no [inaudible], today when you perform an upgrade you know there will be some performance disruption but you don't know how bad it could be. And with this framework it will actually not only tell you whether it will be congestion-free, but it can also tell you how much loss you will incur if you perform it now. And you decide whether you want to do it based on how severe the congestion would be. >>: I know. I get that part. I'm just saying, if I was willing to throw money at the problem to get 100 percent reliability – But of course I wouldn't throw [inaudible]. I want to spend the bare minimum money but I still want 100 percent reliability [inaudible]. And what would that topology be? >>: So based on today's traffic... >>: [Inaudible]. >>: Yeah. Based on today's traffic characteristics you don't need to pay anything in terms of actual capacity to perform this, because naturally they're just traffic variations. In terms of FFC, as you said, if you look at this one type of iDFX style network [inaudible] there are different priorities. You can trade off the loss of low-priority traffic to gain reliability for high-priority traffic. This is the actual benefit you can get; unless you think they are equally important, use the TE scheme. >>: So if the network is computing [inaudible], right, then updates are easy. I mean there is no problem because there won't be any loss. >>: So even if the average network utilization is low, there will still be some [inaudible] link. >>: Okay. >> Harry Liu: Regarding the first question, we did have a definition of what kind of network topology zUpdate fits. It's in the paper, but I didn't bring it into the presentation just because it's too detailed. But that's a good question [inaudible] network topology. Let's go on? >>: Yeah, keep going. >> Harry Liu: Okay. >>: You need to speed up. You may have less than 15 minutes. >> Harry Liu: It's enough, I think.
Yeah, if we want to drain s2, we simply require that the load from s1 to s2 is 0, and generally we require that, you know, all the flows on all the switches put nothing onto s2's incoming links. For another example, when s2 recovers we want to reinstall ECMP over the network, and we just require that s1 equally splits the traffic. And generally we require that, for every flow, each switch equally splits its traffic among all of its outgoing links. I have only shown you two simple examples; in the paper we formulate all the common update scenarios that we have mentioned. And now we consider the transition. To achieve a traffic distribution transition, what we do is install new rules into the network, and because of asynchronization it takes a period of time to finish the installation. And during that period there will be cases where some switches are updated and some switches are not. And when we look at the link load from switch 7 to switch 8 we can see that there are 5 switches which do not impact the load on this link. And generally we know that asynchronization across switches creates too many potential link loads, which prevents us from analyzing whether congestion will happen during the transition. And to handle this problem we use two-phase commit, which means that when we install the new rules we still keep the old rules in the switches, and once the installation is finished we instruct the ingress switch to make a version flip, which means that it will tag the incoming packets with a new version tag, and these packets will be treated by the new rules. I think this part also answers the question of why we require this topology and why it does not work on a general topology. >>: What happens when the coordinator of the two-phase commit fails? >> Harry Liu: What? >>: What happens when the coordinator of that two-phase commit action fails? >> Harry Liu: You mean here? Then the flow will be forwarded by the old rules. >>: So you'll continue to – But the switches will continue to maintain at least the state of the old and the new, but they'll be routing using the old? >> Harry Liu: Yes. Yes. Yes. >>: How does it clean itself up? Is there – With the new state that's never going to be used? >> Harry Liu: Yes, yes. If the two-phase commit fails in some part, the plan still guarantees that no congestion happens, but we cannot proceed because we have to make it step by step. For one step, if we cannot finish some update on some switches, we just wait there. And the solution is that if you find this switch has a problem and cannot be updated, we've got to calculate another plan that does not touch this switch. So this kind of configuration failure can happen, and the result is that we're stuck in the middle of the transition period. But we already guarantee that even if we're stuck there, there won't be any congestion. The only problem is that we cannot reach the final target distribution and we cannot finish the update. Yeah? >>: I've gone blank. Okay, thank you.
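Pulling the notation together, here is a minimal sketch of how a traffic distribution and an update requirement like "drain s2" can be written down. The switch and flow names and the numbers are illustrative, loosely following the ECMP example above rather than taken from the slides.

```python
# A traffic distribution in the talk's notation: dist[f][(v, u)] is flow f's
# load on the link from switch v to u.
dist = {
    "f": {("s1", "s2"): 300, ("s1", "s3"): 300,
          ("s2", "s4"): 150, ("s2", "s5"): 150,
          ("s3", "s4"): 150, ("s3", "s5"): 150},
}

def satisfies_drain(dist, drained):
    """Drain requirement: no flow places any load on a link INTO the drained switch."""
    return all(load == 0
               for flow_loads in dist.values()
               for (v, u), load in flow_loads.items()
               if u == drained)

print(satisfies_drain(dist, "s2"))   # False: f still sends 300 on s1 -> s2
```

In the actual system these requirements become linear constraints over the load variables rather than checks over fixed numbers, but the bookkeeping is the same.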
And we know that it has four potential values during the transition, and generally we also know that this kind of flow asynchronization can also result in exponentially many potential loads. And to solve this problem we start from the observation that says that despite the fact that this load value has four potential loads, it should be no more than the value of adding f1's maximum potential load and f2's maximum potential load. And if we generalize this observation, we get the congestion-free transition constraint which is saying that for each link during the transition the worst case load is that we enumerate all the flows and add up all the maximum potential loads together. And if this worse case [inaudible] link, it is congestion-free. And if we have this congestion-free condition, it is obvious to show how to compute a congestion-free transition plan. First of all we have a constant which is the current traffic distribution, and we also have some variables which are the intermediate traffic distribution and the final traffic distribution. And we have some constraints of our variables. First of course are the update requirements to the target. And the secondly is that it requires that along this [inaudible] all the adjacent traffic distribution pairs should satisfy the congestion-free requirement. Of course for each individual traffic distribution, we require that all of them should deliver all the traffic and that they should obey the flow the conservation. Fortunately all the constraints are linear so we can use linear programming to efficiently find the update plan. After we compute the plan, we are still facing some challenges to implement the update plan. And the biggest problem is that in data centers there are too many flows, and if we target it to manipulate all of the flows, it typically means that we have too many variables in our computation and it takes too long of a time. Secondly it means that we will update either too many flow rules into the switches which cannot be accommodated by the limited table size. And the third is that we will acquire too much overhead. The solution is that we don't want to manipulate all of the flows; we only manipulate some critical flows and leave other flows just ECMP. This is based on the observation that in data centers most of the flows just work fine with ECMP and only some critical flows need to be tuned to realize the target requirement and to avoid congestion. And we just treat all the flows that are traversing the bottleneck links as critical flows. We know that given a short time period there are only a small number of bottleneck links in your data center, so the number of critical flows is only a small fraction in the total number of flows. Of course we also consider what if some failure happens during the transition: we can either hold back or compute a new plan based on the current situation and the [inaudible]. And we also consider how we treat the traffic demand variation because the traffic demand prediction will have some errors. And we just pick the upperbound to handle the variation. We also – Hi. Yes? >>: How do you identify critical flows? >> Harry Liu: First I pick all the bottleneck links. And all the flows that are traversing these links will be the critical flows. >>: What if the new link becomes the bottleneck while you're doing the transition? >> Harry Liu: By bottleneck we mean that it has the potential ability of being congested. 
I treat all these kinds of links as bottleneck links, not only the current bottlenecks but also some potential bottlenecks. >>: Traffic is changing all the time. This is back to [inaudible]'s question. How do you know for sure only these links will be the bottlenecks during your transition? >> Harry Liu: Oh, I don't know for sure. I use some heuristics to find them. And I also use a testbed and large-scale trace-driven simulations to evaluate. For the testbed we used 22 switches to build a small topology. The switches are [inaudible] OpenFlow 1.0. The link capacity is 10 gigabits per second. And we use a commercial traffic generator to inject stable traffic flows into the testbed. The scenario we play is that we want to drain AGG1. This figure shows the real-time link load on the two bottleneck links on the testbed, the blue one and the orange one. We can see that initially it is stable and there is no congestion. During the transition from the initial to the intermediate distribution, the link load changes; however, there is no congestion. It is similar during the transition from the intermediate to the final. However, if we do the transition directly from the initial to the final, we see severe congestion. For the simulation we used a large-scale production-level topology, and we used a traffic trace for the traffic demands. The scenario we played is that we introduced a new core switch into the network, connected it, and directed some new flows to go through it. >>: This is all within a data center, right? And presumably within a data center NTP exists. So time synchronization exists. So why is it so hard to synchronize updates? Or even if you allow them to flow asynchronously, how much – I mean the condition lasts for a long time. [Inaudible] can understand that NTP is hard; you cannot synchronize them. They can be [inaudible] hundreds of milliseconds. Within a data center, you do not see that problem? >> Harry Liu: I think another problem is that sometimes some switch becomes a straggler when you reconfigure it, right? So NTP cannot help with this case. We did some research on stragglers last summer... >>: This is different, right? You're only using ECMP and [inaudible] ECMP. >> Harry Liu: No, no. I used [inaudible] ECMP. I always try to compute a weight, so it's a weighted ECMP case. As a result – And here I will show you the link... >>: Is this the last one? >> Harry Liu: Almost, yeah. I will show you the link loss rate when we use four different approaches to finish the traffic transition: zUpdate; zUpdate-OneStep, which has the same [inaudible] as zUpdate but omits all the intermediate steps; ECMP-OneStep, which always uses ECMP and jumps in one step; and ECMP-Planned, which always uses ECMP but tries to find a particular switch change order to minimize the loss during the transition. The first metric is the post-transition loss rate, which means the loss rate in the final state. We can see that because zUpdate and zUpdate-OneStep use [inaudible] ECMP, they don't have the kind of loss that ECMP has. The second metric is the transition loss rate, the loss rate during the transition. You can see that zUpdate-OneStep and ECMP-OneStep are very high because they omit the intermediate steps. We can also observe that ECMP-Planned is lower, but compared with zUpdate it still has significant loss. And we show the time spent in transition by the number of steps.
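As a rough illustration of the critical-flow heuristic described above, here is a small Python sketch. The inputs and the utilization threshold are hypothetical assumptions of mine; the talk only says that flows traversing bottleneck links, current or potential, are treated as critical.

```python
# Sketch of critical-flow selection (hypothetical inputs and threshold).

def bottleneck_links(load_now, load_target, capacity, threshold=0.9):
    """A link counts as a (potential) bottleneck if its utilization is high
    under either the current or the target traffic distribution."""
    links = set()
    for link, cap in capacity.items():
        load = max(load_now.get(link, 0.0), load_target.get(link, 0.0))
        if load >= threshold * cap:
            links.add(link)
    return links


def critical_flows(flow_links, bottlenecks):
    """flow_links maps a flow id to the set of links its ECMP paths traverse.
    Only flows touching a bottleneck link are handed to the linear program;
    the rest keep their default ECMP splits."""
    return {f for f, links in flow_links.items() if links & bottlenecks}
```

Keeping the non-critical flows on plain ECMP is what keeps both the number of variables in the linear program and the number of rules pushed into the limited switch tables small.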
You can see zUpdate has two steps, which means it has only one intermediate step. ECMP-Planned has many more steps because it needs to enforce the change order. And here is the conclusion. I think the primary contribution is that we recognize the various network updates share a common functionality, that is, SDT, and we designed zUpdate to achieve it. And that's it. [Applause] >>: So any other questions? >>: No, I think we asked enough. >>: What are your next steps? >> Harry Liu: I have a next step, yeah. For the future work, I still believe that people [inaudible] operating system for the network. And zUpdate might be – not zUpdate, but maybe the traffic transition – might be an important subroutine in this operating system, and there might be some other subroutines. Actually, last year when I worked here, what we were doing was essentially trying to realize some of these subroutines, such as the updater and the network state service. We also talked about deep packet inspection and load balancing. Another direction I want to explore is what challenges come from SDN itself. I treat the FFC effort as an example of how we try to solve a problem brought by SDN, because Google and Microsoft are trying to use SDN to highly utilize their networks. But the dark side is that when the utilization is higher, the network is more vulnerable to all kinds of faults. That is why we studied how to handle the faults, and that is where FFC comes from. So I think studying the challenges caused by SDN is another direction. And of course, for Microsoft and other official work, I believe in FFC, and if there's a chance I would [inaudible] effort to try and help SWAN's success. [Applause]