>> Srikanth Kandula: It's a pleasure to introduce Minlan Yu from Princeton. She graduated just five years back from Peking University in China, so she is straight out of grad school. Minlan has interned in the past with us, with Albert, Dave, and Lee in Bing. She has built a measurement infrastructure that allowed her to look at TCP-level stacks and collect a fair amount of integrated network and host measurements that led to discovering TCP anomalies. In the same vein she has done a bunch of work, and she's perhaps best known for doing cute algorithms in routers and switches that enable better measurement, better access control policies, that kind of stuff. All right. I'll let her [inaudible].
>> Minlan Yu: Okay. Thank you for the introduction. It's my great pleasure to give the talk here. The talk is about edge networks, those networks in companies, campuses, data centers and homes. Recently there has been growing interest in these edge networks, but mostly people focus on how to make these networks bigger, faster and more efficient. What I'm going to focus on today is the management of these edge networks. According to a Yankee Group report, in most enterprises today network management takes about 80 percent of the IT budget but is still responsible for more than 60 percent of the outages. So management is an important topic, but it's still under-explored in these data center and enterprise network areas. In these kinds of networks, ideally we hope to make the users and applications the first priority and the network only secondary. We want to make the network as transparent as the air we breathe, so that it can transparently support a diversity of applications and different requirements from users. But of course today's networks are far from this goal. So instead of working on how to make today's networks easier to manage, we focus on how to redesign these networks so that they are easier and cheaper to manage. There are three challenges in managing these edge networks. The first challenge is that these networks are growing larger and larger every day, with lots of hosts, switches, and applications. On the other hand, people hope to enforce a lot of flexible policies for different management tasks like routing, security and measurement. Finally, we only have simple switches in hand; these switches have constraints on cost and energy. So how do we use these simple switches to support both large networks and flexible policies? In this talk I'm going to present systems that address all three of these challenges at the same time. First let's look at two examples of large networks. The first is large enterprise networks. In these networks we usually have tens to hundreds of thousands of hosts and thousands of switches connecting them. On these hosts people may run a diversity of applications, and these applications may have different traffic matrices and different performance requirements. Data center networks are much larger than those enterprise networks. In data centers we usually have virtual machines running on the servers, so in total there could be millions of virtual machines, and these virtual machines can move from one server to another. We also have different layers of switches connecting these servers together, and on these servers people may run different applications. Yes. Question?
>>: Why does the number of applications matter?
If these were -- if these were like different flows, is that what's going on, or is it just the presence of different applications that is somehow --
>> Minlan Yu: So today people not only want to manage the switches and servers but also the applications. They want to provide different routing for these applications, different QoS and access control for these applications, or improve the network performance for these applications. So if you have a huge number of applications, each with different traffic matrices or different performance requirements, it's a challenging problem. And it's even more challenging in the data center than in the enterprise environment, because data centers have more diverse traffic patterns and much higher performance requirements. And on top of supporting these large networks, operators today also want to enforce flexible policies. Today's network operators have many considerations: they want to improve network performance, improve security, support host mobility, save energy, save cost, and make the network easier to debug and maintain. In order to achieve all these management goals, operators usually have different kinds of policies. The first example of a policy is that operators may want to provide customized routing for different applications, for example lower-latency paths for realtime applications. They may also want to enforce access control rules in the switches to block the traffic from malicious users. Finally, for counting and measurement, they may want to measure the amount of traffic from different users and different applications. But on the other hand, we only have simple switches. Today's switches have to handle increasing link speeds; today's data centers usually have 40 gigabit-per-second links or even more. That means we can only process each packet within tens of nanoseconds. In order to process packets at this speed we can only rely on very small on-chip memory. This memory is very different from the memory we have in our laptops, because it is very expensive and power hungry, and as a result we can only afford a small amount of it. But on the other hand, operators want to store lots of state in this small memory. Just to support forwarding rules, we have millions of virtual machines in data centers; that means we may need to install one forwarding rule for each host at each switch, which could consume a lot of memory. At the same time we also need to use this memory to enforce access control rules and quality of service for different applications and users. Finally, we may want to maintain different counters for specific flows; that also requires memory. So how can we address this challenge of using this small memory to store this large amount of state? We built a system to manage these edge networks. In this system the operators first specify the policies in the management system, and then the management system configures the devices, both the switches and the end hosts, to express these policies. This management system also collects measurement data from the network to understand the network conditions. So we built two systems. The first system, DIFANE, works on the switches; it basically focuses on how to enforce flexible policies on the switches. The second system is called SNAP; it focuses on the other direction, how to scalably diagnose performance problems on the end hosts.
So a common approach I take in both DIFANE and SNAP, and many of my other works, is that I first want to identify those practically challenging problems. Then I want to design new algorithms and data structures. For example, in the DIFANE work I provide new data structures that make effective use of the switch memory, and in the SNAP work we provide new, efficient data collection and analysis algorithms. And these algorithms and data structures wouldn't be enough if we couldn't validate them with real prototypes. So in the DIFANE work we actually built prototypes with OpenFlow switches, and in the SNAP work we built prototypes on both Windows and Linux operating systems. Finally, we hope to make real-world impact with these prototypes, so we actually collaborate with industry. For the DIFANE work we collaborated with AT&T and evaluated our prototype using its network configuration data. And for the SNAP work, we worked with the Bing group in Microsoft and have already deployed the tool in a production data center in Bing. So first let me talk about DIFANE. DIFANE's basic problem is: how do we scalably enforce flexible policies on the switches? If we look at traditional networks, when we buy a switch we usually buy two functions bundled together: the data plane, which is the hardware part of the switch that actually forwards the packets, and the control plane, which is the brain in the switch that runs protocols to decide where the traffic should go. Since these two parts are bundled together in the switches, they are both limited. For example, in the data plane, different vendors may provide different hardware that only supports a limited set of policies. And the control plane, since it is already designed when we buy the switch, is very hard to manage and configure. So to manage traditional networks today, operators only manage the network in an offline fashion, and sometimes they go through each switch independently and try to manage and configure it. To improve the management of these traditional networks there have been two new trends. The first trend is in the data plane: flow-based switches that support more flexible policies. The second trend is in the control plane: a logically centralized controller to make management easier. I will talk about these two new trends in more detail. The first trend is the flow-based switches called OpenFlow switches. These switches have already been implemented by different vendors like HP and NEC, and they have already been deployed in different campus and enterprise networks. What these switches do is basically perform simple actions based on rules. The rules basically match on certain bits in the packet header, and the actions could be to drop packets from a malicious user, forward a packet to a particular port, or count the packets that belong to Web traffic. All these rules can be represented in the flow space I draw here. Here I'm showing a two-dimensional flow space: the X axis is the source and the Y axis is the destination. This red horizontal rule shows that I want to block traffic to a particular destination, and this blue line shows that I want to count packets from a particular source. In real switches these wildcard rules are stored in a specific memory called TCAM. There are two features of TCAM. The first feature is that TCAM supports wildcards.
For example, the first rule here shows that X can be a wildcard and Y should be the exact value one, and that describes this red rule here. The second feature of TCAM is that it has different priorities: if a packet matches multiple rules in TCAM, only the rule with the highest priority takes effect. And in today's switches there is only a limited amount of TCAM, because TCAM is on chip and very power hungry, so it's very important for us to reduce the number of rules we store in this TCAM memory. The second trend is that we want a logically centralized control plane to make management easier. So now we have this big centralized brain where operators can easily configure their policies in a central place, and then this brain takes the job of deploying the policies to all the switches. This makes management easier, but it's not scalable. And if the controller fails, then the whole network may be out of control. So in order to solve this problem, we present DIFANE. It's a scalable way to enforce fine-grained policies in the switches. DIFANE's idea is basically to pull some lower-level functions of the brain back into the switches, and by doing that you can achieve better scalability. Before I talk about how we enforce policies, we can think of what is the most intuitive way to enforce policies from the brain to the switches. The first solution is that we can just pre-install the rules in the switches. The controller pre-installs the rules, and when a packet arrives at the switch, the switch can match the packet against the rules and process it. But the problem here is that it doesn't support host mobility well: when a host moves to a different place, all the rules associated with this particular host need to be installed in all the switches that the host may connect to. And usually a switch doesn't have enough TCAM memory to store all the rules in the network. A different solution was proposed by Ethane's work. Its basic idea is that we can install the rules on demand. When the first packet arrives at the switch, the switch doesn't have any rules, so it buffers the packet and sends the packet header to the controller. Then the controller looks up its rule database, finds the rules and installs them in the switch, so that the following packets can match these rules and get processed. But the problem here is that the packet takes an extra long way, going through the controller and coming back. Based on our experiments, even when the controller is near the switch and the controller only has a small number of rules, it still takes about 10 milliseconds to go from the switch through the controller and come back. There are two --
>>: Did you say milliseconds?
>> Minlan Yu: 10 milliseconds.
>>: Really?
>> Minlan Yu: There are two --
>>: That's really a long time.
>> Minlan Yu: Yes, that's really long. There are two contributors to this long delay. The first is the switches: the switch only has a very small CPU, so it can't do this kind of buffering and sending at a very fast speed, so there is a lot of delay there. And the other delay is from the controller, which needs to look up the rules, find the right ones and then install them.
>>: [inaudible]. This sounds like a disc [inaudible] latency. Is the controller [inaudible].
>> Minlan Yu: In our experiment, since we make the controller really simple, most of the delay comes from the switches.
>>: Comes from the switches?
>> Minlan Yu: Yes.
>>: Okay.
>> Minlan Yu: Yeah.
So if the controller gets complex, it will add even more delay. We also found it's hard to implement with today's switches because of the increased complexity. The switch can't just use a first-in-first-out buffer, because it has to buffer the packets while continuing to handle other packets that match the rules already in the switch. And once it receives some rules from the controller, it has to find a way to retrieve the packet from the buffer. Finally, if there are some malicious users who send traffic to a variety of destinations through a variety of ports, then the switch has to keep asking the controller for new rules, which makes all the packets that arrive at the switch experience this extra long delay and also increases the overhead at the controller. So we wanted a new solution that addresses the problems we've seen with these two approaches. We hope DIFANE, the project we built, can scale with the network growth naturally: it should address the limited TCAM entries in the switches and the limited computing resources at the controller. We also hope to improve the per-packet performance, so that we can always keep the packets in the data plane to reduce the delay of each packet. Finally, we hope to have minimal modifications at the switches, so that we don't need to change the data plane hardware and it's easy to deploy DIFANE. Our idea is very simple. We just want to combine the proactive approach and the reactive approach together, so that we can have better scalability, higher throughput and lower delay. DIFANE's design has two stages. In the first stage, the controller proactively generates the rules and distributes them to some authority switches. In our design we still have a centralized controller, so it's still easy for operators to configure the policies there. The controller represents these rules in the flow space, then partitions this flow space into multiple parts and assigns each part to one authority switch. At the same time, the controller distributes some partitioning information to all the switches in the network, so each switch knows how to handle packets: when the switch receives a packet, it knows which authority switch to ask for the rules. In the second stage, since the rules are already preinstalled in the authority switches, we can use the authority switches to keep the packets always in the data plane and then reactively cache the rules. Here's how it works. When the first packet arrives at the ingress switch, the ingress switch redirects the packet to the authority switch, and the authority switch processes the packet according to the rules. At the same time, the authority switch installs some cache rules, so that the following packets can get processed locally at the ingress switch and directly forwarded. So we can see the key difference between DIFANE and the previous reactive approach: now the packets traverse a slightly longer path in the data plane instead of being sent through the slow control plane. So -- yeah?
>>: [inaudible] if you have no policy for the first packet it automatically redirects to the authority switch?
>> Minlan Yu: Yes. So if there are no cached rules to handle the first packet, then the ingress switch will redirect the packet to the authority switch. Yes.
>>: This is a detailed question, but are you doing like IP encapsulation here or redirection?
>> Minlan Yu: So there are different solutions. The first solution is you can just use IP encapsulation to do this kind of tunnelling from the ingress switch to the authority switch and to the egress switch.
>>: And doing a lot of that still tends to be faster?
>>: 10 milliseconds. She [inaudible] budget.
>> Minlan Yu: It's much faster than that. And today's switches don't do this kind of encapsulation in hardware, so we hope to use VLAN tags to improve the performance. Yes.
>>: [inaudible] you said that the reason you had 10 millisecond delay [inaudible] came from the switch.
>> Minlan Yu: That's in our evaluation, because we have a really simple controller.
>>: Okay.
>> Minlan Yu: Yeah.
>>: So suppose you also had a similar kind of -- so you had a switch which was not -- which did have a powerful CPU.
>> Minlan Yu: Uh-huh.
>>: Now you're going to ask that switch to do a lot more. You're going to ask it to not just make a query to the controller but you're going to ask it to do MAC-in-MAC or IP-in-MAC encapsulations.
>> Minlan Yu: Uh-huh.
>>: And a rule lookup. And that's going to take less than 10 milliseconds?
>> Minlan Yu: The rule lookup the switch really has to do anyway.
>>: Okay.
>> Minlan Yu: If it has --
>>: That's fine. So you're saying that encapsulation takes less time than sending a query to the controller?
>> Minlan Yu: That I'm not sure about, but based on the software evaluation it's faster than buffering the packet and sending it to the controller. And we have a solution that uses VLAN tags, which can be done in hardware rather than full encapsulation, to further improve that. The challenge of using VLAN tags is that the authority switch doesn't know the ingress switch address anymore, so although we can implement the tunnel with VLAN tags, we can't implement the rule caching scheme with VLAN tags. In that case we can move the caching scheme back to the controller.
>>: So the -- the thing that you're improving on [inaudible] is just the first packet delay.
>> Minlan Yu: Yes.
>>: Nothing else?
>> Minlan Yu: So basically the first packet delay benefits from moving from software to hardware. And the second thing we improve is scalability: it's distributed instead of centralized, so it scales better with larger networks. As networks grow much larger in the long run, a single centralized controller can't keep up.
>>: [inaudible].
>> Minlan Yu: For now we are looking at [inaudible] networks with thousands of switches. And for the delay, in the evaluation you will see that a single centralized controller can only process like 50K flows per second. So that means for a thousand --
>>: [inaudible] new flows per second, right?
>> Minlan Yu: New flows per second.
>>: Yeah.
>> Minlan Yu: And for a network with a thousand switches, you can then only support like 50 new flows per second from each switch.
>>: That's a different bottleneck. The 10 milliseconds.
>> Minlan Yu: Yes, that's a different bottleneck. It's more about the throughput bottleneck rather than the delay bottleneck. I hope DIFANE's solution can address both the delay and the throughput bottlenecks.
>>: So how much faster is it then to encapsulate and send it -- I mean, how much time does it take?
>> Minlan Yu: So in the software evaluation we used the Click software router to implement these kinds of buffering techniques and the encapsulation, and the difference is like 10 milliseconds versus one millisecond.
>>: So what is the purpose of the encapsulation, just to know where the packet came from?
>> Minlan Yu: The purpose of the encapsulation is to create a tunnel from the ingress switch to the authority switch.
>>: Yeah, but why --
>> Minlan Yu: So all the switches in the middle [inaudible] will know how to send the packet and won't do another redirection.
>>: These aren't visible links.
>>: Why is it bad to do another redirection? I mean the redirection is what happens -- the switch looked up the packet in its TCAM and saw that the default rule said send it that way on that link. What would be the harm if all the switches did that on the way to the authority switch?
>> Minlan Yu: So here, when the packet arrives at the switch, the switch knows which direction the authority switch is. Without the tunnel, all the switches in the middle would also need to know that the ingress switch hasn't processed the packet yet and that it needs to be redirected. Right? If the ingress switch has already matched the packet with a cached rule, then it can be sent directly, and the switches in the middle should not do anything. So all the switches on that path would need to know whether they need to redirect or not, and that adds to the complexity and [inaudible].
>>: [inaudible] switches saying it's [inaudible].
>> Minlan Yu: Yeah. VLAN tagging.
>>: So in terms of the expressivity of the policies that you can have now, is this similar to just dealing with the controller bottleneck by using multiple controllers?
>>: Yeah.
>> Minlan Yu: But if you have multiple controllers, where are you going to place them? Do you have all the packets sent over extra links to this distributed set of controller nodes, and how do you distribute the state among these multiple controllers? One thing I can say is that you can use DIFANE's technique to distribute the state among these controllers. But if we move these kinds of state back to the switches, then you no longer need these extra links to go through the controller.
>>: Okay. So I hear you saying there's a trick to distributing these rules. When you say you can use DIFANE's strategy to distribute -- to partition the set of rules among multiple controllers.
>> Minlan Yu: Yes.
>>: So that's not something that's straightforward.
>> Minlan Yu: Yes. So we chose to do partitioning instead of replication, right. A naive solution would be to let all the controllers store the whole set of rules, so whenever a packet arrives, whichever controller receives it can process it. But the problem there is how you direct the packet to the right controller and how this controller maintains consistent state with the other controllers.
>>: Do you want to tell us what happens if, after you partition the rules, let's say an authority switch goes down?
>> Minlan Yu: Yes, I'm going to talk about that. Okay. So in summary, the key benefit we get is that now we move the packet processing entirely into the hardware path instead of going through the controller path. And also it's distributed among many authority switches. An extreme example: even if this rule caching action is really slow, in our design the packets don't need to wait for the rule caching; all the following packets simply also get redirected through the authority switches. Okay.
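To make the data flow just described concrete, here is a minimal, self-contained sketch of the two-stage design, written as an assumption-laden illustration rather than DIFANE's actual implementation; the class and function names (Rule, IngressSwitch, AuthoritySwitch, lookup) are made up. Rules match wildcarded header fields with priorities, the way a TCAM would; the ingress switch first tries its cached rules, and on a miss uses its pre-installed partition rules to redirect the packet, in the data plane, to the responsible authority switch, which applies its authority rules and pushes a cache rule back. For simplicity the sketch caches the matched authority rule directly; handling overlapping rules safely needs the extra care discussed later in the talk.

```python
WILDCARD = None  # a field value of None matches anything, like a TCAM wildcard

class Rule:
    def __init__(self, priority, src, dst, action):
        self.priority, self.src, self.dst, self.action = priority, src, dst, action

    def matches(self, pkt):
        return ((self.src is WILDCARD or self.src == pkt["src"]) and
                (self.dst is WILDCARD or self.dst == pkt["dst"]))

def lookup(rules, pkt):
    """Return the highest-priority matching rule, as TCAM hardware would."""
    hits = [r for r in rules if r.matches(pkt)]
    return max(hits, key=lambda r: r.priority, default=None)

class AuthoritySwitch:
    def __init__(self, authority_rules):
        self.authority_rules = authority_rules    # proactively installed by the controller

    def process(self, pkt, ingress):
        rule = lookup(self.authority_rules, pkt)
        ingress.cache_rules.append(rule)          # reactive caching back at the ingress switch
        return rule.action                        # the packet never leaves the data plane

class IngressSwitch:
    def __init__(self, partition_rules, authorities):
        self.cache_rules = []                     # installed on demand by authority switches
        self.partition_rules = partition_rules    # lowest priority, cover the whole flow space
        self.authorities = authorities            # partition id -> authority switch

    def process(self, pkt):
        cached = lookup(self.cache_rules, pkt)
        if cached is not None:                    # following packets hit the cache locally
            return cached.action
        part = lookup(self.partition_rules, pkt)  # first packet: redirect, never buffer
        return self.authorities[part.action].process(pkt, self)

# Tiny usage example: one authority switch owns the whole flow space.
auth = AuthoritySwitch([Rule(10, WILDCARD, "10.0.0.1", "drop"),
                        Rule(1, WILDCARD, WILDCARD, "forward")])
ingress = IngressSwitch([Rule(0, WILDCARD, WILDCARD, 0)], {0: auth})
print(ingress.process({"src": "10.0.0.9", "dst": "10.0.0.1"}))  # "drop", via the authority switch
print(ingress.process({"src": "10.0.0.9", "dst": "10.0.0.1"}))  # "drop", via the local cache
```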
So now the question left is: since we have a group of authority switches, how does the ingress switch know which authority switch to ask for the rules? If we look at this example, we partition this flow space into three parts, and to describe this partition we just need three wildcard rules, each describing the boundary of one partition. By doing that, we can easily describe this partition in the TCAM of the switches. So all these authority switches together are actually hosting a distributed directory of rules. It's not a DHT, because hashing doesn't work here; but by customizing the partition we are able to use the TCAM hardware that already exists in the switches to support this partition.
>>: It's not an accident that [inaudible] right? You're just matching bits here?
>> Minlan Yu: Yes.
>>: So the fact that that's two and four is not an accident. You couldn't do three and five?
>> Minlan Yu: Yes. Conceptually this is one rule, but in TCAM we may need several rules to describe it.
>>: Right. So you divided up -- you just matched the bits in the packet headers for --
>> Minlan Yu: Yeah, it's the traditional trick in TCAM: if you have a wildcard rule, you implement it with power-of-two boundaries. Yeah. Okay. So now we know that this partition information is just some partition rules we can install in all the switches. And what are these authority switches? They are just normal switches; they just store a different set of rules called authority rules. In general, one switch can be both an ingress switch and an authority switch at the same time. So in general a switch has three sets of rules in the TCAM, and by storing these three sets of rules we can easily support DIFANE. The first set are cache rules, which exist in the ingress switches and are actually installed by the authority switches. The second set are authority rules, which exist in the authority switches and are proactively installed by the controller. Finally, there are partition rules, which exist in all the switches and are proactively installed by the controller. There are different priorities among these rules. If a packet matches a cache rule it can be processed directly, so cache rules have the highest priority. The partition rules have the lowest priority, because if a packet doesn't match any of the cache rules or the authority rules, it will certainly match one of the partition rules. With these partition rules we can always keep the packets in the data plane. So we actually implemented DIFANE with OpenFlow switches, and we don't need any change to the data plane; we just install the three sets of rules in the data plane. Then, if a packet matches an authority rule, we send a notification to the control plane, and in the control plane we implement a cache manager that sends cache updates to the ingress switch. At the ingress switch, if it receives some cache updates, it changes its cache rules accordingly. So we can see that the authority rules and the cache manager only exist in the authority switches, and we just need a software modification for the authority switches to support DIFANE. The next question is how we implement this cache manager. Yes?
>>: Just from a security perspective, now what's going on is that each authority switch has the ability to change the rules of every other switch in the network?
>> Minlan Yu: Uh-huh.
>>: For its subsection.
>> Minlan Yu: Yes. We assume that all the switches in the network are trusted and are controlled by the centralized controller, so we don't handle malicious switches. Yeah.
>>: Okay.
>> Minlan Yu: Okay. So the next question is how we implement this cache manager. The challenge here is that it's a multi-dimensional flow space, and there may be wildcard rules in different dimensions. Here I'm still showing a simple example of a two-dimensional flow space, and there are several rules. These rules overlap with each other, so they have different priorities: rule one has the highest priority, then rule two, then rule three, then rule four. If a packet matches rule three, we can't simply cache rule three, because another packet that matches both rule two and rule three should take rule two's action. Instead, the authority switch should generate a new rule that covers this blue rectangle of space and cache only this new rule. A different case is that if a packet matches rule one, then we can just cache rule one directly. This caching problem becomes even more challenging when we have multiple authority switches. In this figure, when we have two authority switches, each taking half of the flow space, where do we store rule one, rule three and rule four? For example, for rule one, if we store it in both authority switches, then these authority switches may make independent decisions about what they want to cache at the same ingress switch, so there may be cache conflicts. So instead of storing rule one in both, we split it into two independent rules and store each of them in one authority switch. We can see that with multiple authority switches we actually increase the total number of rules in the whole flow space. This brings a challenge: how do we partition this flow space so that we minimize the total number of TCAM entries required across all the switches? This problem is NP-hard for flow spaces with two or more dimensions. So we propose a new solution, a decision-tree based rule partition algorithm. The basic idea of this algorithm is that cut B is better than cut A, because cut B follows the rule boundaries and doesn't generate any new rules. Okay. There are further questions about how we handle network dynamics, because the operators may change the policy at the centralized controller, an authority switch may fail, and a host may move to a different place. Here I'm going to talk about how we handle authority switch failures. We actually duplicate the same part of the flow space onto two or three authority switches that are scattered in different places in the network. This kind of duplication first helps to reduce the extra delay of redirection. For example, here we have two authority switches, and this ingress switch can choose the nearest authority switch to redirect a packet to. To implement that, we install two different partition rules for the same flow space that direct to different authority switches. In the normal case, for this ingress switch the first rule, the one with higher priority, takes effect; that means the packets get redirected to the nearer authority switch, A1. Similarly, the other ingress switch redirects packets to its own nearer authority switch.
So if authority switch A1 fails, then, since we run these OSPF protocols across the whole network, the ingress switch can easily get this kind of switch failure event notification. This ingress switch can then easily invalidate this partition rule, so that the second partition rule naturally takes effect, and it can easily redirect the packets to the next nearest authority switch in the network. So in summary, by duplicating authority switches we can reduce the extra delay of redirection, do better load balancing, and react fast to authority switch failures. We actually validated our implementation of DIFANE in a testbed of around 40 computers, and we implemented two different solutions: the first is Ethane and the second is DIFANE. In the Ethane setup, the packets get redirected to the controller, and then the controller installs the rules so that the packets can get processed in the switch. In DIFANE's architecture, the controller pre-installs the rules in the authority switch, so that the packets get redirected through the authority switch and processed directly. To be fair to Ethane, we only evaluate our prototype with a single authority switch. And also, since the key difference between the two architectures is the first packet of each flow, we only count the first packet of each flow. We then increase the number of ingress switches to test the throughput of the two systems. Here I'm showing the throughput evaluation: the X axis is the sending rate and the Y axis is the throughput, and both are in log scale. Question?
>>: Yeah. I'm trying to understand [inaudible] like you started out with wanting to make networks easy to manage, right?
>> Minlan Yu: Uh-huh.
>>: So in what sense has DIFANE achieved that? It seems you only added more complexity on top of what Ethane had.
>> Minlan Yu: So DIFANE has the same level of ease of management as Ethane, because operators only need to care about the centralized controller: they only need to say, for example, whether this group of students has access to this group of servers; they only need to consider these high-level policies in one central place, the centralized controller. That is the same in both architectures. DIFANE adds an extra layer below this controller to make the policy enforcement more scalable, and operators don't need to worry about this layer.
>>: So okay. So let me ask that. So you're essentially saying it's not like you've made management easy, you made --
>> Minlan Yu: It's about how to scale management while still keeping it easy. Today's networks are totally distributed across all the switches; that's hard to manage, but it's pretty scalable. So we hope to find the right spot that achieves both ease of management and scalability.
>>: And you don't -- okay. Just from a management perspective, you don't expect operators to pick which switches should be authority switches and things like that? They don't need to get involved in decisions at all?
>> Minlan Yu: No. Operators don't need to worry about that, and we can just randomly pick a group of switches to do that. And with our partitioning algorithm we can help operators decide, given the amount of rules they want to install and the switch capabilities they have in the network, how many authority switches and how many partitions they need.
>>: But is that because the worst case [inaudible] authority switch in your architecture is similar to the worst case known in the [inaudible] architecture?
Because if switches are doing more, I would want the better switches in my network to be picked as authority switches rather than some random small switch that connects to one computer.
>> Minlan Yu: So if the operator wants to control where the authority switches are placed, they can certainly do that, for example as a configuration input.
>>: [inaudible] I imagine the switches would be heterogeneous and operator -- you would want operators to control it.
>>: Yeah, but just basic [inaudible] wouldn't matter if --
>> Minlan Yu: Yeah, the input to our partitioning algorithm includes the different memory sizes of the different switches, and then we can make sure the amount of rules we put in each switch fits its size. And this is actually a good question. We have some ongoing work on how we can hide this heterogeneity across the switches entirely from the operators. We hope to build a kind of virtual memory manager layer in the centralized controller that manages the physical memory in the different switches, and then operators only need to work against this virtual memory to write their programs and decide where they hope to install the rules. Okay. So let's look at this throughput evaluation. Here I'm showing the figure with the sending rate and the throughput; both are in log scale. We find that with just one ingress switch, DIFANE already performs better than Ethane. This is because in Ethane's architecture the local ingress switch has a bottleneck of 20K flows per second: the ingress switch can only buffer the packets and send the packet headers to the centralized controller at a rate of 20K flows per second. Yes?
>>: The Y axis is not [inaudible] Y axis flows that I have never seen before per second.
>> Minlan Yu: Yes. So it's like --
>>: [inaudible].
>> Minlan Yu: Yeah, you're right. It's the number of packets per second, the first packets of each flow per second.
>>: First packets of flows I've never seen before, per second.
>> Minlan Yu: Yes.
>>: So how [inaudible]. [brief talking over].
>>: New flows.
>> Minlan Yu: New flows per second. In the evaluation we are only sending new flows; each new flow has a single packet, so we eliminate all the following packets.
>>: Well, okay. So how [inaudible] is this in terms of [inaudible] workloads? How often does it happen that you get 100K new flows at a switch per second?
>> Minlan Yu: I don't actually have real data on data center networks. I know that for Web traffic it's really like 30 to 40 packets per flow.
>>: Web traffic. But this is traffic [inaudible] to the data center. So Web traffic is not an issue here.
>> Minlan Yu: Yeah. It also depends on how you define the flow. It's not per [inaudible] flow; it's basically how fine-grained the control is that you want to put on different flows. If you only want to have control on the source address, then it's only about how many new sources are coming to this switch. But if you want to have very fine-grained control on each application, then it depends on which applications are sending the new flows.
>>: [inaudible] to go to packet expression which is [inaudible].
>> Minlan Yu: So today's OpenFlow supports like 10 tuples, so within these 10 tuples you can.
>>: [inaudible].
>> Minlan Yu: Sorry?
>>: At what rate can you do ten-tuple inspection?
>> Minlan Yu: It's like a normal switch today; OpenFlow doesn't add any hardware support today.
So if a switch can process one gigabit of packets per second then it can still --
>>: No. Processing one gigabit of packets per second is different from how many new flows are added per second, correct?
>> Minlan Yu: Yes.
>>: So that's your Y axis here. Your Y axis is not packets per second?
>> Minlan Yu: Yes.
>>: Your Y axis is the number of new flows per second.
>> Minlan Yu: Yes.
>>: That's what it says is per second.
>>: No, it's new flows.
>> Minlan Yu: Yes, it's new flows per second.
>>: Yeah.
>>: I'm just wondering whether you actually get those new flows per second.
>> Minlan Yu: So it depends on the flow definition. I don't actually have real traces of today's data center traffic to know what kind of new-flow rate each switch may get.
>>: And so the difference between the two of them is primarily because you run out of buffer space for the packets in Ethane. Is that what's going on?
>> Minlan Yu: The difference is that in Ethane the switch has to do some software processing to buffer the packet and send the packet header to the controller. This is in the control plane path, not the data plane path, and the control plane path throughput is only like 20K flows per second.
>>: It appears to change sharply. [inaudible] what is the bottleneck resource that you're hitting at [inaudible] in the curve?
>> Minlan Yu: For the ingress switch it's the CPU overhead, the CPU bottleneck.
>>: But you're generating almost the same amount of control plane traffic as in DIFANE because you're still getting the --
>> Minlan Yu: Yes. The key difference is that we don't need to buffer the packets anymore.
>>: So it is the packet buffer?
>> Minlan Yu: Yes.
>>: Okay.
>> Minlan Yu: Yes. Sorry. Okay. So if we increase the number of ingress switches, then both systems get better throughput, but the centralized controller can only handle 50K flows per second. With DIFANE, since we are doing everything in the fast data plane, we can achieve 800K flows per second with a single authority switch. Another benefit of DIFANE is that it scales with the network growth: if we want larger networks with higher throughput, we can just install more authority switches in the network. Another evaluation is about how we can scale to a large number of rules. We actually collected the access control rule traces from one campus network and three different networks from AT&T. For all these networks we collected this access control rule data from all the switches and derived the network-wide rules, that is, what kind of policy the operators hope to express.
>>: This is a [inaudible] question. So if I go through like six switches from coast to coast and the rules are installed in the way [inaudible] works, the rules will be installed on each of these six switches, right?
>> Minlan Yu: Ethane was only installing ingress rules.
>>: And the others would just forward?
>> Minlan Yu: It depends on which rule you mean. For this access control rule it's only the ingress switch.
>>: I see. But for the second switch in that chain of six, it's a new flow for that switch, right?
>> Minlan Yu: Yes. So for forwarding rules in Ethane, all the switches need to know how to forward the packets. When the centralized controller gets the new flow from the first ingress switch, it tells all the switches to install the forwarding rules; it's kind of a small trick to speed it up, yeah.
>>: So you do the same?
>> Minlan Yu: For us, we use tunnelling, so we don't have this issue.
>>: But the tunnelling only works for the first packet. Does the authority switch install the rules on all the six switches or only --
>> Minlan Yu: So the tunnelling also happens from the ingress switch to the egress switch; there is a direct tunnel between these two switches. So the forwarding rule in DIFANE only says, for a destination host, which switch is the egress, rather than what the next outgoing interface is.
>>: Okay. We may want to take this offline, but you're not tunnelling through -- you're not tunnelling all the packets, are you?
>> Minlan Yu: I'm tunnelling all the packets.
>>: Oh.
>>: [inaudible].
>> Minlan Yu: Yes.
>>: Clear and from then on everybody --
>> Minlan Yu: Yes.
>>: All the other switches [inaudible].
>>: So the first packet of a flow gets a tunnel label that takes it directly to the authority switch?
>> Minlan Yu: Yes.
>>: And every subsequent packet in that flow gets a tunnel label that takes it directly to the egress switch?
>> Minlan Yu: Yes.
>>: Is this [inaudible] in Ethane?
>> Minlan Yu: Ethane does this kind of installing forwarding rules in all the switches; it doesn't use tunnelling. Okay. So basically, in one evaluation we see that this IPTV network we got from AT&T has about 3,000 switches and 5 million rules. The number of authority switches we need depends on the network size, the TCAM size and the number of rules, but in general we only need about 0.3 percent to 3 percent of the network switches to be authority switches. So in summary, let me use this slide to summarize what DIFANE does. Traditional networks are totally distributed, but they are hard to manage. The new architecture that comes from OpenFlow and Ethane uses a logically centralized controller, so it's easy to manage, but it's not scalable. What DIFANE hopes to do is pull back this trend a little bit: in DIFANE the controller is still in charge, so it's still easy to manage, but all the switches host a distributed directory of rules, so it's more scalable. And when we talked to Cisco, they seemed interested in this trend from traditional networks to DIFANE, because in their traditional networks they also have this kind of limited TCAM space in the routers that's not enough to handle all the access control rules. So they view DIFANE as a distributed way to handle a large number of [inaudible] rules among a group of routers, to address this limited memory in a single router.
>>: So when you have a large number of authority switches -- if you have 3 percent of 3,000 switches then it's 90 [inaudible], that's a fairly large thing. And if it turns out that the rules you have tend to extend in both directions, then in general the number of subrules that you have to create is going to go up with the product of the number of authority switches and the number of rules. Right? So now we're looking at hundreds of millions of subrules.
>> Minlan Yu: Yes.
>>: Which is going to cause you to eat TCAM memory faster than in Ethane, which is going to cause cache misses to go up. How much --
>> Minlan Yu: So basically you need a smart partition algorithm that can reduce the number of rules. Especially if there are many overlapping rules in one area, I want to not cut that area but keep those rules in a single switch.
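The intuition behind "cut B is better than cut A" from the partitioning discussion can be made concrete with a toy cost function. The sketch below is my own illustration of that intuition, not the decision-tree algorithm from the paper: each rule is treated as a box of ranges in the flow space, and a candidate cut is scored by how many rules it would split, since every split rule costs extra TCAM entries on both sides of the cut.

```python
def cut_cost(rules, dim, value):
    """Extra TCAM entries created by cutting dimension `dim` at `value`:
    a rule whose range on `dim` strictly contains `value` must be split
    into one sub-rule on each side of the cut."""
    return sum(1 for r in rules if r[dim][0] < value < r[dim][1])

def best_cut(rules, candidate_cuts):
    """Pick the candidate (dim, value) cut that splits the fewest rules."""
    return min(candidate_cuts, key=lambda c: cut_cost(rules, *c))

# Example: two non-overlapping access-control rules on the source dimension.
rules = [{"src": (0, 4), "dst": (0, 8)},   # e.g. block sources 0-3
         {"src": (4, 8), "dst": (0, 8)}]   # e.g. count sources 4-7
cuts = [("src", 2), ("src", 4)]            # cut A vs. cut B
print(cut_cost(rules, "src", 2))           # 1: cut A splits the first rule
print(best_cut(rules, cuts))               # ('src', 4): cut B follows the rule boundary
```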
>>: I don't have a good feel for what these rules look like, but my guess would be --
>> Minlan Yu: So actually for the AT&T data our algorithm works --
>>: How localized are --
>> Minlan Yu: The reason is that it's mostly access control rules that want to make sure the right set of customers have the right access to the resources. So basically these customers are pretty independent, and there's not much overlapping among these rules.
>>: So rules don't tend to be -- block off traffic from one particular source or block off traffic to a particular destination --
>> Minlan Yu: Yes, they're more independent.
>>: [inaudible] small. Okay.
>> Minlan Yu: Yeah. But [inaudible] with more overlapping rules it's much harder to break them apart. Yeah.
>>: [inaudible] particular [inaudible] let's say, you know, if your rule set looks like half block-destination, half block-source, then no way you cut it is really going to help, because [inaudible] rules, and the consequence of that is that if you try to scale this thing you're going to eat up TCAM nonlinearly [inaudible].
>> Minlan Yu: Yes.
>>: And --
>> Minlan Yu: So for that, if you have authority switches with larger TCAM memory, then you need to cut less, and that will help a lot.
>>: All right.
>>: One small question. How are these rules invalidated when the policies change?
>> Minlan Yu: For non-security rules we rely on timeouts. For those rules that are really important, you may need to remember them and explicitly modify them [inaudible].
>>: But if you have cache contention, you need to evict something from the cache. You've got this problem you pointed out earlier that the rules interact. Is it --
>> Minlan Yu: Yes. In the worst case, all the traffic just always goes through those authority switches.
>>: Is it always safe to evict a cache line or are they sometimes interdependent from the [inaudible] switch?
>> Minlan Yu: So in Ethane's work, when you evict a rule, all the traffic needs to go through the controller, so there is a potential attack on the controller: if you send a lot of traffic, it will overload the controller. But in DIFANE's case, all the traffic will go through an authority switch, which is much harder to overload.
>>: I guess -- my question wasn't about [inaudible].
>> Minlan Yu: Okay.
>>: It was that the rules have these funny interactions you showed.
>> Minlan Yu: Yes.
>>: Where the rules overlap.
>> Minlan Yu: Yes.
>>: And you described a technique for why it's safe to send a part of a rule to an ingress switch.
>> Minlan Yu: So --
>>: Does that same property -- does the way you cut up the rules have the property that it's always okay for an ingress switch to throw away a single rule, or are the rules sometimes interdependent, where throwing away one rule will --
>> Minlan Yu: Yes. So when we cache a rule, we make sure it's independent, so that [inaudible]. Okay. So let me quickly talk about the second project, which is SNAP: how do we perform scalable performance diagnosis for data centers? Remember, in DIFANE we achieve scalability by rethinking the division of labor between the centralized controller and the switches. Now we are hoping to rethink the division of labor between the network and the hosts, so that we can achieve better scalability. If we look at the search application in data centers, a request arrives at a front-end server.
This front-end server actually distributes the request layer by layer to a group of workers, these workers generate responses, and these responses are aggregated layer by layer back to the front-end server. So we can see that just a single request from a single search application already generates a lot of traffic in the data center, and, in fact, most of the servers in the data center are shared, running different storage, MapReduce jobs and cloud applications. So it's really a very messy environment, even when everything works fine. If something goes wrong, say a switch fails, a link fails, or there is some performance problem or bug in a software component of these applications, it's really hard to identify. So essentially it's a challenging problem to diagnose performance problems in data centers, because these applications usually involve hundreds of application components and run on tens of thousands of servers. Another problem is that there could be new performance problems coming out every day, because developers keep changing their code to add new features or fix bugs. And old performance problems keep coming back too, because of human factors. Even for those performance problems that we already know about, developers without a networking background may not fully understand them. For example, in networking we have these detailed protocol behaviors like Nagle's algorithm and delayed ACK [inaudible]; they can be a mystery to developers without a networking background, and for them, working with the network stack can be a disaster. So how do people do diagnosis in today's data centers? They first look at the application logs, which tell the number of requests you can process per second or the response time for each request. So developers can identify performance problems using these logs; they could find out that one percent of the requests have a really long delay, more than 200 milliseconds. Since this only happens for a small portion of the requests, they may think it's not their application code's problem but probably the network's. But on the network side we only have very coarse-grained switch logs, which only tell the number of bytes and the number of packets each port processes per second in the switch. So to identify what really happened for these requests, these switch logs don't help much. Instead, the developers have to install very expensive packet sniffers to capture large amounts of packet traces. Then they have to filter out the traces that correspond to this one percent of requests and, either manually or with some tools, analyze these packet traces to find out what really happened and find the root cause of these problems. But most of the time we find that the problem actually comes neither from the application code alone nor from the network alone, but from the interactions between the network and the applications. So we hope to build a tool that sits in the operating system layer, in the network stack at the end host, so that it can tell us directly about the interactions between the applications and the network. We don't want to use application logs because they are too application specific; we don't want to use switch logs because they are too coarse-grained; and packet traces are too expensive to capture all the time.
Instead, if we can build a generic tool that's lightweight enough to run all the time in the network stack layer at the end host and fine-grained enough to capture all the performance problems, that would be great. So we actually built a tool called SNAP, a scalable network-application profiler, that runs everywhere, all the time, in the data center to capture the performance problems. In SNAP, the first step is to collect network stack data from all the connections on all the end hosts. There are basically two types of stack data we collect. The first type, which is relatively easy to handle, is cumulative counters. These counters are things like the number of packet losses, the number of fast retransmissions and timeouts that each connection has experienced, the round-trip time estimation that TCP uses for congestion control, or the amount of time the receiver is receiver-window limited. For these counters we can simply read them periodically and calculate the difference between two reads. The second type of data is instantaneous snapshots. These are relatively harder to capture, because if you read them periodically you may miss some important values. So we use Poisson sampling to make sure the snapshots we read are statistically unbiased. These data are things like the number of bytes in the send buffer, the congestion window size, the receiver window size and so on. With the data we collect from the network stack, we hope to do the performance classification. Yeah?
>>: I have a question. I know that a lot [inaudible] don't they provide these counters for [inaudible].
>> Minlan Yu: Yes. The good thing is that the Windows operating system already provides these counters, and we just need to read them.
>>: I don't mean [inaudible].
>> Minlan Yu: The NIC doesn't have this kind of per-flow, TCP-level information. It only has per-packet information.
>>: Even the ones that actually offload -- you can offload TCP.
>>: TCP is never fully offloaded.
>>: Whatever you offload, I mean presumably the sampled RTT you imagine that --
>> Minlan Yu: Yes, RTT is easier to handle, but other stack information, like how we do congestion control and what the receiver window size is [inaudible], yeah, it's harder to get. So we think that the stack information in the operating system layer is only a small amount of information, but it's useful for [inaudible].
>>: I guess my question is, I note that parts of this stack information are being offloaded onto the NIC, so the question would be how much could the NIC help you with collecting this information without involving the operating system, which presumably is quite busy doing other things?
>> Minlan Yu: So I think if you have a way to combine the NIC and the OS together it would be even better. For example, one thing we don't really get here is, if the packets get delayed, we don't really know whether they were delayed in the NIC or in the network. So if you have more information [inaudible]. Okay. So with the data we collect we hope to do some classification, but it's a hard job, because these performance problems may come from different application code and we can't really classify all the root causes. So instead of classifying the root causes of the problems, we classify the problems based on the life of the data transfer.
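As a rough sketch of these two collection modes (my illustration, not SNAP's code): the counter names below are made up, and read_stats stands in for whatever per-connection interface the operating system exposes, such as the TCP extended statistics on Windows. Cumulative counters are read periodically and differenced; instantaneous values are sampled at exponentially distributed intervals, i.e. as a Poisson process, so the samples are not biased by periodic application behavior.

```python
import random
import time

CUMULATIVE = ["packet_loss", "fast_retrans", "timeouts", "rwnd_limited_time"]
SNAPSHOT = ["send_buffer_bytes", "cwnd", "rwnd"]

def diff_counters(last, now):
    """Cumulative counters: report the delta between two periodic reads."""
    delta = {k: now[k] - last.get(k, 0) for k in CUMULATIVE}
    last.update({k: now[k] for k in CUMULATIVE})
    return delta

def poisson_snapshots(read_stats, conn, mean_interval_s=0.5):
    """Instantaneous values: take snapshots at Poisson-distributed times."""
    while True:
        time.sleep(random.expovariate(1.0 / mean_interval_s))
        stats = read_stats(conn)             # whatever per-connection stats the OS exposes
        yield {k: stats[k] for k in SNAPSHOT}
```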
So here is an example: the packets start from the sender application, go to the send buffer, to the network, to the receiver, and then the receiver acknowledges the data, the transfer of the data. At each part there may be some problems: the send buffer may be too small, so it limits the throughput; the network part may have different types of packet loss events; and the receiver may either not be reading the data fast enough or not ACKing the received data fast enough. Luckily, with the counters we collect, we can easily tell the problems at the different stages. For example, for the network, we already have the data about the number of fast retransmissions and the number of timeouts to tell the network problems directly. At the receiver end, we have the receiver-window-limited time and the round-trip time estimation to tell the delayed ACK problem. And we have the number of bytes in the send buffer to tell the send buffer problems. If it's not a problem at any of these other stages, then we classify it as a sender application problem. Most of the problems we find can be directly measured using this stack data; some rely on sampling and others rely on inference, so there is some error rate, which is evaluated in the paper.
>>: So is this classification going on in realtime or is it data --
>> Minlan Yu: Yes, one of our goals is to keep the classification simple enough that it can be done in realtime; it can tell directly about the problems for each connection in realtime.
>>: And is this done in the root VM or in the guest VM on a -- in a data center [inaudible].
>> Minlan Yu: So if you have this kind of visibility in the guest VM then you can understand --
>>: [inaudible].
>> Minlan Yu: Yeah. If you have the operating system support there, then you can get it.
>>: [inaudible].
>> Minlan Yu: So after we get all this data and the performance classification for all the connections on all the hosts, we send this data to a centralized management system. Data center operators usually have full knowledge of their data center, like the topology, the routing and the mappings from connections to processes and applications. So using all these data, we can do some cross-connection correlation: we can find the shared application code or shared resources, like the host, the link or the switch, that actually cause the correlated performance problems. So in summary, SNAP has two parts. The first part is that on each host we have some online, lightweight processing and diagnosis code, and we also have an offline, cross-connection diagnosis part to identify these correlated problems. And we actually ran SNAP in the real world in one of Bing's data centers. That data center has about 8,000 machines and 700 applications. We ran SNAP for a week and collected terabytes of data. Using this data we were able to identify 15 major performance problems, and during this week about 21 percent of the applications suffered network performance problems at some point. Here's a summary of the performance results we found. The first is that we found one application that has the send buffer performance problem for more than half of the time; it's because the developers did not set the send buffer large enough. Also, there are about six applications that have lots of packet loss events at the network layer for more than half of the time.
>>: Is the analysis straightforward once you have the data? >> Minlan Yu: Uh-huh. >>: It is? >> Minlan Yu: So the classification part is really straightforward. It's our goal to keep it really simple so we can do it online. And then the correlation part involves a large amount of topology and routing information, and how to use it. >>: [inaudible] analysis for one TCP flow, because I'm thinking back to all the work on trying to -- like critical path analysis on TCP or [inaudible]. >> Minlan Yu: So for one connection, most of the identification is really simple. With packet traces, people have developed a lot of techniques to infer what kind of problem it is, but here, since you are getting the data directly from the stack, you can directly tell what the problem is. >>: Okay. And if there are multiple problems, do you have a sense of what the bottleneck is, which one is the real one? >> Minlan Yu: So basically we look at the stages of the data transfer, and for a connection that is always throughput limited, we want to know which stage is the cause of the throughput limitation. If it's the application, that's a good sign for us, because it means the application is not running fast enough to generate the data. And since we have these counters in the stack, we know which stage is causing the throughput limitation. >>: It's not clear to me how you can automatically decide, when there is a bottleneck, whether it's the send buffer, a timeout, or a fast retransmission; what will be [inaudible] look at the log and decide [inaudible]. >>: You can do that. >> Minlan Yu: Because you already have the counters for the number of timeouts and the number of packet losses. Using these counters, you can directly tell whether this connection is limited or not. The key trick is that since we are observing at the TCP stack layer, it's the stack that handles these kinds of bottlenecks. If the receiver end is too slow, the stack needs to send less, so the stack already has the information about whether the receiver is slow or not. Yes? >>: How could this information get represented to the application? >>: [inaudible]. >>: So the application might -- >> Minlan Yu: So here we use. >>: [inaudible]. >> Minlan Yu: Basically [inaudible]. >>: [inaudible]. >> Minlan Yu: They have mappings from each connection to different processes [inaudible] they use the information to [inaudible] the applications. But we can use that information to do the mapping too, between connections and applications. So we can find out, for one application, which connections it is using. >>: [inaudible] finish up and then we can [inaudible] questions. >> Srikanth Kandula: You have like five minutes. >> Minlan Yu: Yeah. I'm almost finished. So basically we found that delayed ACK is a very important problem in this data center. One example is an application that could only send about five records per second at some times and 1,000 records per second at other times. The reason is that if a data transfer has an odd number of packets, the receiver may wait for 200 milliseconds before it sends the ACK, and if the sender is waiting for that ACK, it significantly reduces its throughput.
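A quick back-of-the-envelope check of those two numbers, under the assumption that the stalled sender waits out the full 200 ms delayed-ACK timer for each record, while the normal case is bounded only by a roughly 1 ms intra-data-center round trip:

    delayed_ack_timer_s = 0.200   # receiver holds the ACK for an odd final packet
    normal_rtt_s = 0.001          # assumed ~1 ms round trip inside the data center

    stalled = 1 / delayed_ack_timer_s   # ~5 records per second
    normal = 1 / normal_rtt_s           # ~1,000 records per second
    print(stalled, normal)              # 5.0 1000.0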
Delayed ACK was used in the Internet environment to reduce bandwidth usage and interrupts. But in the data center environment we hope people can disable it, because today we have much higher bandwidth and more powerful servers in the data centers, so we no longer need this feature there, and it's causing a lot of problems. Interestingly, when we talked with Google people, they had also found this delayed ACK problem in their data centers, but they found it with expensive packet sniffers installed on certain racks, so they don't have a sense of how common this problem is across the entire data center. So let me use the delayed ACK example to summarize SNAP. First, we monitor at the right place: we monitor at the TCP stack layer so we understand the network-application interactions, and we monitor at the end hosts so it's more scalable and more lightweight. We also have algorithms to identify these performance problems for each connection, for example how we infer delayed ACK problems. Then we have some correlation algorithms that can find problems across connections; for example, we can identify those applications that have significant delayed ACK problems. Finally, we hope to work with operators and developers to fix these problems, like disabling delayed ACK in data centers. So in summary, we built two systems. The first system is called DIFANE, which focuses on how to scalably configure devices to express policies. The second system is called SNAP, which focuses on how to scalably collect management data and perform diagnosis on the end hosts. For future work I hope to continue working in this edge network area, but focusing on the interactions between the applications, the network, and the platform. The first aspect is how to make the platform better support applications: making it more flexible using software-defined networking techniques, and making it more secure by combining host and network defenses. On the other hand, I want to look at how to make applications use the network better. I want to look at new types of applications and see how to fit them better with emerging types of networks like the cloud and home networks. I have also done other research in two main streams: how to support flexible network policies, and how to build networks that support applications better. I'm happy to talk about all of this offline. So in summary, my research style is to look at practically challenging problems and to combine new data structures with real prototyping, so that we can solve problems in Internet and data center networks. Thank you. [applause]. >>: Any questions? >>: I'll pass. >>: [inaudible]. >>: [inaudible]. >>: I'll ask you a question. [laughter]. >>: [inaudible] the current apps, the number of apps [inaudible] average problems. So, yeah, this one. I must have missed it. The six apps that are having the network problem, what is the network problem that you -- >> Minlan Yu: So all these problems are packet loss [inaudible] have different types of packet loss events. So we think these applications may not be written in a good way, so that, for example, they issue synchronized writes at the same time, which always causes packet loss. >>: Okay. [applause]