>> Jitu Padhye: It's my great pleasure to welcome Chi-Yao to give a talk. Many of you know him already. He was the lead intern on the SWAN project last summer, and he has continued to work in the area of software-defined networks. >> Chi-Yao Hong: Thank you, Jitu. So let's get started. I'm Chi-Yao, currently a PhD student at UIUC, and today I'll talk about my systems for software-defined transport; this is tied to the SWAN project and some extensions. So let's begin with a quote, just for fun: "It is anticipated that the populous parts of the United States will within two or three years be covered with a network like a spider's web." Anyone want to make a guess when this is from? >>: 1845? >> Chi-Yao Hong: Who said that? Who said that? Okay, any guess? Any guess. Sixties, 1960s? Okay. Okay. That's pretty close. That's a very good answer, and this is about the electric telegraph, so this is just for fun. The quote essentially illustrates that people have been designing network systems for hundreds of years already; it's just that in the past they were constrained in very different ways. For example, when people designed wide-area networks, they were constrained by where the population was and even where the railroads were, to deploy the network. And today, in this talk, I will talk about cloud networking. So what's so special about the cloud? Why are we redesigning the whole network for the cloud? If you take a step back and look at today's cloud infrastructure, people have designed datacenter networks to interconnect hundreds of thousands of servers inside a DC, they have deployed multiple datacenters across continents, and they have built their own backbones. Today, Google or Microsoft or Facebook each have their own inter-datacenter wide-area network backbone to deliver traffic from one site to another. This picture shows Google's early deployment of their inter-DC WAN. So what's so special about this? Why do we need a different network for this? Why can't we just use today's network protocols and be done? I see both challenges and exciting opportunities here that I want to address. First, this has been a very challenging task, because it's critical infrastructure. Essentially, these are resources that everyone depends on. You just can't fail. You must provide high performance; in particular, you want to meet the service requirements and keep your customers happy. This directly reflects your business requirements, and keeping your customers happy directly reflects on your revenue. So by doing a good job of optimizing the network, you are keeping your customers happy. The second challenge is that, at the same time, these are extremely expensive resources. For example, Google spent more than $7 billion last year on their datacenter infrastructure, and roughly 15% to 20% of that cost goes to network equipment. So by running your network more efficiently, optimizing it for better resource usage, you are potentially saving lots of money. You are making a huge impact there. And at the same time, it's also a very exciting place to do research, because we have central, unified control of the whole network, which we don't get in the traditional Internet.
And that's great, because you can have a new network architecture physically deployed in these new places, and you can get huge impact by deploying new stuff. And also, SDN is a recent concept — I don't need to describe SDN here at Microsoft — but the SDN idea gives you flexible access to the switches' forwarding plane, making the forwarding plane much easier to change, and that's something this new architecture can be tied to when it is deployed. So let me step back and look at how today's cloud networks are run and why that does not satisfy our goals. The two key motivations are how to achieve very high network efficiency, because the network is extremely expensive and you want to save money by running it efficiently, and, at the same time, it's highly critical infrastructure, so you want to satisfy the user requirements. So why don't we just run what we have today? What we have today is a soup of protocols with knobs and dials you can configure, like TCP, BGP and so on: routing protocols run across the switches, and host-based protocols run across the end hosts. The key issue, we think, is that there's no clear, programmable API for those protocols. If you want to, for example, optimize your network resources, or enforce some fairness constraints across services to maintain performance isolation across different tenants, there's no clear way to do that with today's protocols, and it's essentially hard to optimize your network with today's architecture. I think there are three key problems in this domain. One, the protocols are not very flexible. They are mostly monolithic, and they run their predefined algorithms. You have few knobs you can tune, and if the algorithm they run is not what you want, you won't get the property you want. Second, it's also very hard to reason about what performance today's network protocols will provide. Essentially, there is a big gap between what today's protocols can provide and the high-level transport policies that network operators usually want to enforce. For example, as a multitenant datacenter network operator, this is probably something you would like to have: I want tenant A to have some latency guarantee for all the services they are running across their VMs; for tenant B, I want some bandwidth guarantee if they are moving big data from one site to another; and for the rest of the tenants, I probably want something like a mix of prioritization or fairness based on their payments, things like that. Those are the high-level policies you want to enforce, and you cannot tell today's protocols, hey, this is the high-level policy I want, go and optimize the network for it. You simply cannot do that. There is a big gap between what those protocols provide and the high-level policies operators want to enforce. Some people are smart; they can come up with their own algorithms and say, I know what I want, and I will implement new functions to support it. But the issue is that it's still very hard to deploy. Implementing new protocols usually requires custom changes at the end hosts or the network switches or both, which makes the time to market very long, so it's not that practical to deploy.
So to fix those issues, my vision is to make today's transport architecture more programmable, and we think this can serve as a killer application for optimizing network performance. This is the architecture we propose — we call it SDT, Software Defined Transport — and let me give you a quick view of the system flow and how it works. Essentially, you have a network where the black dots here are switches and the blue dots are servers. This architecture is tied to SDN, and what SDN gives you is simply a thin interface to the switches' data plane that allows you to change the forwarding plane, plus a logically centralized controller that talks to those devices so you can program the network forwarding plane. So what this gives you is essentially just low-level access to the network switches: we can change the routing table, we can change the forwarding plane. But it's not the northbound API we need. It's not a network optimization interface, and it's not the right API to expose to the network operator. There is some rough consensus on which protocol to use between the SDN controller and the network devices, but very little consensus on the right framework above it. What we need is an ecosystem that tries to optimize the whole network with the goal of maximizing network performance and making the customers happier. To solve this, there's another important component we leverage in this architecture: an interface attached to the end hosts to control the servers' sending behavior. We run a parallel component here we call the host controller, which controls the sending behavior by both allocating rates to the end hosts and collecting the flow demands from the end hosts, to know their requirements on network usage. And on top of this, we have another layer we call the resource optimizer, which is where we actually run the interesting resource allocation algorithms — how much rate each service should send, which paths they should choose, which paths they should load balance across — based on the information given by the host controller about flow demands and also the topology and traffic information, the network-level information, from the SDN controller. And you also get the high-level utility function and transport policy from the network operators. All right. Any questions so far? So let me quickly give you a brief summary of the key results when we apply this architecture to different types of networks. Yes. >>: Are you only looking at resource management policies, or are you looking at a richer set of policies? I don't know, like traffic should flow through two middleboxes, or traffic should never flow through China? >> Chi-Yao Hong: Sure, sure. So for now, in the architecture we propose here, there's an opportunity to support other things, and at the end of this talk, I will also talk about my future research plans. Of course, middlebox placement is something very interesting and also highly related to resource allocation, where it can be integrated with this architecture. So there are other things that can be discussed — a firewall could be one thing that's very interesting.
>>: But for this talk, you're going to focus on resources. >> Chi-Yao Hong: Exactly. All right. So let me try to summarize the results. When we applied this to the inter-datacenter WAN, we showed that this architecture can help us carry about 60% more traffic than today's practice, and we also have congestion-free updates to ensure that no congestion happens during network transitions, when you make global changes to the network forwarding plane. We also showed that this works with limited switch memory. In another scenario, we applied this architecture to intra-datacenter networks, where we see a big improvement by doing more fine-grained flow scheduling based on the software control framework: we can reduce mean flow completion time by 30% by doing a good job of flow scheduling, and we also showed that this supports three times more deadline flows than today's practice. And we have some early scalability results, where, using just a single desktop as the controller, we are already able to scale up to about 10,000 servers with sub-second response time. Yes, Alec. >>: What's a deadline flow? >> Chi-Yao Hong: Many online services have a multi-stage behavior: a request comes in from the user, the service wants to respond within a certain deadline, and the operation is broken into multiple stages, so the flows inherit deadlines — a flow has to complete in, say, 30 milliseconds; otherwise, its result won't be integrated into the final result that goes back to the user. And that's why the context here -- >>: In a way, are these specified by the services? >> Chi-Yao Hong: Yes, deadlines can usually be specified by the service providers. For example, if you're running a web search, you don't want your results to get back to the user too late, and there are studies saying that if you are delayed by 100 milliseconds, you can drop your total revenue by 3% to 5% — those are the Google and Amazon studies. That's why, in the context of inside-datacenter resource scheduling, it's quite important to look at network-level flow deadlines, and those are what people try to satisfy. >>: The way that this works is the network administrator will configure the deadline at the controller? >> Chi-Yao Hong: Yes, so flows are attached with deadlines, and you either meet the deadline, or the flow is not going to be very useful. All right. So this is where those papers got published and the people I collaborated with — many people here, I know — and the second part was published at SIGCOMM 2012, about how to do flow scheduling efficiently. And the last part is my ongoing work on resource scheduling inside datacenter networks, and we have an early version published at the Open Networking Summit just a couple of weeks ago. All right, so this is too much to cover in the limited time, so I want to focus on the following three things. One is, when you apply this architecture to the WAN, how does it help us improve WAN efficiency? That's the big application I want to talk about. And the second thing is, when we apply it to the intra-datacenter network, how can we push the scalability toward more fine-grained, real-time resource scheduling, based on this software resource controller doing transport rate control?
And then finally, if we have time, I'll talk about my future work and other related work. Okay, so let me give you two motivating examples about the WAN — why are we looking at the WAN? This is production inter-DC traffic, measured from a production WAN at Microsoft, and here we show the time series of the normalized traffic rate across roughly a day. As you can see, traffic goes through peaks and valleys over time, and today's practice is to protect the important traffic, so the important traffic will always get through; to do so, they provision for the peak — that's how we normalize the traffic rate here — to make sure the important traffic won't get dropped or delayed. By doing so, the issue is the mean -- yes? >>: Just for comparison, can the traffic between two datacenters go to zero? >> Chi-Yao Hong: It could. For example, some sites are just small clusters, or a site could be a transit point; they could have little demand to the other sites. That could be the case for some pairs of locations. Yes. >>: This is normalized to the peak traffic. >> Chi-Yao Hong: Yes. >>: So what is the actual utilization? >> Chi-Yao Hong: Actual utilization? The actual utilization is even higher than this. I'm saying, even assuming ideal bandwidth provisioning, you already see the mean utilization for this link is below 50%. >>: What I'm saying is, at your highest traffic rate, what is the actual utilization like? >> Chi-Yao Hong: Highest utilization? On average, it's like 30% to 40% network wide, across time, of course. So one way to fix this is to look at the traffic characteristics inside the inter-datacenter network, where we have different classes of traffic. We have background traffic — replication traffic, moving big chunks of indexes from one site to another. That sort of traffic can be delayed for hours without compromising its service requirement; it's moving big data. And another type of traffic, which we call non-background, is user-facing, time-sensitive traffic. That is traffic triggered by or related to the user; you don't want to delay it. You want to get -- yes, questions. >>: How do you know that you can delay the background traffic? You claim that you can definitely delay the background traffic. How do you know that's true? >> Chi-Yao Hong: So this is from the experience we have, for example, at Microsoft. We talked to the app developers to see what their real requirements are, and we tried to do traffic classification and to delay traffic for hours without compromising those requirements. It depends on which type of service we are talking about; for some of them, delivering within a day would be okay. So for the different types of traffic, essentially, we talked to the apps. We have control of the whole network, we have access to the application developers, and that's how we got this information. Yes. >>: Was there a more concrete definition of what was background and what was labeled non-background? >> Chi-Yao Hong: One definition I can give is that non-background is user facing. That you don't want to delay. You don't want to rate control it. You don't want to delay it.
For that traffic, you have to adapt: you do passive measurement to see how much demand it has sent so far and make predictions about what it will send in the future. It doesn't participate in this architecture; you can think of it like that. >>: So the inter-datacenter traffic that is user-facing -- >> Chi-Yao Hong: Yes, yes. For example, you're moving some data, doing a copy from one site to another site for another purpose. That's more time sensitive than the background traffic, where you don't care whether it happens in two hours or three hours. >>: So this is from real measurements, correct? >> Chi-Yao Hong: Yes. >>: So I'm curious why the non-background and background traffic look somewhat correlated. In fact, some of the peaks seem to be very -- >> Chi-Yao Hong: It's a stacked graph, so it looks correlated. When there's a spike here, there's also a spike above it, because it moves up. If you just look at the blue part, it's actually flat. Yes, it's a somewhat unfortunate representation. All right. Sorry. So with the SDT architecture, if you are able to delay just the background traffic classes, you are actually able to deliver the same amount of traffic with just half the capacity, without delaying any non-background traffic. This is great, because you get a huge peak reduction, which means you can delay further deployment, or you can potentially accommodate more traffic in the network with the current deployment. All right. Another motivating example of why we need the SDT architecture in the inter-datacenter WAN is the inefficient forwarding we have today. Today, the protocol used to run the inter-datacenter WAN is MPLS-TE, multi-protocol label switching with traffic engineering. It's essentially a local greedy algorithm where every source router tries to find a tunnel that satisfies the bandwidth constraint along the shortest path to the destination. That does not give you the best solution; it does not let you run the network at high efficiency. Let me take this example to show you why. Suppose we have three flows, A, B and C, and assume each link can carry at most one flow. When flow A arrives, from node one to node six, MPLS-TE will choose a shortest path that satisfies the bandwidth requirement; there are two available paths here, and it just randomly picks one. Then flow B arrives and takes the second-shortest path, because the shortest one would interfere with flow A and the bandwidth requirement could not be met. And so does flow C — it ends up taking a very long path. On the right-hand side, we show that if you have the ability to globally coordinate your forwarding plane, you get a much better allocation, where most flows get lower latency and use fewer network resources, and capacity gets freed up — for example, this link gets freed up — so you can potentially accommodate more traffic and make your network more efficient.
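To make that local greedy behavior concrete, here is a minimal sketch of MPLS-TE-style tunnel selection: each flow, in arrival order, independently grabs a hop-count-shortest path that still has enough spare capacity. The 4-node topology, flow names and unit demands are hypothetical stand-ins, not the example on the slide; the final comment notes what a jointly computed allocation could do instead.

    from collections import deque

    def shortest_feasible_path(adj, capacity, src, dst, demand):
        # Hop-count-shortest path using only links with enough spare capacity (plain BFS).
        parent = {src: None}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if u == dst:
                path = []
                while u is not None:
                    path.append(u)
                    u = parent[u]
                return path[::-1]
            for v in adj[u]:
                if v not in parent and capacity[(u, v)] >= demand:
                    parent[v] = u
                    queue.append(v)
        return None  # no feasible tunnel left; the flow is rejected

    def greedy_te(adj, capacity, flows):
        # MPLS-TE-like behavior: flows are admitted one by one, each choosing greedily.
        paths = {}
        for name, src, dst, demand in flows:
            path = paths[name] = shortest_feasible_path(adj, capacity, src, dst, demand)
            if path:
                for u, v in zip(path, path[1:]):
                    capacity[(u, v)] -= demand   # reserve capacity along the chosen tunnel
        return paths

    # Hypothetical topology; every directed link can carry at most one unit flow.
    adj = {1: [2, 3], 2: [4, 3], 3: [4], 4: []}
    capacity = {(u, v): 1 for u, vs in adj.items() for v in vs}
    flows = [("A", 1, 4, 1), ("B", 2, 4, 1), ("C", 2, 3, 1)]
    print(greedy_te(adj, capacity, flows))
    # Greedy result: A=[1,2,4], B=[2,3,4], C rejected. A joint allocation (A via 1-3-4,
    # B direct on 2-4, C on 2-3) would fit all three flows on the same capacity.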
>>: Is your optimal assuming that you know the future — that you are able to know what the other flows are going to be? >> Chi-Yao Hong: Yes. In this context, for background traffic, for example, I can even control how fast they send, so it's not just that I know the demand; I can also control the traffic matrix. That makes the whole design more powerful: we know the traffic matrix, we can even control it, we know the current forwarding plane, and we can coordinate the forwarding plane to optimize the network. >>: What would the optimality criterion be here? All flows are getting their bandwidth in both settings, right? So is it delay you're optimizing for? What are you optimizing for? >> Chi-Yao Hong: Both delay and the amount of resources you use. Here, as you can see, the number of links used is much smaller, so potentially you can -- >>: [Indiscernible]. >> Chi-Yao Hong: The link is barely utilized. >>: So you want to minimize the number of links that are utilized? Is that the idea? >> Chi-Yao Hong: This is just one example, to show how much an efficient allocation can give you. In this particular example, what we're trying to do is minimize the path length, and that gives you two things. One is latency. The second is that the fewer path links you use, the fewer network resources you use. >>: So latency I understand, and maybe this leads to what [indiscernible]. Why do you want to use -- you already paid for these network resources, and they're there. So unless you are waiting for future traffic demand. >> Chi-Yao Hong: Yes, in this case, there could be other future demands you could not accommodate, because given the current allocation you cannot satisfy them, so they get rejected; they are not admitted. But once you change the network forwarding plane here, potentially you can accommodate a new flow, say, coming from here to here, and also from here to here. You can run the network more efficiently, in the sense of the total throughput you can support in the same network, without adding new capacity. All right. So next, I'm going to tell you how we applied the SDT architecture to the inter-datacenter WAN and what key challenges we faced. One particular challenge is that we want to run the network more efficiently in the sense that we want to globally coordinate the network forwarding plane and the sending rates on a very short timescale — say, every five minutes, I want to update the network. And one key challenge is, hey, the inter-datacenter WAN has lots of flows and lots of traffic; can we scale up to schedule so many flows, and how do we come up with a scalable system design to make this architecture feasible? There are a couple of solutions we applied. One is to organize the control as a hierarchy. This is not a design with one controller that talks directly to every server — that won't scale; that won't be practical. We have multiple layers: for example, between the high-level controller and the end hosts, we have service agents nested in between, which take the traffic requests from the end hosts, aggregate the demand requests, and then send them to the central controller. So we use multiple layers to make the system more scalable, as the sketch below illustrates.
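A minimal sketch of that hierarchical demand collection follows. The ServiceBroker and CentralController names, the grouping key of (source site, destination site, priority), and the numbers are illustrative assumptions based on the description here and the aggregation discussed next, not the production interfaces.

    from collections import defaultdict

    class ServiceBroker:
        """Per-service agent: aggregates host-level demands before they reach the controller."""
        def __init__(self, service):
            self.service = service
            self.pending = []  # (src_site, dst_site, priority, bits_per_sec)

        def report_demand(self, src_site, dst_site, priority, rate):
            self.pending.append((src_site, dst_site, priority, rate))

        def aggregate(self):
            # Collapse many host-to-host demands into one demand per flow group.
            groups = defaultdict(float)
            for src, dst, prio, rate in self.pending:
                groups[(src, dst, prio)] += rate
            self.pending.clear()
            return {self.service: dict(groups)}

    class CentralController:
        """Top of the hierarchy: sees only per-service, per-site-pair aggregates."""
        def __init__(self):
            self.demands = {}

        def collect(self, broker_reports):
            for report in broker_reports:
                self.demands.update(report)
            return self.demands

    # Usage: many host flows collapse into one small entry at the controller.
    broker = ServiceBroker("index-replication")
    broker.report_demand("DC-A", "DC-B", priority=2, rate=3e9)
    broker.report_demand("DC-A", "DC-B", priority=2, rate=1e9)
    controller = CentralController()
    print(controller.collect([broker.aggregate()]))
    # {'index-replication': {('DC-A', 'DC-B', 2): 4000000000.0}}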
And the second technique is aggregation. If you look at the flow-scheduling problem the controller solves, we are not solving it at the physical layer. We are looking at an abstraction of the whole graph, where each node is a site — a datacenter, which is actually tens of switches — and each link could be hundreds of cables going from one DC site to another. So we're looking at the graph level, where the nodes are DCs and the links are the total capacity between two sites. We also divide flows into groups defined by the source, destination and priority tuple; if they share the same source, destination and priority — the same service requirement group — they are aggregated into the same flow for the TE computation. >>: And the source and destination there are IP addresses? >> Chi-Yao Hong: Here, it means a datacenter — a node. >>: Oh, datacenters. I see. >> Chi-Yao Hong: Yes, source node and destination node. So we are aggregating lots of actual TCP connections. >>: So how many switches do you have? >> Chi-Yao Hong: The target scale here is roughly 40 to 100 nodes, so you can do the math — a couple of hundred switches. >>: Even after the aggregation? >> Chi-Yao Hong: After the aggregation, the number of nodes will be around 40 to 100, yes. That's the -- >>: Without the aggregation, what would a single switch correspond to? >> Chi-Yao Hong: You have one order of magnitude more nodes. >>: But if this is across datacenters, then wouldn't you need to have one switch per datacenter? >> Chi-Yao Hong: No, we use multiple switches in parallel. >>: Okay, so you grouped those together, basically. >> Chi-Yao Hong: Yes. If they sit at the same site, we group them together. All right. And there's another scalability issue, about algorithms: how can we quickly compute the resource allocation? Essentially what we do is, we have multiple priority classes, and we allocate class by class. We first allocate resources to the highest-priority class of services, then take the remaining bandwidth and allocate it to the second-highest priority class, and so on. Within each class, there could be multiple flows competing for the resources, and what we want to maintain is weighted max-min fairness across the services within each class. >>: So I definitely understand the scalability challenge, but I was wondering, what's the bottleneck resource on this one controller? Is it the amount of memory it has? Is it the CPU speed? What is the thing that you actually run out of? This slide presents a list of techniques to let this scale, because without these techniques, the problems start. >> Chi-Yao Hong: Sure. >>: My question is, what would be the limiting resource on this one controller when it doesn't scale? >> Chi-Yao Hong: Sure. That depends on your design, of course. If the central controller talks to every server, then there is an obvious bottleneck at the controller, where you simply cannot take that many requests and compute in time. >>: Because I don't have enough RAM, or I don't have enough cycles to handle it? What is the -- >> Chi-Yao Hong: The key thing is, if you don't do the aggregation and you don't do the hierarchy, then what you are solving is essentially the physical layer of the whole network, where you have tens of thousands of links and potentially up to 1,000 switches.
Those are very large networks to consider, and also on the server side, everyone has a flow — each flow goes from one IP address to another — so you potentially have tens of millions of flows. You essentially cannot solve that in five minutes; it's very hard to do while providing fairness. We'll talk more about how to push this scalability limit toward finer granularity later in the talk, but here we do aggregation for scalability. All right. So computing max-min fairness is something we found very hard in practice: today's solutions take up to minutes at our target scale, a 50-to-100-node network, and that's too long to be useful. So what we do is approximation. We don't want to find the exact solution; we want an approximate one. What we do is divide the demands into multiple stages based on the amount of demand, and within each stage, we have an upper and lower bound on the rate we can allocate to a flow. For example, in the first stage, the lower bound is zero and the upper bound is alpha. Then we run a commodity MCF, multi-commodity flow, solver. The standard solver computes the max flow for the multi-commodity demands, while we give preference to shorter paths. After this, we get an allocated rate for each flow — this is how fast it can send, given the constraint of the stage's upper and lower bounds. If a flow gets saturated, which means it cannot get what it wants in this stage, then we freeze it: we fix its sending rate at this stage. It will still participate in the next stages of the computation; it's just that its rate is fixed. Yes, Peter. >>: This you only do for the background flows, right? >> Chi-Yao Hong: It depends on which class gets congested. You have multiple classes; if in the highest-priority class everything can be satisfied, you don't have to worry about this. This only matters when congestion happens and you cannot satisfy everyone — that's where fairness becomes an important issue. Yes. >>: So priority one, they get everything. Then you run priority two to get fairness on the remaining capacity. >> Chi-Yao Hong: Yes. If they also get everything they want, then we don't have to run this; only when a class is not fully satisfied. >>: So the assumption is that — because you have priority in the network — if there is a spike in the high-priority traffic. >> Chi-Yao Hong: It uses the priority-two -- >>: The capacity. But then you recompute, hoping that it doesn't change too much in those five minutes. >> Chi-Yao Hong: That's correct. >>: If your high-priority traffic oscillates, then this might not work so well. >> Chi-Yao Hong: Yes, yes, so we adapt to that, and what we observe in our evaluation is that it's mostly predictable over a five-minute timeframe. >>: So can you explain what this phrase, saturated flow, means, because I'm [indiscernible]? >> Chi-Yao Hong: Sure. You do the allocation — you compute the rate with the MCF solver — and then you have two upper bounds: one is the flow's demand, and the other is the stage's upper bound. Take the minimum; that's the real upper bound. If you hit that upper bound, then you are not saturated.
Otherwise, you are saturated. Saturated means there is a bottleneck: either a link is saturated, or you're competing with other flows and you cannot ramp up to all you want, and that is the rate you eventually allocate to the flow. Okay? We run this across multiple stages, and eventually all the flows get saturated; those are the rates we allocate. Theoretically, we show this is an alpha-approximation algorithm, so you can explicitly trade performance for time by tuning the alpha parameter, and we show the allocated rate deviates from the exact max-min fair rate by at most a factor of alpha. Practically, with alpha equal to two — so in the worst case you could deviate by a factor of two — we found that most flows still deviate by less than 4%, and in the average case, empirically, the allocation comes very close to the fair rate. It takes only sub-seconds, compared to previous solutions that could take minutes to get exact solutions. So this is one result I want to show. The y-axis shows the relative deviation from the max-min fair rate, and we show two solutions: one is SDT with alpha equal to two, running the algorithm I just described, and the other is MPLS-TE's fairness notion. What we observe is that SDT comes very close to the max-min fair rate, while with MPLS-TE the flow rates can deviate from the max-min fair rate significantly and unboundedly. Yes? >>: [Indiscernible] non-background first — do they have any deadlines to meet, specifically? >> Chi-Yao Hong: Yes. It depends on the type of service. They could have deadlines — for example, some deadlines could be three hours, things like that. For now, we look at short-term resource scheduling and try to -- >>: [Indiscernible]. >> Chi-Yao Hong: Not really, not really. Here, what we look at is just the next five minutes and how we can efficiently schedule within them, based on some fairness notion. The hope is that if you get your fair rate, you are more likely to meet your deadline, in the sense that you won't get starved. If there's no fairness notion and I just maximize total network throughput, then a flow taking a very long path — which is bad for many other flows — is likely to get no resources at all. That's very unfair, and that flow will miss its deadline. So implicitly we are helping with the deadline problem, but for now we just look at short-term allocation, not long-term scheduling. That's something that could be very interesting to look at in the future. Very good point. Thanks.
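The staged allocation just described can be sketched roughly as follows. The solver is passed in as a stub, since the talk only says a standard multi-commodity flow (MCF) solver with a preference for shorter paths is used; the function and argument names are illustrative, not the actual implementation.

    def approx_max_min(flows, alpha, solve_mcf):
        """Stage-by-stage approximate max-min allocation (within a factor of alpha of exact).

        flows: dict flow_id -> demand.
        solve_mcf(active_bounds, frozen_rates): stub for a standard MCF solver that maximizes
            throughput subject to per-flow (lower, upper) rate bounds, treats frozen flows as
            fixed background load, and prefers shorter paths.
        """
        frozen = {}                              # saturated flows and their final rates
        lower, upper = 0.0, alpha                # rate bounds for the first stage
        active = dict(flows)
        while active:
            bounds = {f: (lower, min(upper, d)) for f, d in active.items()}
            rates = solve_mcf(bounds, frozen)
            for f, (lo, cap) in list(bounds.items()):
                if rates[f] < cap:               # saturated: bottlenecked before its upper bound
                    frozen[f] = rates[f]         # freeze; it still occupies capacity later on
                    del active[f]
                elif cap >= active[f]:           # demand fully met within this stage's bounds
                    frozen[f] = rates[f]
                    del active[f]
            lower, upper = upper, upper * alpha  # next stage: geometrically growing bounds
        return frozen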
All right, let's move on. Another interesting challenge I want to mention is how we do congestion-free updates in the network. Think about what we are doing: unlike MPLS-TE, we are making global forwarding plane changes, global coordination. That sounds a little scary, because you're changing lots of tunnels and shifting a lot of traffic from one place to another, and if we don't do this carefully, severe transient congestion could happen during the network transitions. So the question is how to update the forwarding plane without causing transient congestion. Here is one example of why this is a tricky problem. We have two flows in the network, A and B, and each link can carry at most one flow — that's the assumption here. We want to move from the initial state on the left-hand side to the target state on the right-hand side. The key issue is that the network is not a single machine: you cannot change all the flows in an atomic fashion, so sometimes a change lands at one place before the others. If flow A gets moved first, you have transient congestion on this link. In the other case, if flow B gets moved first, you still see congestion. So, essentially, there is no feasible solution in this example — no update order that does not violate the bandwidth constraints during the transition. So what do we do? The solution we take is to leave a small amount of scratch capacity as slack on every link. Say you leave a scratch capacity S — for example, one-third of the link capacity. Then all flows can take up to two-thirds of the total capacity, and you leave one-third for updates. Now there is a feasible solution I can easily show: first, we move half of flow B to the top path, then we move all of flow A to the bottom path, then we move the rest of flow B to the top, and we're done. At no stage does congestion happen. So this is great, but it's just one example, and we want to ask: does this slack always guarantee that a congestion-free update sequence exists? We prove that, yes, if you leave a slack S on every link, there is a congestion-free update sequence within 1/S minus 1 steps, where each step can consist of multiple updates whose order can be arbitrary. So it exists — but what algorithm do we use to find it? We run a linear-programming-based solution, where the key variable, call it b(i,j,s), is the rate of flow i on tunnel j at step s — a flow can be split across multiple tunnels. The input is b(i,j,0), the initial state, the initial forwarding plane, and b(i,j,K), the target state, and we want to find the intermediate states across the stages from zero to K such that no congestion happens at any stage. The key constraint — the congestion-free constraint — protects against the worst-case scenario, which is this: look at every tunnel crossing a link; some tunnels increase their rate and some decrease. The worst case is that all the tunnels whose rate increases have already been updated, so they are sending at the higher rate, while all the tunnels whose rate decreases have not been updated yet. We write the constraint so that even in that case the link is not overloaded, and with that, we ensure no congestion happens in the network. We also show that this gives us at most as many stages as the bound above. So this is great: no congestion will happen in the network, just by leaving a small amount of scratch capacity, as the check sketched below illustrates.
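A minimal sketch of that worst-case check follows, assuming rates[s][(flow, tunnel)] holds the planned rate b(i,j,s) at step s and tunnels_on_link lists which (flow, tunnel) pairs cross each link. These structures are illustrative, not the actual LP formulation, which enforces the same inequality as a constraint rather than checking it after the fact.

    def transition_is_congestion_free(rates, tunnels_on_link, capacity):
        """Check the worst-case interleaving between every pair of consecutive steps.

        rates: list of dicts, rates[s][(flow, tunnel)] = planned rate at step s.
        tunnels_on_link: dict link -> list of (flow, tunnel) pairs traversing that link.
        capacity: dict link -> link capacity.
        Between step s and s+1, a tunnel whose rate increases may already be at the new rate
        while a tunnel whose rate decreases may still be at the old rate, so each link must
        be able to absorb max(old, new) summed over its tunnels.
        """
        for s in range(len(rates) - 1):
            old, new = rates[s], rates[s + 1]
            for link, pairs in tunnels_on_link.items():
                worst = sum(max(old.get(p, 0.0), new.get(p, 0.0)) for p in pairs)
                if worst > capacity[link]:
                    return False   # some interleaving of updates could overload this link
        return True

    # Tiny hypothetical example: one link of capacity 1, flow A moving off it while B moves on.
    rates = [{("A", "t1"): 0.6, ("B", "t1"): 0.0},
             {("A", "t1"): 0.0, ("B", "t1"): 0.6}]
    print(transition_is_congestion_free(rates, {"L": [("A", "t1"), ("B", "t1")]}, {"L": 1.0}))
    # False: worst case 0.6 + 0.6 = 1.2 > 1.0, so this plan needs an intermediate stage.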
But even 10% scratch capacity is a waste of resources. It's lots of money. You don't want to waste it; you don't want to leave 10% unused in normal times, because that's not efficient. So what we do is classify traffic into different classes and treat them differently, so we can utilize all the network capacity. For the non-background traffic, we want to ensure it is always delivered in time, so we want congestion-free updates for it; when we do the allocation, we allow it to use only up to 90% of the total capacity — in other words, we leave a 10% slack for it. For the background traffic, the lower-priority classes, we allow them to use all the capacity in the network, so nothing is wasted, while we ensure that congestion can only hit the background classes, because of the priority queues inside the network. I want to briefly show you the evaluation prototype we built: we have 16 OpenFlow switches, a BigSwitch OpenFlow controller, and servers and routers, and we do both prototype evaluations, to look at packet-level behavior and see how much congestion happens during updates, and data-driven evaluations, using production inter-datacenter WAN traces at that scale, to see how the design scales up. Let me give a quick overview of the system workflow. In normal operation, we periodically collect the flow demands from the end hosts — the servers inside the DC — about the traffic they want to send to other datacenters, and we also collect the topology and traffic information from the SDN controller. Based on that, we periodically compute the resource allocation: how much rate we want to allocate and what forwarding plane changes we want to make. And if there is enough gain to be worth an actual network update, this is what we do. We first compute the congestion-free update plan — the plan to move through multiple stages to update the network without causing congestion. Okay, so yes? >>: You said for scalability you do aggregation. >> Chi-Yao Hong: Yes. >>: How does this show up here? Is it that the host controller is distributed? It's collecting inputs from the hosts? >> Chi-Yao Hong: This is an abstract picture, so the hierarchy is not shown; in fact, the hierarchy is up here. >>: Is the hierarchy only in the host controller, or also in the SDN controller? >> Chi-Yao Hong: In our current implementation, the host controller has two layers and the SDN controller has just one layer, and that's the scale we think we can handle at today's size. All right. As the next step, you notify the services whose allocation decreased that they can slow down now — because we want to start the network update, you first slow down those who got a decreased allocation. Once this is done, you do the actual work of changing the network forwarding plane — changing the load-balancing fractions across the different tunnels — and this can take multiple steps, of course. Once the network reconfiguration is done, you notify the rest of the services, those whose rate increased, that they can start sending faster, and you are done with this round; then you go back to the first step, periodically recompute against the service requirements, and see if there is enough gain to do another update. This whole loop is sketched below.
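A rough sketch of that control loop, with hypothetical helper calls (collect_demands, collect_topology, compute_allocation, plan_congestion_free_update and the apply methods) standing in for the components described above; it only illustrates the ordering — slow down the decreased senders, reconfigure in congestion-free stages, then speed up the increased senders.

    import time

    UPDATE_PERIOD_SEC = 300  # the talk's example cadence: recompute every five minutes

    def control_loop(host_ctrl, sdn_ctrl, optimizer, min_gain):
        while True:
            demands = host_ctrl.collect_demands()       # aggregated per (src DC, dst DC, priority)
            topology = sdn_ctrl.collect_topology()      # site-level graph and link capacities
            new_alloc, gain = optimizer.compute_allocation(demands, topology)
            if gain >= min_gain:
                # Stage the transition so no link is overloaded at any point in time.
                stages = optimizer.plan_congestion_free_update(sdn_ctrl.current_state(), new_alloc)
                host_ctrl.apply_rate_limits(new_alloc, only="decreased")  # slow these down first
                for stage in stages:
                    sdn_ctrl.apply_forwarding_changes(stage)              # tunnel split fractions
                host_ctrl.apply_rate_limits(new_alloc, only="increased")  # now safe to speed up
            time.sleep(UPDATE_PERIOD_SEC)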
So I just want to quickly give some evaluation results. We looked at the total throughput, the network-wide aggregate across all flows, relative to an optimal solution — an oracle that gets the flow demands with zero delay and can change the network forwarding plane with zero delay as well, so there is no transient congestion and no scalability concern. That's the optimum we try to compete with, and we show that the SDT design gets near-optimal solutions in this context — 99% of optimal in this particular setting. Compared with MPLS-TE, today's practice, we can carry 60% more traffic. To decouple the benefits a little bit: if we don't have the ability to do end-host rate control, we still get 20% more traffic compared to MPLS-TE, which is interesting for the case where you don't control the edge — you still get a reasonable amount of benefit from this architecture. >>: I think you mentioned this in the methodology — what topology did you use? >> Chi-Yao Hong: This uses the real topology of today's inter-DC WAN. Yes. Another result I want to show you is the congestion-free update. The y-axis shows the complementary CDF across all the update cases across different links, and the x-axis shows how much traffic was overloaded at a bottleneck link — the additional traffic you cannot carry because the interface is overloaded. Look at the one-shot update first. What we call a one-shot update is: you don't care about update order, you just issue all the updates at the same time and see what happens in the network. What's more interesting is the non-background class, where even user-facing traffic can easily get overloaded in some cases — we see up to 15 megabytes. That is the additional buffer you would need, and that's something commodity switches don't have, so in practice, in the testbed evaluation, we also see a huge throughput drop for the interactive traffic because of this transient congestion. For SDT, we see much better performance: the non-background traffic is completely congestion free, so there is no line here, and even the background traffic does much better than the one-shot update. All right, any questions before I move on to the next part? >>: At the time you presented this at SIGCOMM, there was a presentation from Google, as well, right? >> Chi-Yao Hong: Yes. >>: Could you contrast this work with that? >> Chi-Yao Hong: B4, right? Essentially, we have a very different target than B4. We share some high-level architecture with B4: the common part is that we both build a software-defined architecture and try to make the network run more efficiently and satisfy service requirements. The difference is that Google runs two WANs, and B4 operates the one with less user-facing, interactive traffic. One key challenge we solve here is how to ensure the interactive traffic always gets through; that is less of a concern in their design, and they don't address it as much — for example, the congestion-free update I just mentioned, and how to protect the interactive traffic in the network while at the same time driving utilization toward 100%.
That's the hard part we solve that they didn't. >>: [Indiscernible]. >> Chi-Yao Hong: Yes, essentially, their user-facing traffic goes on another WAN, and arguably that one does not run at 100% utilization. The smart idea here is that you can actually mix them together: by shaping only the background traffic, you do much better — you get high utilization, and at the same time you can still protect the service requirements of the interactive traffic. Yes. >>: Another question — I know you don't have an answer here, but how does this [indiscernible]? >> Chi-Yao Hong: Oh, you want to ask about the differences? >>: [Indiscernible] update? >> Chi-Yao Hong: The focus is a little bit different: we are able to do rate control here, the context is a wide-area network, and we are able to shape the background traffic. >>: I have a question about the trend. If you project five years from now, how is this problem going to change? Are we going to have more datacenters? Are we going to have more traffic running between the datacenters? Are we going to be able to scale better with just a centralized solution? How will things change, projecting into the future? >> Chi-Yao Hong: Sure. Of course, we will have more datacenters — people are building new datacenters — and you expect more traffic and higher capacity in the network, as well. But the fundamental challenge is still there: those high-capacity links across continents are still very expensive, and you can't afford to heavily over-provision them, and that's why we still need this architecture. The workload can change a little and the network scale will increase; those are things we try to take into account in this design, as well. >>: Do you think, as we have more and more datacenters, does it make sense to statically partition them and say some of them are going to handle user-facing traffic and some won't? >> Chi-Yao Hong: That could be one solution, but it would be less efficient, in the sense that hard isolation costs you efficiency. Yes. All right, how much time have I got? 11:27. So how much time should I take — should I stop 10 minutes from now? >>: Yes, [indiscernible], 11:45? >> Chi-Yao Hong: Twenty minutes, 10 minutes from now? >>: [Indiscernible]. >> Chi-Yao Hong: All right, all right. I still want to spend some time on my future plans, so I will try to be brief in this part because of the time limitation. This is ongoing work: I'm studying how to push the scalability limit further, to make central transport rate control more scalable and more real time. Essentially, what we see here is a tradeoff between scalability and flexibility. If you look at today's network transport protocols, they are mostly distributed and very scalable — you can do fine-grained, per-TCP-connection-level scheduling, things like that — but they are not very flexible. And SDT is quite flexible, but it's not that scalable; you cannot do fine-grained control. So the interesting question is whether we can push on both and see how far we can get toward fine-grained, large-scale control in real time. A couple of ideas I want to talk about. One, to be able to scale up to a large datacenter network, is flow differentiation.
We handle long flows and short flows in very different ways: long flows we handle centrally, while we let go of the control of most of the short flows in the network. The intuition is that the datacenter traffic distribution is very heavy tailed: most of the bytes are generated by long flows, while most of the flows in the network are short. So by doing this, we improve scalability by an order of magnitude while still controlling most of the bytes in the network. That's the rationale. >>: [Indiscernible] more data that's transferred? >> Chi-Yao Hong: Yes, that's the amount of data you want to send. >>: Do you always know that? >> Chi-Yao Hong: You don't know that in advance. One easy way to predict it is what I'll describe in a moment. Okay. So that's the central architecture: we have a logically centralized controller that controls the end hosts' transport sending rates. When a flow starts, we assume it's a short flow. It is initiated with whatever transport protocol you want to use; you don't have to talk to the transport controller — you just send, and if it's a short flow, it will finish quickly. In the network, we provision it to use high priority, so it doesn't have to compete with the long flows. Only when a flow lasts in the network — when it has sent more than a certain number of bytes — do we classify it as a long flow. At that point, it sends its flow demand — this is how fast I want to send — to the SDT controller. The SDT controller computes the resource allocation based on the transport policies the operator wants to enforce, like fairness and priorities. Once computed, the rates are sent back to the end hosts, and the end hosts do rate limiting to enforce the allocation. The flow then sends at the rate given by the central controller, and this can happen multiple times over time — if other flows come in, the rate can be updated. It also falls back to low priority in the network, so it no longer competes with the high-priority short flows. Yes. >>: What's your user model? >> Chi-Yao Hong: User model? >>: As in, are the users malicious, or [indiscernible]? >> Chi-Yao Hong: For now, we don't look at cases where users can game the system. That's something very interesting I'll talk about in the future plans, as well: users often have incentives to game the system, either by using multiple TCP connections or by misreporting their flows — splitting the traffic into multiple connections so each one looks short. That's something we want to deal with later. Another interesting idea to scale this up is that, if you use a centralized controller with limited computational power, the question is how fast we can recompute the resource allocation, and the answer is to run a parallel flow-rate computation algorithm: use multiple threads to leverage today's CPU architecture. So yes. >>: Can you think of the [indiscernible] as essentially [indiscernible], or is it a different TE with different code? >> Chi-Yao Hong: It's more like a transport rate control target we're looking at here. >>: [Indiscernible] variables, and what's the goal? >> Chi-Yao Hong: The goal can be specified by the operator. A couple of things in our current implementation: flow-level prioritization — assuming you can emulate an infinite number of priority queues — and also weighted max-min fairness, by emulating fair queueing. Yes. A sketch of the flow differentiation follows.
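A minimal end-host-side sketch of that long/short flow split; the byte threshold, the class and method names, and the controller call are illustrative assumptions, not the actual implementation.

    LONG_FLOW_BYTES = 1 << 20   # hypothetical threshold: beyond ~1 MB a flow is treated as long

    class FlowState:
        """Tracks one outgoing flow at the end host and escalates it to the central controller."""
        def __init__(self, flow_id, controller):
            self.flow_id = flow_id
            self.controller = controller
            self.bytes_sent = 0
            self.is_long = False
            self.rate_limit = None          # None: uncontrolled short flow, sent at high priority
            self.priority = "high"

        def on_bytes_sent(self, nbytes, demand_bps):
            self.bytes_sent += nbytes
            if not self.is_long and self.bytes_sent > LONG_FLOW_BYTES:
                # The flow has proven itself long: register its demand, drop to low priority,
                # and wait for an explicit rate from the central controller.
                self.is_long = True
                self.priority = "low"
                self.rate_limit = self.controller.register_demand(self.flow_id, demand_bps)

        def on_rate_update(self, new_rate_bps):
            # The controller re-allocates when competing long flows arrive or leave.
            self.rate_limit = new_rate_bps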
All right. So the idea — I'm running out of time, so I'll try to be brief — is to run a flow-level simulation at the controller to compute how much rate to allocate to each flow. Let me give you one simple example. This happens inside the controller, where you have a view of the whole network and the input flow demands. You know the path each flow takes, and you want to decide how much rate to allocate to it. Each link is handled by one thread, which does the resource computation — fairness or prioritization, for example — based on the input flow rates, how many flows come into this link, and the configuration from the operator about which policy to enforce, and it decides how fast each flow can send. Suppose there is some fairness concern here — say, the red flow should get higher throughput — and the link allocates 0.7 and 0.3. It then hands this off as the output flow rates, which become the input flow rates for the next set of links. Eventually, the rates propagate until the results reach the destinations, and those are the flow rates you allocate and tell the actual sources to enforce. >>: Do you have to synchronize the rates across the links along a flow? >> Chi-Yao Hong: Yes, interesting question. For each link, it's essentially multiple writers and one reader, so to be safe, the common approach is to put a mutex around it to protect against concurrency issues. But what we use instead is a dirty bit. The bit means that if the link's input got updated, you have to recompute the allocation; otherwise, if the dirty bit is zero, you don't have to recompute it, because it's clean, and you avoid unnecessary computation. And because it's just one bit, you can easily flip it atomically. We found that if you clear the bit before recomputing, and the writers mark it after they update, we can get away without a mutex, and we showed theoretically that you won't get into a bad state — you never have the case where the link is actually dirty but is marked clean, so that it never gets recomputed. Okay. So this is the prototype we built for evaluation at UIUC: we used 13 OpenFlow Pronto switches and servers that can drive up to 112 gigabits per second, the Floodlight OpenFlow controller, and 112 Linux machines as servers, using tc for rate limiting and iptables for packet marking. Due to time, I will show you just one key result, about the controller scalability. The x-axis shows the network size, and here we use a traffic distribution measured from a datacenter workload, where we classify flows by flow size; we then see, for a given network scale, what minimum control interval we can support using a single desktop. The y-axis shows the control interval on a log scale, under the assumption that we handle only the long flows, given the threshold there.
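The dirty-bit scheme described a moment ago can be sketched roughly as follows, under the talk's single-reader, multiple-writers assumption; the class, its fields and the simplified per-link equal-share rule are illustrative, not the actual code.

    class LinkState:
        """One network link, recomputed by one reader thread; upstream links are the writers."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.input_rates = {}       # flow_id -> offered rate from upstream
            self.output_rates = {}      # flow_id -> rate this link grants
            self.dirty = True           # True: inputs changed since the last recomputation

        def write_input(self, flow_id, rate):
            # Writer side (an upstream link's thread): update first, mark dirty *after* the
            # write, so the reader can never end up clean with an unread update behind it.
            self.input_rates[flow_id] = rate
            self.dirty = True

        def maybe_recompute(self):
            # Reader side (this link's thread): clear the bit *before* reading the inputs.
            if not self.dirty:
                return False            # clean: skip the work
            self.dirty = False
            snapshot = dict(self.input_rates)
            share = self.capacity / max(1, len(snapshot))
            # Simplified fairness: cap every flow at an equal share of the link capacity.
            self.output_rates = {f: min(r, share) for f, r in snapshot.items()}
            return True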
Yes? >>: This [indiscernible] — I'm a little bit surprised that you stop at four threads. >> Chi-Yao Hong: Oh, that's because our current desktop only has four cores, and that's why we don't see a big improvement after four threads, but we are keen to test it on other machines, as well. For now, we don't really know what the limitation is; it just looks like we get linear scale-up for the first four threads, which looks promising. Yes. So that's the case I want to highlight here. Another observation is that, with only a single desktop, we already scale up to several thousand servers with sub-second control intervals, and at this range, we are still able to handle more than 96% of the total network bytes by letting go of most of the short flows in the network. >>: [Indiscernible] for the [indiscernible] updates for flows? Like, if you have a graph, rather than run a [indiscernible], you could fall back on combinatorial algorithms. Have you tried any of those, or compared with them? >> Chi-Yao Hong: What algorithm do you have in mind? >>: There are a lot — for example, the people who did [vague] configuration. I don't have a specific algorithm in mind, but it feels like there's a space of those -- >> Chi-Yao Hong: For now, we haven't compared with other algorithms. That's something it could be interesting to compare with. Yes. All right. Sure. >>: I have a question over here. If you look at where the y-axis equals one second, the red line — one thread — can do 2,000 servers. Is my reading correct? >> Chi-Yao Hong: Yes. >>: And the blue line can do how many servers? >> Chi-Yao Hong: Roughly 8,000. >>: So it's linear. You basically -- yes. >> Chi-Yao Hong: Yes. Is that the question? >>: Yes. It just feels to me like it's perfect, right? Four threads do four times the -- so I don't get the -- >> Chi-Yao Hong: You may get saturated later if you had more threads. Essentially, the current observation is that we have enough L2 cache, so most of the state fits in cache, and that's why multiple threads run things faster. >>: But it seems like there's no overhead from the synchronization of the threads anymore, because with four threads, you do four times the work of one thread, so it's -- >> Chi-Yao Hong: You may get saturated later, when the total state is larger. For now, the key intuition is that we can keep everything in cache, and also we don't use a mutex, so there's no blocking across threads. Yes. >>: It also depends on the pattern of the flows. If flows are fairly non-overlapping, you don't have that much contention. >> Chi-Yao Hong: Sure. Good point. I'm going to skip this demo — it's just one demo of how this works in the prototype evaluation — due to the time limitation. So I want to briefly talk about my future research plan, which is to extend the current design and build a cloud network operating system. What this is, essentially, is a whole collection of software that helps you manage the network resources in cloud networking and provides the right interfaces to the network operators, to help them more easily manage their network, run it more cost efficiently, and provide high performance. Okay.
So what this is telling you is, essentially, that we're replacing this part -- enriching this purple part -- to build a network operating system that does network resource management at this layer and then exposes an API to the network applications, users, and operators, so they don't have to worry about the underlying distributed system inside the cloud. They just express the requirements they need -- for example, the high-level policy they want to enforce in the network -- without worrying about details like how to manage switch memory or how to set up the forwarding paths. They don't have to worry about it. Yes. So this is the big direction and big vision, and I want to name a few more specific research directions I want to look into. One of them is to make resource management more efficient. Take virtual machine placement: today, people want agility in the network, so they don't have to worry about where they place a VM. But with the SDT architecture, you have great knowledge about the network, about the planned and current traffic, and also about the other current resources, so you are able to do a much better job of placing VMs so they don't compete with other VMs' traffic in the network, making the whole allocation more efficient. That's something I am very interested to look at. Another thing is middleboxes. People have pointed out that there are firewalls or even WAN optimizers you want to place in the network. At one extreme, you can place those middleboxes everywhere and pay a high cost. At the other extreme, you place just one central firewall, but then you have to redirect all the traffic to go through that firewall to make sure that traffic that does not satisfy the policy is blocked. So an interesting question is how to integrate this with our SDT architecture -- both the placement issue and the load configuration -- as a joint design with our current architecture, to make middlebox placement more efficient. Also network expansion, where to add new capacity. >>: Isn't that dynamic placement? >> Chi-Yao Hong: It could be dynamic, yes. It depends on the current workload, for example, and on how you implement it. If you implement it in software, it can be very flexible. >>: So it's similar to VM placement. >> Chi-Yao Hong: Similar, similar, yes. >>: Similar to VM placement, you [indiscernible] inside the VM. >> Chi-Yao Hong: So this one is more like changing the traffic matrix that is the input to the network resource optimization, and this one is more like adding a new constraint to the network when you do the optimization. They are all related, and people have been doing them in isolation, so what I propose is to look into this and build an operating system that solves all these questions jointly and provides the right interface to the users. This also manages switch resources. One interesting idea is to compress forwarding rules, because we have very limited memory, especially with [indiscernible], by merging rules on similar IP blocks (as sketched below). But compressing the rules may also complicate rule updates. Let me come back to updates first: how to make things run very efficiently while staying transparent to the network operators who want to add network transport policies. That's the key thing I want to look at.
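The rule-compression idea is only mentioned in passing, so the following is a toy illustration under the assumption that compression means merging sibling IP blocks that carry the same forwarding action into their parent prefix. The function name compress_rules is made up, and the sketch deliberately ignores longest-prefix-match interactions and the update complications noted above.

    import ipaddress

    def compress_rules(rules):
        """rules: dict mapping prefix strings (e.g. '10.0.0.0/24') to an
        action (e.g. a next hop).  Repeatedly merges sibling prefixes that
        carry the same action into their parent prefix.  A toy illustration
        only: overlapping rules and priorities are not modeled."""
        nets = {ipaddress.ip_network(p): a for p, a in rules.items()}
        changed = True
        while changed:
            changed = False
            for net, action in list(nets.items()):
                if net.prefixlen == 0 or net not in nets:
                    continue
                parent = net.supernet()
                siblings = list(parent.subnets())   # the two halves of the parent
                if all(s in nets and nets[s] == action for s in siblings):
                    for s in siblings:
                        del nets[s]
                    nets[parent] = action
                    changed = True
        return {str(n): a for n, a in nets.items()}

    # Example: two adjacent /24s with the same next hop collapse into one /23.
    print(compress_rules({
        "10.0.0.0/24": "port1",
        "10.0.1.0/24": "port1",
        "10.0.2.0/24": "port2",
    }))
    # -> {'10.0.2.0/24': 'port2', '10.0.0.0/23': 'port1'}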
All right, in the last two minutes I want to talk about another direction I want to look into, which is network performance versus application performance. If you look at networking today, many people try to optimize network-level performance metrics, such as throughput, packet delay, fairness, or network utilization. Those are useful for network operators, but applications care about something different. They don't care how fair it is; they don't care what the utilization is. What they care about more, for certain types of jobs, is when the job gets finished, gets completed, so they can send the result back to the users. So I want to look into how network-level improvements in current protocols translate to the application level. Do applications actually benefit from high performance in these network metrics? If we had a better understanding of this, it would be great, because it would help us decide the right abstraction to expose to the applications -- for example, what we want from the applications and what they really need. All right, and the last thing I want to briefly mention is designing incentive-compatible resource scheduling to make the whole network resource allocation more efficient. Essentially, users have incentives to game the network: they can claim to have infinite demand; they can claim, I've got the highest priority, this is very time critical; or they can create multiple connections and say each one is a short flow, for example -- trying to game the system to get more resources. That hurts the whole resource allocation, because the input is heavily biased, and no matter how good your resource allocation is, if the input is heavily biased, you cannot make very efficient decisions. So there are a couple of challenges here I want to look into, including how to do the mechanism design to provide the right incentives to the users, and also system design challenges like how to do real-time usage monitoring in a lightweight, passive fashion, and, for those applications that try to game the system, how to penalize them -- can we, say, discard their traffic or put them in low-priority queues? How do we do enforcement there? And we also want to ensure there is a worst-case performance guarantee for the cooperative services. If this can be done, it will be great, because essentially it encourages people to declare their traffic demand early, and that's also good for our network operating system, because we can make much better decisions based on that. Yes. All right, so if you don't mind, I'll take just one more minute to talk about the other related work I have done so far. I know I've gotten to meet many of you here, and if you're interested in any of these papers, we should talk in person. Those are the two papers I mentioned in most of this talk, and there's another paper where we show how to do preemptive, more fine-grained resource scheduling inside the datacenter network to complete flows faster and meet flow deadlines. And there's some early work I have done in the wireless domain, also a resource scheduling problem, where we look at how to do link scheduling under interference constraints, especially to improve spectrum efficiency.
And there are also network defense systems, where we did a couple of network security projects to build defensive systems that improve network security. The first one is called [Bitminer], where we look at interesting IP addresses that have lots of users behind them. Some of them are legitimate users behind a proxy, gateway, or middlebox, but some of them are created by attackers trying to abuse the system. Here, we do classification and show that, if you look at their network-level activities, we are actually able to catch lots of bad users who create this proxy-like behavior. Also, there's an anomaly detection project I have done, looking at trouble tickets and customer care tickets and trying to do anomaly detection on that traffic, using both hierarchical heavy-hitter techniques and time-series analysis. There's another early project, BotGrep, where we look at network communication traces and try to identify whether there are peer-to-peer bots in the network, using a random walk technique and clustering algorithms. There is a network management project I have done on understanding the root causes of Internet clock synchronization inaccuracy, where we identify that the largest inaccuracy component comes from the asymmetry of the paths packets take in the Internet, and we propose several solutions to compensate for that and improve accuracy. And there's also a datacenter topology design project, where we propose that you just connect your datacenter randomly and you get much higher total throughput than today's structured topologies like [indiscernible]. So I'm happy to take any questions, and this concludes my talk. Sorry, I ran two or three minutes late. If there are any questions, I'm happy to take them. Thank you. >>: I have one question. >> Chi-Yao Hong: Yes. >>: Yes, I'm just trying to understand your perspective on this. Everybody these days talks about building network operating systems. >> Chi-Yao Hong: Sure. >>: So as you were describing that, I was trying to parse what you think is different in your version of a network operating system versus how it's being talked about by tens of other people. >> Chi-Yao Hong: Yes, yes. Many people talk about, for example, building virtual topologies, building network isolation and performance guarantees, building nice abstractions exposed mostly to the cloud computing side, where you have customers you want to serve. For example, in Azure -- they have a very different focus, right? They look at how to make network programming easier if you want to set up web servers in a parallel fashion, how to integrate it with .NET, for example. Those are interesting components, but my main focus is on network resource scheduling. Essentially, what I want to look into is a little bit lower in the stack, where we try to make the whole network run more efficiently by having good resource allocation, including how to set up middleboxes, how to set up gateways, how to set up the right network forwarding paths, and how to set up end-host rate control. Those are the resource-related problems I have found especially interesting to look into. Does that answer the question?
So if you ask for a more specific question, a more specific project I want to look into, then those are the good cases, like where to place the middleboxes or where to add new capacity in the network. >>: Such as the [indiscernible] placement project -- there's work coming out of Cambridge that also talks about embedding VMs such that their requirements, network requirements, are met by an underlying topology. How would what you're suggesting be different? >> Chi-Yao Hong: The suggestion is that you should integrate that with the other interesting, important factors in network resource allocation. For example, you are able to change the network forwarding paths and load-balance across them, and that changes the solution quality a lot. If you look only at where to place your VMs to better match your current topology, you can do very limited things, right? So the proposal here is to solve the problem jointly and make wise decisions across the multiple resources in the network and the services, to provide a more globally efficient network. >>: On this issue of joint optimization -- it sounds like I'm repeating myself, but if you just look at operating systems, would you say the operating system scheduler is a layer that is aware of application performance, or is it an independent layer on a substrate? >> Chi-Yao Hong: So there are things you can decouple, right? And there are things you cannot decouple, for performance and efficiency reasons. If you look at operating systems, they do, say, job-level scheduling. Of course, they also do other things, like task scheduling, memory scheduling, and CPU scheduling, somewhat in isolation. Those are things you can decouple a little bit without losing too much efficiency, but the things we are looking into here are closely related and closely coupled. If you don't do very careful, say, VM placement, then essentially what you get is a bad traffic demand matrix as the input to the network. >>: But given the operating system [indiscernible], I could come up with scenarios where if I don't do joint memory and disk allocation, bad things will happen to application performance. >> Chi-Yao Hong: So people do just that. If you look at real-time operating systems, they care more about each job's service requirements, like deadlines, and they do more careful scheduling there -- but not in a commodity, normal PC, where jobs don't have such critical deadlines to meet. So if you put additional effort there, of course at a cost -- you have to get knowledge about the different components -- you can potentially do a much better job. Yes. >>: Let's thank the speaker again. >> Chi-Yao Hong: Thank you.