>> Jitu Padhye: It's my great pleasure to welcome Chi-Yao to give a talk. Many of you know
him already. He was the lead intern on the SWAN project last summer, and he has continued to
work in the area of software-defined networks.
>> Chi-Yao Hong: Thank you, Jitu. So let's get started. I'm Chi-Yao, currently a PhD student at UIUC, and today I'll talk about my system for software-defined transport; this is tied to the SWAN project and some extensions. So let's begin with a quote, for fun: "It is anticipated that the populous part of the United States will within two or three years be covered with a network like a spider's web." Anyone want to make a guess when this is from?
>> 1845?
>> Chi-Yao Hong: Who said that? Who said that? Okay, any guess? Any guess. Sixties, 1960s? Okay. Okay. That's pretty close. That's a very good answer, and this is about the electric telegraph, so this is just for fun. And the quote here essentially illustrates that people have designed network systems for hundreds of years already; it's just that in the past they were constrained in very different ways, right? For example, when people designed wide-area networks, they were constrained by where the population is and where, for example, even the railroad is, to deploy the network. And today, in this talk, I will talk about cloud networking. So what's
so special about cloud? Why are we redesigning the whole network for the cloud? Especially if you take a step back and look at today's cloud infrastructure: people have designed datacenter networks to interconnect hundreds of thousands of servers inside a DC, they have deployed multiple datacenter networks across the continents, and they have also built their own backbone. Today Google or Microsoft or Facebook have their own inter-datacenter, wide-area network backbone to deliver traffic from one site to another site. This picture shows Google's early deployment of their inter-DC WAN. So what's so special about this? Why do we need to have a different network for this? Why can't we just take today's network protocols and run them? So I see challenges and exciting opportunities here that I want to address. So the first is that it has been a very challenging task, because it's critical infrastructure.
Essentially, these are resources that everyone is depending on. You just can't fail. You must provide high performance; you want to meet the service requirements and make your customers happy. This is something that directly reflects your business requirements, and keeping your customers happy directly reflects on your revenue. So by doing a good job of optimizing the network, you are keeping your customers happy. And the second challenge is that, at the same time, these are extremely expensive resources. For example, Google spent more than $7 billion last year on their datacenter infrastructure, and more than 15% to 20% of the datacenter cost goes to network equipment. So by running your network more efficiently, optimizing it for better network resource usage, you are potentially saving lots of money there. You are making a huge impact
there. And at the same time, it's also a very exciting place to do research. We have central, unified control of the whole network, which we don't get in the traditional Internet. And that's great, because you can have a new network architecture physically deployed in these new places, and you can get a huge impact by deploying the new stuff. And also, this is a recent concept -- I don't need to describe SDN here at Microsoft -- but the SDN idea gives you flexible access to the switches' forwarding plane, making it much easier to change the forwarding behavior, and that's something that can be tied into this new architecture to be deployed. So
let me try to step back and look at how today's cloud networks are run and why that does not satisfy our goals. There are two key motivations: how to run the network with very high efficiency, because it's extremely expensive and you want to save money there; and, at the same time, it's also highly critical infrastructure, so you want to satisfy those user requirements. So why don't we just run what we have today? What we have today is a soup of protocols with knobs and dials you can configure, like TCP, BGP and so on and so forth. So, for example, routing protocols run across different switches, and end-host-based protocols run across different end hosts. So the key issue, we think, is that there's no clear, programmable API for those protocols. Essentially, if you want to, for example, optimize your network resources, or if you want to have some fairness constraints across services to maintain, for example, performance isolation across different tenants, there's no clear way for you to do that with today's protocols, and it's essentially hard to optimize your network with today's architecture. I think there are three key problems in this domain. So, one, they are not very
flexible. They are mostly monolithic, and they run their predefined algorithms. You have only a few knobs you can tune. If the algorithm they run is not what you want, you won't get the property you want. And, second, it's also very hard to reason about what performance today's network protocols will provide you. Essentially, there is a mismatch, a big gap, between what today's protocols can provide and the high-level transport
policies that network operators usually want to enforce. So, for example, as a multitenant
datacenter network operator, this is probably something you would like to have. For example, I
want tenant A to have some latency guarantee for all the services they are running across their
VMs, and I want -- for tenant B, I want some bandwidth guarantee, if they are running big data,
moving from one site to another site. And for the rest of tenants, I probably want something like
a mix of prioritization or fairness based on their payments, things like that. Those are the high-level policies you want to enforce, and you cannot tell today's protocols, hey, this is the high-level policy I want, just go and optimize the network for it. You simply cannot do that.
There is a big gap between what those protocols provide and the high-level policies operators want to enforce. So some people are smart. They can come up with their own algorithms and say, I know this is what I want, and I want to implement these functions to support it. But the issue is that it's still very hard to deploy. Implementing new protocols usually requires custom changes at end hosts or network switches or both, and that makes the time to market very long, so it's not that practical to deploy. So to fix
those issues, my vision is to make today's transport architecture more programmable. And we think this can serve as a killer application for optimizing network performance. So this is the architecture we propose, and we call it SDT, Software-Defined Transport. Let me try to
give you a quick view of the system flow and how it works. Essentially, you get a network
where the black dots here are switches and the blue dots are servers and machines. All right, this architecture is tied with SDN, and what SDN gives you is simply a thin interface to the switches' data plane that allows you to change the forwarding plane, and a logically centralized controller that talks to those devices and allows you to program the network forwarding plane. So what this gives you is essentially just low-level access to the network switches. We can change the routing table, we can change the forwarding plane. But what's the northbound API we need? This is not the northbound API we need. It's not a network optimization interface. It's not the right API we want to expose to the network operator to use. Essentially,
there's some rough consensus on which protocol to use between the SDN controller and the network devices, but there's very little consensus on what the right overall framework is. Essentially, what we need is an ecosystem that runs on top -- that tries to optimize the whole network with the goal of maximizing network performance and making the customers happier. So to solve this, there's another interesting, important building block we leverage in this architecture: a similar interface attached to the end hosts to control the servers' sending behavior. So we also run a parallel component here, which we call the host controller, that controls the sending behavior by both allocating rates to the end hosts and collecting the flow demands from the end hosts, to know their requirements on network usage. And on top of this, we have another layer we call the resource optimizer, which is the actual place where we run those interesting resource allocation algorithms -- such as how much rate each service can send and which paths they should choose, which paths they should load balance across -- based on the information given by the host controller about flow demands and also the topology and traffic information, the network-level information, from the SDN controller. And you also get the high-level utility function and transport policy from the network operators. All right. Any questions so far? So let me quickly give you a brief summary of
the key results when we apply this network -- when we apply this architecture to different types
of networks. Yes.
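To make the division of labor concrete, here is a minimal sketch of what the resource optimizer's northbound interface could look like. The names and data structures are hypothetical illustrations, not the actual SDT implementation.

```python
# Hypothetical sketch of the SDT resource optimizer interface described above;
# the names here are illustrative assumptions, not the real system's API.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class FlowDemand:                 # collected by the host controller from end hosts
    src: str                      # source site or host group
    dst: str                      # destination site or host group
    demand_bps: float             # how fast the service wants to send
    priority: int                 # service class (e.g., interactive vs. background)

@dataclass
class Allocation:                 # pushed back down after optimization
    rate_bps: float                               # rate limit enforced at the end hosts
    path_weights: Dict[Tuple[str, ...], float]    # traffic split across tunnels (paths)

def optimize(demands: List[FlowDemand],
             capacity: Dict[Tuple[str, str], float],   # link -> capacity, from the SDN controller
             policy: str) -> Dict[Tuple[str, str, int], Allocation]:
    """Run the operator's resource-allocation algorithm (fairness, prioritization, ...)
    and return a rate and weighted paths for each (src, dst, priority) group."""
    raise NotImplementedError     # concrete algorithms are sketched later in the talk
```

The host controller would feed aggregated demands into something like `optimize`, enforce the returned rates at the end hosts, and hand the path weights to the SDN controller to install in the switches.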
>>: Are you only looking at resource management policies, or are you looking at a richer set of policies? I don't know, like traffic should flow through two middle boxes, or traffic should never flow through China?
>> Chi-Yao Hong: Sure, sure. So for now, you can think of the whole architecture as what we proposed here; there's an opportunity here to implement support for other things, and at the end of this talk I will also talk about my future research plans. Of course, middle box placement is something very interesting and also highly related to resource allocation, where it can be tied and integrated with this architecture. So there are other things that can be discussed. Like, firewalls could be one thing that's very interesting.
>>: But for this talk, you're going to focus on resources.
>> Chi-Yao Hong: Exactly. All right. So let me try to summarize the results. When we applied this to the inter-datacenter WAN, we showed that this architecture could help us carry 60% more traffic than today's practice, and we also have congestion-free updates to ensure there's no congestion during the network transition, when you do global changes of the network forwarding plane. And we also showed that this can work with limited switch memory. Another scenario is that we applied this architecture to intra-datacenter networks, where we see a huge improvement by doing more fine-grained flow scheduling based on the software control framework. We can save 30% in mean flow completion time by doing a good job of optimizing the flow scheduling. And we also showed that this supports three times more deadline flows than today's practice. And we can also show some scalability results, where with just one single controller running on a desktop, we are already able to scale up to 10,000 servers with sub-second response time. Yes, Alec.
>>: What's a deadline flow?
>> Chi-Yao Hong: The deadline is essentially -- especially for many online services, they have this multi-stage behavior where a request comes in from the user. They want to respond within a certain deadline, and that breaks the operation into multiple stages, so they attach deadlines to flows: a flow has to complete in, say, 30 milliseconds, otherwise its results won't be integrated into the final result getting back to the user. And that's why the context here --
>>: In a way, they're based classifiers?
>> Chi-Yao Hong: Yes, usually the deadline can be specified by the service providers. Like, for example, if you're running a web search, you don't want your results getting back to the user too late, and there are certain studies that say if you are delayed by 100 milliseconds, you can drop your total revenue by 3% to 5%. Those are the Google and Amazon studies. That's why in the context of inside-datacenter resource scheduling, it's quite important to look at network-level flow scheduling deadlines, and those are what people try to satisfy.
>>: The way that this works is the network administrator will configure the deadline to the
controller?
>> Chi-Yao Hong: Yes, so flows are attached with deadlines, and you either meet the deadline,
or it's not going to be very useful. All right. So this is where those papers got published and the people I collaborated with -- many people here, I know. The second part was published at SIGCOMM 2012, about how to do resource scheduling efficiently. And the last part is my ongoing work on resource scheduling inside datacenter networks, and we have an early version published at the Open Networking Summit just a couple of weeks ago. All right, there are actually too many things I could talk about given the limited time, so I want to put my
focus on the following three things. One is when you apply this architecture to WAN, how to
optimize the WAN, how does this architecture help us to improve the WAN efficiency. That's
the big application I want to talk about. And the second thing is, when we apply this to the intra-datacenter network, how can we push the scalability toward more fine-grained, real-time resource scheduling, based on this software resource controller doing transport rate control? And
then finally, if we have time, I'll talk about my future work and other related work. Okay, so
let me try to give you two motivating examples about WAN, why are we looking at the WAN?
So this is essentially production inter-DC traffic, measured from a production WAN at Microsoft, and here we show the time series of the normalized traffic rate across roughly a day. Essentially, as you can see, traffic goes through peaks and valleys across time, and what we see here is today's practice: they are trying to protect the important traffic, so the important traffic will always get through, and to do so, they're provisioning for the peak. That's what we normalize the traffic rate against here, to make sure important traffic will always get through and won't get dropped or delayed. By doing so, the issue is the mean -- yes?
>>: Just in comparison, so the traffic can go to zero between two datacenters?
>> Chi-Yao Hong: It could. So, for example, some datacenters are just small clusters, or a site could be a transit point. They could have little demand toward the other places. That could be the case in some locations. Yes.
>>: This is normalized to the peak traffic.
>> Chi-Yao Hong: Yes.
>>: So what is the actual utilization scheme?
>> Chi-Yao Hong: Actual utilization? So actual utilization is even higher than this. I'm saying,
assume the ideal bandwidth provision for this, you already see the mean utilization for this link is
below 50%.
>>: So what I'm saying is, at your highest traffic rate, what is the actual utilization like?
>> Chi-Yao Hong: Highest utilization? So on average, it's like 30% to 40% network wide, across time, of course. So one thing to fix this is to look at the traffic characteristics inside the inter-datacenter network, where we have different classes of traffic going on. We have background traffic. That is replication traffic -- you know, replication, moving big chunks of indexes from one site to another site. Those sorts of things can be delayed for hours without compromising their service requirement. They are moving big data there. And another type of traffic we call non-background: user-facing, time-sensitive traffic. That is traffic related to the user -- user-triggered or user-related traffic. You don't want to delay them. You want to get -- yes, questions.
>>: How do you know that you can delay the background traffic? You claim that you can
definitely delay the background traffic. How do you know that's true?
>> Chi-Yao Hong: So this is for the network we run, for example, at Microsoft. We talked to the app developers to see what their real requirements are, and we tried to do traffic classification and tried to delay traffic for hours without compromising their requirements. So it depends on which type of service we are talking about. For some of them, if you deliver within a day, that would be okay. Yes. So for the different types of traffic, essentially, we talked to the apps, and we have control of the whole network. We have access to the network application developers, and that's how we get this information back. Yes.
>>: Was there a more concrete definition for what was background and what was labeled non-background?
>> Chi-Yao Hong: So one definition I can give is that non-background is user facing. That is what you don't want to delay. You don't want to rate control them; you don't want to delay them. For those, you have to adapt to them: you do passive measurement to look at how much demand they have sent so far and make predictions about what they will send in the future. They don't actively participate in this architecture. You can think of it like that.
>>: So the inter-datacenter traffic for user-facing --
>> Chi-Yao Hong: Yes, yes. So, for example, you're moving some data, like doing a copy from one site to another site for another purpose. That's something more time sensitive than the background traffic, where you don't care whether it happens in two hours or three hours.
>>: So this is from real measurements, correct?
>> Chi-Yao Hong: Yes.
>>: So I'm curious why the non-background and background traffic look somewhat correlated.
In fact, some of the peaks seem to be very --
>> Chi-Yao Hong: It's a stacked graph, so it looks correlated. If it were not stacked, it would be less likely to look correlated. Because of this, you see when there's a spike here, there's also a spike above it, because it moves up. So if you just look at the blue part, it's actually flat. Yes. It's a little bit of a bad representation; we have to think about why -- all right. Sorry. So in the SDT architecture, if you are able to delay
just the background traffic classes, you are actually able to deliver the same amount of traffic with just half the capacity, without even delaying any non-background-class traffic. So this is great, because we get a huge peak reduction, and that means you can defer further capacity deployment, or you can potentially accommodate more traffic in the network with the current deployment. All right. So another motivating example I want to show you, of why we need the SDT architecture inside the inter-datacenter WAN, is the inefficient forwarding today. So today, the protocol used to run the inter-datacenter WAN is called MPLS-TE, the multi-protocol label switching traffic engineering protocol. It's essentially a local greedy algorithm where every source router tries to find a tunnel that satisfies the bandwidth constraint with the shortest path to the destination. And that won't give you the best solution; that won't allow you to optimize the network with high efficiency. So let me take this example to show
you why. Suppose we have three flows, A, B, C, here, and assume each link can carry at most
one flow. And then, when flow A arrives from one to six, MPLS-TE will choose one shortest
path that satisfies the bandwidth requirement, so you have two available paths here, and then you
would just randomly pick one. And then flow B arrives, you would take the second-shortest path
here, because the shortest one will interfere with flow A, so the bandwidth requirement cannot be
met. And so does flow C; it ends up taking a very long path. While on the right-hand side, what we show is that if you have the ability to globally coordinate your forwarding plane, this is what you get, where a much better allocation can be made: most flows get lower latency, you use a smaller amount of network resources, and lots of capacity gets freed up. For example, this link gets freed up, so you can potentially accommodate more traffic in the network and make your network more efficient.
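As a small illustration of the local greedy behavior in this example (a sketch in the spirit of the slide, not the actual MPLS-TE implementation), the following code places each arriving flow on the shortest path that still has a free link; the outcome depends on arrival order, whereas a global coordinator optimizes over all flows jointly, which is how the better right-hand-side allocation is obtained.

```python
# Illustrative greedy placement, in the spirit of the three-flow example above:
# each link carries at most one flow, and flows are placed in arrival order.
from collections import deque

def shortest_free_path(adj, src, dst, used_links):
    """BFS over links not yet used; returns the fewest-hop path or None."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in adj.get(node, []):
            if nxt not in seen and frozenset((node, nxt)) not in used_links:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def greedy_place(adj, flows):
    """flows: list of (name, (src, dst)) in arrival order, like A, B, C above."""
    used, placement = set(), {}
    for name, (src, dst) in flows:
        path = shortest_free_path(adj, src, dst, used)
        placement[name] = path
        if path:
            used |= {frozenset(pair) for pair in zip(path, path[1:])}
    return placement
```

A global allocator would instead search over the joint placement of A, B and C (for example with a linear program), trading a slightly longer path for one flow against shorter paths and freed-up links overall.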
>>: Is your optimal assuming that you know the future, you are able to know what the other
flows are going to be?
>> Chi-Yao Hong: Yes, so in this context, for background traffic, for example, I can even control how fast they send. So it's not just that I know the demand; I can also control the traffic matrix, and that makes the whole design more powerful, in the sense that we know the traffic matrix, and we can even control it, and we know what the current forwarding plane is, and we can coordinate the forwarding plane to optimize the network.
>>: What would be the optimality criteria here? Because all flows are getting their bandwidth in both settings, right? So is it delay which you're optimizing for? What are you optimizing for?
>> Chi-Yao Hong: So both delay and also the amount of resources you use. Here, as you can see, the number below this is much smaller, so potentially you can --
>>: [Indiscernible].
>> Chi-Yao Hong: The link is carefully utilized.
>>: So you want to minimize the number of links that are utilized? Is that what the idea is?
>> Chi-Yao Hong: So this is just one example, to show you how much an efficient allocation can give. In this particular example, what we're trying to do is minimize the path length, and that gives you two things. One is latency. The second thing is that the fewer path links you use, the fewer network resources you use.
>>: So latency I understand, and maybe this leads to what [indiscernible]. Why do you want to
use -- you already paid for these network resources, and they're there. So unless you are waiting
for future traffic demand.
>> Chi-Yao Hong: So, yes, in this case, there could be other future demands you could not accommodate, because given the current allocation, you cannot satisfy those, so they all get rejected. They are not admitted. But once you change the network forwarding plane here, potentially you can accommodate a new flow, say, coming from here to here, and also one coming from here to here. Those are the ways you can run more efficiently, in the sense of the higher total demand you can satisfy, that you can provide in the same network, without adding new capacity to the network. All right. So next, I'm going to tell you how we
used the SDT architecture for the inter-datacenter WAN and what key challenges we faced there. So one particular challenge is that when we apply SDT to the inter-datacenter WAN, we want to run the network more efficiently, in the sense that we want to globally coordinate the network forwarding plane and also the sending rates in a very short time, say, every five minutes. Every five minutes, I want to update the network. And one key challenge is, hey, the inter-datacenter WAN has lots of flows and lots of traffic going over it. Can we scale up to schedule so many flows in the network, and how do we come up with a scalable system design
to make this architecture feasible? So there's a couple solutions we applied here. One is
essentially to manage the network as a hierarchy. So, essentially, this is not a design with one controller that talks to many servers. That won't scale. That won't be practical. Essentially, we have multiple layers here; for example, between the high-level controller and the end hosts, we have a service agent nested in between to get the traffic requests from the end hosts, aggregate the demand requests and then send them to the central controller -- essentially multiple layers, trying to make the system more scalable with hierarchy. And the second thing is, we do aggregation. So if you look at the flow-scheduling problem the controller is solving, we are not solving it at the physical layer. We are looking at an abstraction of the whole graph, where each node is a site, a datacenter, which is actually tens of switches, and each link could be hundreds of cables going from one DC site to another site. So we're looking just at the graph level, where the nodes are DCs and the links are the total capacity between two sites. And also, we divide flows into different groups, defined by the source, destination, priority tuples. So if they are sharing the same source and destination and the same priority, the same service requirement group, they will be aggregated as the same flow to be computed by the TE.
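A minimal sketch of that aggregation step, assuming a hypothetical list of per-connection demands; everything sharing the same (source DC, destination DC, priority) tuple is summed into one aggregate flow that the TE computation sees.

```python
from collections import defaultdict

def aggregate_demands(connection_demands):
    """connection_demands: iterable of (src_dc, dst_dc, priority, demand_bps).
    Returns one aggregate demand per (src_dc, dst_dc, priority) tuple, which is
    the granularity the TE computation works at."""
    groups = defaultdict(float)
    for src_dc, dst_dc, priority, demand_bps in connection_demands:
        groups[(src_dc, dst_dc, priority)] += demand_bps
    return dict(groups)

# Many server-to-server demands collapse into a few site-level aggregates:
demo = [("DC-A", "DC-B", 1, 2e9), ("DC-A", "DC-B", 1, 3e9), ("DC-A", "DC-C", 2, 1e9)]
assert aggregate_demands(demo) == {("DC-A", "DC-B", 1): 5e9, ("DC-A", "DC-C", 2): 1e9}
```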
>>: And the source and destination there use IP addresses?
>> Chi-Yao Hong: Here, it means datacenter or each node.
>>: Oh, datacenters. I see.
>> Chi-Yao Hong: Yes, source node and destination node. So we are aggregating lots of actual TCP connections there.
>>: So how many switches do you have?
>> Chi-Yao Hong: So the target scale here is number of nodes is roughly like 40 to 100, so you
can do the math, like couple hundreds.
>>: Even after the aggregation?
>> Chi-Yao Hong: After the aggregation, the number of nodes will be like 40 to 100, yes. That's the --
>>: Without the aggregation, a single switch would correspond to?
>> Chi-Yao Hong: You have one order of magnitude larger in terms of number of nodes.
>>: But if this across datacenters, then wouldn't you need to have one switch per datacenter?
>> Chi-Yao Hong: No, we use multiple switches in parallel.
>>: Okay, so you grouped those together, basically.
>> Chi-Yao Hong: Yes. If they sit in the same site, we group them together. All right. And there's another thing about the scalability issues; it's about algorithms: how can we quickly compute the resource allocation? So essentially what we do is, we have multiple priority classes, and we do things class by class. We first allocate resources to the highest-priority class and its services, and then take the remaining bandwidth and further allocate it to the second-highest-priority class, and so on. Within each class, there could be multiple flows competing for the resources, and what we do is maintain weighted max-min fairness for the services within each class.
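The class-by-class structure he describes can be sketched as below; the inner weighted max-min computation is discussed next, so here it is just a callback. This is an illustration of the idea, not the production code.

```python
def allocate_by_class(classes, remaining_capacity, allocate_within_class):
    """classes: flow groups ordered from highest to lowest priority.
    remaining_capacity: dict link -> spare capacity, reduced as classes are served.
    allocate_within_class: weighted max-min allocator for one class, returning
    {flow: (rate, links_used)} given the currently remaining capacity."""
    rates = {}
    for flow_class in classes:
        allocation = allocate_within_class(flow_class, remaining_capacity)
        for flow, (rate, links) in allocation.items():
            rates[flow] = rate
            for link in links:
                remaining_capacity[link] -= rate   # lower classes only see what is left
    return rates
```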
>>: So I definitely understand the scalability challenge, but I was wondering, what's the
bottleneck resource on this one controller? Is it the amount of memory it has? Is it the CPU
speed? What is the thing that you actually run out of in order to -- so this slide presents a list of
techniques to let this network scale, because without these techniques, then the problem starts.
>> Chi-Yao Hong: Sure.
>>: My question is, what would be the limiting resource on this one controller when it doesn't
scale?
>> Chi-Yao Hong: Sure. That depends on your design, of course. If you have, say, a central controller that talks to every server, then there is an obvious bottleneck at the controller, where you simply cannot take that many requests and compute in time.
>>: Because I don't have enough RAM, or I don't have enough cycles to handle? What is the --
>> Chi-Yao Hong: So the key thing is, if you don't do those aggregations, if you don't do those hierarchies, then what you are solving is essentially the physical layer of the whole network, where you have tens of thousands of links and potentially up to 1,000 switches. Those are essentially large networks you need to consider, and also on the server side -- everyone has a flow. Each flow goes from one server, one IP address, to another IP address, so you potentially have tens of millions of flows. You essentially cannot solve this in five minutes. That's essentially very hard to do in terms of providing fairness. We'll talk more about how to push this scalability limit toward finer grain later in the talk. But here we do aggregation just for scalability concerns. All right. So computing max-min fairness is
something we found very hard in practice, and today's solutions take up to minutes at our target scale, like a 50- to 100-node network, and that's essentially taking too long to be useful for us. So what we do is approximation. We don't want to find the exact solution; we want to find an approximate solution. And here what we do is, we divide all the demands into multiple stages based on the amount of demand they have, and within each stage, we have a certain upper and lower bound on the rate we can allocate to a flow. For example, in the first stage, the lower bound is zero and the upper bound is alpha. And then what we do is, we run this commodity MCF, or multi-commodity flow, solver. The standard solver will help you compute the max flow for the multi-commodity requirements, while we give preference to the shorter paths. So after this happens, we get an allocated rate for each flow: hey, this is how fast they can send, given the constraint of this upper and lower bound. If a flow gets saturated, which means it cannot get what it wants in this stage, then we consider those flows frozen. We will fix their sending rate at this stage. They will still participate in the next couple of stages of computation; it's just that their flow rate will be fixed. Yes, Peter.
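Here is a rough sketch of that staged approximation as I understand it. The `solve_mcf` call stands in for the standard multi-commodity flow solver (with preference for shorter paths); flows that cannot reach the stage's bound are considered saturated and frozen at their current rate for the remaining stages. The exact stage bounds and stopping condition are my assumptions for illustration.

```python
def approx_max_min(demands, solve_mcf, alpha=2.0, unit=1.0, max_stages=20):
    """demands: dict flow -> demand. solve_mcf(active, lo, hi, frozen) is assumed to
    return {flow: rate} with lo <= rate <= min(hi, demand), treating frozen flows'
    rates as fixed background load. Returns the final per-flow rates."""
    frozen, rates = {}, {f: 0.0 for f in demands}
    for k in range(max_stages):
        lo = 0.0 if k == 0 else unit * alpha ** (k - 1)   # stage lower bound
        hi = unit * alpha ** k                             # stage upper bound
        active = {f: d for f, d in demands.items() if f not in frozen}
        if not active:
            break
        for f, rate in solve_mcf(active, lo, hi, frozen).items():
            rates[f] = rate
            cap = min(hi, demands[f])       # real upper bound: stage cap or demand
            if rate < cap:                  # saturated: bottlenecked in the network
                frozen[f] = rate            # freeze for the remaining stages
            elif demands[f] <= hi:          # demand fully met: also done
                frozen[f] = rate
    return rates
```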
>>: This you only do for the background flows, right?
>> Chi-Yao Hong: So this depends on which class you get congested. So, for example, you
have multiple classes. Then in the first, highest-priority classes, there is no -- everything can be
satisfied, so you don't have to worry about this. This will happen only when the congestion
happens, you cannot satisfy all the people. That's where fairness becomes an important issue.
Yes.
>>: So the priority one, they get everything. Then you run priority two to get fairness on the
remaining capacity.
>> Chi-Yao Hong: Yes. If they also get everything they want, then we don't have to run this,
until it's not fully satisfied.
>>: So the assumption is that the -- because you have priority in the network, then if there is a
spike in the high-priority guys.
>> Chi-Yao Hong: You use the priority two.
>>: The capacity, but then you recompute, hoping that that doesn't change too much in those
five minutes.
>> Chi-Yao Hong: That's correct.
>>: If your high-priority traffic oscillates, then this might not work so well.
>> Chi-Yao Hong: Yes, yes, so we adapt to that, and what we observe in our evaluation is they
are mostly predictable in the five-minute short timeframe.
>>: So can you explain what this phrasing, saturated flow rate means, because I'm
[indiscernible]?
>> Chi-Yao Hong: Sure. So you do the allocation. You compute the rate based on the MCF solver, and then you have two upper bounds here. One is the flow demand, and the other one is the stage upper bound. Look at the minimum; that's the real upper bound you get. If you hit that upper bound, then you are not saturated. Otherwise, you are saturated. Saturated means there is a bottleneck in the network, either because a link is saturated or because you're competing with the other flows and you can't ramp up as much as you want, and that's the rate you eventually allocate to that flow. Okay? And then we run this, of course, across multiple stages, and eventually all the flows get saturated. Those are the rates we're going to allocate to the flows. And, theoretically, we show that this is an alpha-approximation algorithm, so you can explicitly trade performance for time by tuning the alpha parameter here. And we show the allocated rate deviates from the max-min fair rate by at most a factor of alpha. And practically, we found that in the average case, with alpha equal to two, we have a 2-approximation: in the worst case, you can deviate by a factor of two. But we found most flows still deviated by less than 4%, and in the average case, empirically, we found it comes very close to the fair rate. It takes only sub-seconds, as compared to previous solutions, where it could take minutes to get exact solutions.
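One way to write down the guarantee he just stated (my formalization of the "at most a factor of alpha" claim, not a formula quoted from a paper): if $r_f$ is the rate the staged algorithm allocates to flow $f$ and $r_f^{*}$ is its exact max-min fair rate, then

\[ \frac{r_f^{*}}{\alpha} \;\le\; r_f \;\le\; \alpha\, r_f^{*} \qquad \text{for every flow } f, \]

so pushing $\alpha$ toward 1 tightens the approximation at the cost of more stages and more computation time, which is the performance-for-time knob mentioned above.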
So this is one result I want to show. The y-axis here shows the relative deviation from the max-min fair rate, and we show two solutions. One is SDT with alpha equal to two, running the algorithm I told you about, and the other is MPLS-TE's fairness notion. And the thing we observe is that SDT comes very close to the max-min fair rate, while with MPLS-TE the flow rate can deviate from the max-min fair rate significantly and unboundedly. Yes?
>>: [Indiscernible] non-background first, do they have any deadlines to meet, specifically?
>> Chi-Yao Hong: Yes. It depends on the different types of service. They could have the [indiscernible]. So, for example, some deadlines could be three hours, things like that, and for now, currently, we look at short-term resource scheduling and try to --
>>: [Indiscernible].
>> Chi-Yao Hong: Not really, not really. Here, what we look at is just the next five minutes and how we can efficiently schedule them, based on some fairness. And the hope here is that if you've got your fair rate, more than likely you get to meet your deadline, in the sense that you won't get starved. If there's no fairness notion and I just try to maximize the network's total resources, then it's very likely that if you are sending a flow that takes a very long path, which is bad for many people, you won't get resources at all. You get something very unfair. You will miss your deadline. So implicitly, we are trying to solve that deadline problem, but for now we just look at the short-term allocation, not the long-term scheduling. That's something that could be very interesting to look at in the future. Very good point. Thanks. All right, let's move on. Another interesting
challenge I want to mention is how we do these congestion-free updates in the network. If you think about what we are doing -- unlike in MPLS-TE, we are doing global forwarding plane changes, global coordination. That sounds a little bit scary, because you're changing lots of tunnels, shifting traffic a lot from one side to another, and that's something we should be careful about, in the sense that if we don't do these things carefully, severe transient congestion could happen during the network transitions. So the question here is how to update the forwarding plane without causing transient congestion. So one example I
want to show why this is a tricky problem to solve: here I take an example where we have two flows in the network, A and B, and each link can carry at most one flow. That's the assumption here. We want to move from the initial state on the left-hand side to the target state on the right-hand side. So how would you do that? The key issue here is that the network itself is not just a single machine, right? You cannot change all the flows in an atomic fashion, so sometimes you can change a flow at one place but not at the other places. So if flow A gets moved first, then you have transient congestion at this link. In the other case, if flow B gets moved first, you will still see congestion. So, essentially, there is no feasible solution in this example I showed you -- no feasible solution that does not violate the bandwidth constraint during the network update. So what do we do? The solution we take here is to leave a small amount of scratch capacity as slack on every link. So what this is telling you is that you have, say, a small amount of scratch capacity S, say, one-third of the link capacity here. So all the flows can take up to two-thirds of the total capacity, and you leave one-third for updates. So now there's a feasible solution, which I can easily show. First, we move half of flow B to the top half, then we move all of flow A to the bottom half, and then we move the rest of flow B to the top half, and you're done. So at any stage, there's no congestion
happening. So this is great, but this is just one example, and we want to ask: does this slack always guarantee that a congestion-free update exists in the network? We prove that, yes, if you leave a slack S in the network, there will be a congestion-free update sequence within 1/S minus 1 steps, where each step here could consist of multiple updates; it's just that their order can be arbitrary. So how do we find it empirically? It exists, but what's the algorithm we use to find it? We run a linear-programming-based solution to find this, and the key rate variable here, which we call b_{i,j,s}, is the rate flow i should send on tunnel j at stage s -- a flow essentially takes multiple paths to run across. So the input we get is that b_{i,j,0} is the initial state, the initial forwarding plane, and b_{i,j,K} is the target state. And we want to find the intermediate states across multiple stages, from stage zero to stage K, and we want to make sure no congestion will happen during any stage. So the key constraint here, which we use to protect the congestion-free property, is the following. Essentially, it protects against the worst-case scenario. What's the worst-case scenario? If you look at every tunnel, its flow rate either increases or decreases, right? And the worst case you are going to protect against is that all the tunnels whose flow rate increases have already updated, so they have already increased their rate, while all of the ones whose flow rate decreases haven't updated yet. That is the worst-case scenario we protect against, and with that, we ensure no congestion will happen in the network. And we also showed that this gives us, at most, as many stages as the upper bound we derived. So this is great.
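Written out, the worst-case constraint he describes can be stated as follows; this is my reconstruction from the description above, using $b_{i,j,s}$ for the rate of flow $i$ on tunnel $j$ at stage $s$ and $c_e$ for the capacity of link $e$. For every link $e$ and every pair of consecutive stages $s$ and $s+1$:

\[ \sum_{(i,j)\;:\;\text{tunnel } j \text{ of flow } i \text{ crosses } e} \max\bigl(b_{i,j,s},\, b_{i,j,s+1}\bigr) \;\le\; c_e . \]

In words: even if every tunnel whose rate goes up has already switched to its new rate while every tunnel whose rate goes down has not yet, no link exceeds its capacity. Combined with the existence result, leaving a slack $S$ on every link means such a sequence exists with at most $1/S - 1$ intermediate steps.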
Now we know no congestion will happen in the network, as long as we leave a small amount of scratch capacity. But even 10% scratch capacity is a waste of resources. It's lots of money. You don't want to waste it. You don't want to leave 10% unused in normal times, because that's not efficient. So what we do is classify traffic into different classes, and then we do things
differently, to be able to utilize all the network capacity. So, for example, for non-background traffic, we want to ensure it will always be delivered in time, so we want congestion-free updates for it. So for that traffic, when we do allocation, we allow it to use only up to 90% of the total capacity in the network; in other words, we leave a 10% slack for it. While for the background traffic, which is the lower-priority class in the network, we allow it to use all the capacity in the network, so nothing is wasted. And we ensure that congestion will only happen to the background classes, because of the priority queue protection inside the network. So I want to briefly tell you about the evaluation prototype we built: we have 16
network. So I want to briefly tell you this is the evaluation prototype we built, and we have 16
OpenFlow switches, a BigSwitch OpenFlow controller and servers and routers here, and we do
both prototype evaluation, looking at packet-level behavior to see how much congestion happens during an update, and data-driven evaluation, taking production inter-datacenter WAN traces at around that scale to see how this design scales up. So I want to give a quick overview of the system workflow. At normal times, we periodically collect the flow demands from the end hosts -- those are the servers inside the DC -- about the demand they want to send to other datacenters, and we also collect the topology and traffic information from the SDN controller. Based on that, we periodically compute the resource allocation, to see how much we want to allocate to each flow and what forwarding plane changes we want to make. And if there is enough gain that we want to do an actual network update, then this is what we do. We first compute the congestion-free update plan -- the plan we want to move through in multiple stages to update the network without causing congestion. Okay, so yes?
>>: You said for scalability you do disaggregation.
>> Chi-Yao Hong: Yes.
>>: How does this show up here? Is it that the host controller is distributed? It's collecting
inputs from the host?
>> Chi-Yao Hong: This is kind of an abstraction, so the hierarchy is not shown here. In fact, the hierarchy is up here.
>>: Is it hierarchy from the host controller or also for the SDN controller?
>> Chi-Yao Hong: For our current implementation, the host controller has two layers and SDN
has just one layer, and with that we think we can scale up to today's size. All right. And the next step is, you notify the services that get a decreased allocation: hey, they can slow down now. That's because we want to start the network update, so you first slow down those who got a decreased allocation. And once this is done, you do the actual work to change the network forwarding plane, to change the load-balancing fractions across the different tunnels, and this could take multiple steps, of course. And once the network configuration is done, you notify the rest of the services, the ones that got an increased flow rate: hey, they can start sending faster, and you are done with this update. Then you go back to the first step, periodically recompute the service requirements and see if there is another gain to be had from another update.
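Putting those steps together, the periodic control loop could be summarized roughly as follows; the component APIs are placeholders I made up to mirror the workflow just described, not real interfaces.

```python
import time

def sdt_control_loop(host_ctrl, sdn_ctrl, optimizer, period_sec=300):
    """Schematic of the periodic workflow described above (placeholder APIs)."""
    while True:
        demands = host_ctrl.collect_flow_demands()        # from end hosts, via service agents
        topology, traffic = sdn_ctrl.get_network_state()  # topology + traffic from SDN controller
        alloc = optimizer.compute_allocation(demands, topology, traffic)
        if optimizer.worth_updating(alloc):               # only act when the gain is big enough
            plan = optimizer.congestion_free_plan(alloc)  # multi-stage, congestion-free sequence
            host_ctrl.throttle(plan.decreased_flows)      # 1. slow down shrinking services first
            for stage in plan.stages:                     # 2. change tunnels / split fractions,
                sdn_ctrl.apply_forwarding_update(stage)   #    one congestion-free step at a time
            host_ctrl.release(plan.increased_flows)       # 3. then let growing services speed up
        time.sleep(period_sec)                            # e.g., every five minutes
```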
So I just want to quickly give some evaluation results about this. We looked at the total
throughput, aggregated network-wide across all the flows, and this is relative to an optimal solution, where we assume an oracle that can get the flow demands with zero delay and can control the network forwarding plane with zero delay as well, so there is no transient congestion, and you don't have to worry about the scalability of the network. That's the optimal we try to compete with, and we show that the SDT design gets you a near-optimal solution in this context -- 99% of it in a particular setting. And compared with MPLS-TE, today's practice, we can carry 60% more traffic. If we try to decouple the benefit a little bit and look at the case where we don't have the ability to do end-host rate control, then we still get 20% more traffic as compared to MPLS-TE, and this is interesting for the case where you don't have the ability to control the edge. You still get a reasonable amount of benefit here, with this architecture.
>>: I think you mentioned this about the methodology. So what topology did you use?
>> Chi-Yao Hong: This is using the real topology given by today's inter-DC WAN. Yes.
Another result I want to bounce off you is the congestion-free update. Essentially, what happens here is that the y-axis shows the complementary CDF across all the update cases across different links, and the x-axis shows how much traffic is overloaded on a bottleneck link. That is the additional traffic you cannot carry because you get overloaded at the interface. So look at the one-shot update. What we call a one-shot update is that you don't care about the update order; you just issue all the updates at the same time and then see what happens in the network. What's more interesting is the non-background class here, where even user-facing traffic can easily get overloaded in some cases. We can see up to 15 megabytes. That is the additional buffer you would need to have, and that's something we don't have in commodity switches, so in practice, in the testbed evaluation, we also see a huge drop in throughput for the interactive traffic, because this transient congestion can happen. And for SDT, we see much better performance, where non-background traffic is totally congestion free, so there is no line here. And background traffic is still very much better than with the one-shot update. All
right, any questions before I move on to the next step?
>>: So the time you presented this at SIGCOMM, it was a presentation for Google, as well,
right?
>> Chi-Yao Hong: Yes.
>>: Could you contrast this work with that?
>> Chi-Yao Hong: B4, right? Essentially, we have a very different target than B4. We share some high-level common architecture with B4: the common part is that we both try to build out a software-defined architecture and try to make the network run more efficiently while satisfying the service requirements. The different thing is, essentially, that Google runs two WANs, and the one B4 operates on has less user-facing, interactive traffic. So one key, major challenge we solved here is how to ensure that the interactive traffic will always get through. That is less of a concern in their design, and they don't address it that much. So, for example, the congestion-free update I just mentioned, and also how to protect the interactive traffic in the network while at the same time you want to drive the utilization to 100% -- that's the hard part we are solving, and they didn't solve.
>>: [Indiscernible].
>> Chi-Yao Hong: Yes, essentially, their user-facing traffic goes over another WAN, and arguably that one does not give you 100% utilization. And the smart idea here is that you can actually put them together, and by delaying and shaping only the background traffic, you get much better results: at the same time you get high utilization, and you can still protect the service requirements for the interactive traffic. Yes.
>>: Another question, I know you don't have an answer here, but how does this [indiscernible]?
>> Chi-Yao Hong: Oh, you want to ask the differences?
>>: [Indiscernible] update?
>> Chi-Yao Hong: So the focus is a little bit different, and we are able to do rate control here,
and the context is this is a wide-area network and we are able to do things for background traffic.
>>: I have a question about the trend. So if you project five years from now, how is this
problem going to change? Are we going to have more datacenters? Are we going to have more
traffic running over the datacenters? Are we going to be able to scale better with just having the
centralized solution? Like how things will change, projecting into the future?
>> Chi-Yao Hong: Sure. Of course, we will have more datacenters. People are actually
building new datacenters, and you expect to have more traffic running and higher capacity in the
network, as well. But again, the fundamental challenge is still there. Those networks run at high capacity across continents; they are still very expensive, and you can't afford to do heavy over-provisioning for them, and that's why we still need this architecture. It's just that the
workload can change a little bit and the network scale will increase. Those are the things we are
trying to take into account in this design, as well.
>>: Do you think as we have more and more datacenters, does it make any sense to actually
stack and partition these datacenters and say some of them are going to handle user-facing traffic
and some won't?
>> Chi-Yao Hong: That could be one solution, but the thing is, that will be less efficient, in the sense that if you do, say, hard isolation, then inefficiency comes from that. Yes. All right,
how much time I've got? 11:27. So when should we make -- how much time should I make
stop, 10 minutes from now?
>>: Yes, [indiscernible], 11:45?
>> Chi-Yao Hong: Twenty minutes, 10 minutes from now?
>>: [Indiscernible].
>> Chi-Yao Hong: All right, all right. So I still want to spend some time talking about my future plan, and I will try to be brief on this part because of the time limitation. So this is ongoing work. I'm studying how to push the scalability limit further, to make central transport rate control more scalable and more real time. So essentially, what we see here is a tradeoff
between the scalability and flexibility. So if you look at today's network transport protocol, they
are mostly distributed, and they are very scalable in the sense you can do fine-grained for TCP
connection level scheduling, things like that, but they are not very flexible, right? And SDT is
quite flexible, but it's not that scalable. You cannot do fine-grained control. So the interesting
thing is, we try to push them both and see how far we can push towards more fine-grained and
large-scale control in real time. So just a couple ideas I want to talk about, and one is to be able
to scale up to the large datacenter network size, what we do is to do flow differentiation. We
handle long flows and short flows in very different ways. Long flows we handle centrally, while we let go of the control of most of the short flows in the network, and the intuition here is that the datacenter traffic distribution is very heavy tailed, where most of the bytes are generated by the long flows, while most of the flows in the network are short. So by doing so, we're able to improve the scalability by an order of magnitude while still controlling most of the bytes in the network. That's the rationale here.
>>: [Indiscernible] more data that's transferred?
>> Chi-Yao Hong: Yes, that's the amount of data you want to send.
>>: You always know that?
>> Chi-Yao Hong: You don't know that beforehand. So one easy way to predict it is what I will talk about just now. Okay. Yes, so this is the central architecture. We have a logically centralized controller that controls the end hosts' transport sending rates, and when a flow starts, we assume it's a short flow. So it initiates with whatever transport protocol you are allowed to use. You don't have to talk to the transport controller; if it's a short flow, it will finish in time. And in the network, we provision it to use high priority, so it doesn't have to compete with the other, long flows. And only when a flow lasts in the network -- when it has sent more than a certain number of bytes -- do we classify it as a long flow. Then it will send its flow demand -- hey, this is how fast I want to send in the network -- to the SDT controller. And the SDT controller computes the resource allocation based on the transport policies, like fairness and priorities, that the operator wants to enforce. And then, once that is computed, it allocates the rates back to the end hosts, and the end hosts will do rate limiting to enforce this allocation. The flow then falls back to sending at the rate given by the central controller, and this can happen multiple times across time: if other flows come in, then the flow rate can be updated. And it will also fall back to low priority in the network, so now it won't compete with the high priority. Yes.
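A minimal sketch of the end-host side of this long/short split, with a made-up byte threshold and controller API; the real mechanism (thresholds, marking, rate limiting) may differ in its details.

```python
LONG_FLOW_BYTES = 1 << 20      # hypothetical threshold for promoting a flow to "long"

class FlowAgent:
    """Per-flow behavior at the end host, mirroring the description above."""
    def __init__(self, flow_id, controller):
        self.flow_id = flow_id
        self.controller = controller     # stub for the SDT transport controller
        self.bytes_sent = 0
        self.is_long = False             # short flows start with default transport, high priority

    def on_send(self, nbytes, demand_bps):
        self.bytes_sent += nbytes
        if not self.is_long and self.bytes_sent > LONG_FLOW_BYTES:
            self.is_long = True
            rate = self.controller.request_rate(self.flow_id, demand_bps)
            self.apply_rate_limit(rate)  # fall back to the centrally allocated rate
            self.set_low_priority()      # stop competing with the short flows

    def apply_rate_limit(self, rate_bps):
        pass   # e.g., configure a local rate limiter such as tc

    def set_low_priority(self):
        pass   # e.g., change packet marking so switches serve short flows first
```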
>>: What's your user model?
>> Chi-Yao Hong: User model?
>>: As in are the flows minutiae or [indiscernible]?
>> Chi-Yao Hong: So for now, we assume -- we don't look at cases where the users can game the system, and that's something very interesting I'll talk about in the future plans as well, where users usually have an incentive to game the system, by either using multiple TCP connections or misreporting their actual flow demand. Splitting the traffic into multiple connections is something we want to deal with later.
>> Chi-Yao Hong: So another interesting idea for scaling this up is that, if you want to use a centralized controller that has limited computational power, to improve how fast we can recompute this resource allocation we run parallel flow rate computation algorithms. We use multiple threads to leverage today's CPU architecture and compute with multiple threads. So yes.
>>: So can you think of the [indiscernible] as essentially as [indiscernible], or is it a different TE
with a different code?
>> Chi-Yao Hong: It's more like a transport rate control target here we're looking at.
>>: [Indiscernible] variables and what's the goal?
>> Chi-Yao Hong: The goal can be specified by the operator. A couple of things are in our current implementation, including prioritization at the flow level -- you can prioritize as if you were able to emulate an infinite number of priority queues -- and also weighted max-min fairness, by emulating fair queues. Yes. All right. So the idea here is -- I'm running out of
time. I'll try to stay a little bit. So the idea is to run a flow-level simulation at the controller to compute how much rate to allocate to each flow. Let me give you just one simple example. This happens inside the controller, where you have a view of the whole network, and you have the input flow demands and where each flow goes. You know the path it takes, and then you want to decide how much rate you are going to allocate to this flow. So you have the input flow demands, and each link is handled by just one thread, and it does the resource computation -- for example, fairness or prioritization -- based on the input flow rates, how many flows come into this link, and also the configuration by the transport operator saying what policy to enforce, and it decides how fast each flow can send. Suppose there is some fairness concern here -- hey, the red flow should get higher throughput -- and it allocates 0.7 and 0.3 here. Then it hands this off, giving this information as the output flow rates, which become the input flow rates of the next set of links. And then, eventually, the different rates get computed until the results reach the destination, and that's the flow rate you are going to allocate and tell the actual sources to enforce.
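To make one link's step concrete, here is an illustrative version of what a per-link thread could compute: a weighted max-min (water-filling) share over the flows entering the link, with each flow also capped by the rate it arrives with from upstream. The structure and names are my own sketch of the simulation he describes.

```python
def link_share(capacity, incoming_rates, weights=None):
    """incoming_rates: dict flow -> rate arriving at this link from upstream.
    Returns the per-flow output rates, to be fed to the next links on each path."""
    weights = weights or {f: 1.0 for f in incoming_rates}
    out, active, remaining = dict(incoming_rates), set(incoming_rates), capacity
    while active:
        share = remaining / sum(weights[f] for f in active)   # fair share per unit weight
        capped = {f for f in active if incoming_rates[f] <= share * weights[f]}
        if not capped:                       # nobody is limited upstream: split what's left
            for f in active:
                out[f] = share * weights[f]
            break
        for f in capped:                     # these flows are limited by upstream links
            out[f] = incoming_rates[f]
            remaining -= out[f]
        active -= capped
    return out

# With capacity 1.0 and operator weights {"red": 7, "blue": 3}, two unconstrained
# flows come out as {"red": 0.7, "blue": 0.3}, matching the example above.
```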
>>: Are you going to synchronize across the biggest rates along the flow?
>> Chi-Yao Hong: Yes, interesting. Interesting question. One obvious thing is that, for each link, it's essentially multiple writers and one reader, so to be safe, the common case is that people put a mutex around it to protect against the concurrency issues. But what we have instead is a dirty bit to protect it. The bit means that if the link's input gets updated, the bit is set, so you have to recompute the allocation. Otherwise, if the dirty bit is zero, you don't have to recompute it, because it's clean, so you avoid unnecessary computation. And because it's just one bit, you can easily flip it in an atomic fashion. And we found that in this case, if the reader first clears the bit before reading the updates, and the writer marks it after its computation, then we get lucky, so we don't have to put a mutex, and theoretically, we showed that you won't get into the bad state. You won't have the case where the link is actually dirty but marked as clean, so you never fail to recompute. Okay.
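For what it's worth, here is one way the dirty-bit protocol he describes could look (a sketch of my reading of it, not the actual code): writers publish a link's new input and then set the bit; the reader clears the bit before reading and recomputing, so a racing write can at worst leave the bit set and cause one extra recomputation, never a missed one.

```python
class LinkState:
    """One per link; `dirty` stands in for a single atomic flag (one bit)."""
    def __init__(self):
        self.dirty = False
        self.inputs = {}              # flow -> rate arriving from upstream links

def writer_update(link, flow, rate):
    link.inputs[flow] = rate          # 1. publish the new upstream rate first
    link.dirty = True                 # 2. then mark the link dirty

def reader_maybe_recompute(link, recompute):
    if not link.dirty:
        return                        # clean: skip unnecessary recomputation
    link.dirty = False                # 3. clear BEFORE reading the inputs, so a
    recompute(dict(link.inputs))      # 4. racing writer re-dirties the link and the
                                      #    next pass recomputes again
```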
So this is the prototype we built for evaluation at UIUC. We used 13 OpenFlow Pronto switches and servers that can drive up to
112 gigabits per second. We used the Floodlight OpenFlow controller, and the servers ran on 112 Linux machines; we used tc for rate limiting and also iptables for packet marking. And
due to time, I will show you just one key result we found, about the controller scalability. Here the x-axis shows the network size, and we use a traffic distribution measured from a datacenter workload, where we classify flows based on flow size. Then we try to see, given a network scale, what is the minimum control interval we can support using a single desktop. The y-axis here shows the control interval on a log scale, under the assumption that we handle mostly the long flows, given the threshold there.
Yes?
>>: This [indiscernible], so I'm a little bit surprised that you stop at four threads.
>> Chi-Yao Hong: Oh, that's because our current desktop only has four cores, and that's why we don't see a huge improvement after four threads, but we are keen to test it on other computers as well. For now, we don't really know what the limitation here is. It just looks like we've got linear scale-up for the first four threads. That looks a little bit promising. Yes. So that's the case I want to highlight here. And another observation I want to show is that, with only a single desktop, we already scale up to several thousands of servers with sub-second control intervals, and, at this range, we're still able to handle more than 96% of the total network bytes while letting go of most of the short flows in the network.
>>: [Indiscernible] training task for the two [indiscernible] updates for flows? Like, if you have
a graph, rather than run a [indiscernible], you could go fall back on combinatorial algorithms.
Have you tried any of those, or you compared with them?
>> Chi-Yao Hong: What's the algorithm you have in mind?
>>: There are a lot -- for example, the people who did [vague] configuration. I don't have a
specific algorithm in mind, but it feels like there's a space of those --
>> Chi-Yao Hong: For now, we haven't tested or compared with other algorithms. That's
something it could be interesting to compare with. Yes. All right. Sure.
>>: I have a question over here. So if you look at where the y-axis equals one second, the red
thread can do 2,000 servers. Is my reading correct?
>> Chi-Yao Hong: Yes.
>>: And the blue line can do how many servers?
>> Chi-Yao Hong: Roughly 8,000.
>>: So it's linear. You basically -- yes.
>> Chi-Yao Hong: Yes. Is that the question?
>>: Yes. It just feels to me like it's perfect, right? It's like four threads do four times the -- so I
don't get the --
>> Chi-Yao Hong: You may get saturated later if you had more threads. Essentially, what the
current observation is, we have enough L2 caches, so most of the stages can be put in caches,
and that's why multi-threads can run things faster.
>>: But it seems like there's no overhead to the synchronization of the threads anymore, because
with four threads, you do four times the work of one thread, so it's --
>> Chi-Yao Hong: You may get saturated later, when the total state is larger. For now, the key
intuition we got is that we can fit everything in caches, and also we don't have to use a mutex, so
there's no blocking across threads. Yes.
>>: It also depends on the pattern of the flows. Flows are fairly non-overlapping, and you don't
have that much mutexes.
>> Chi-Yao Hong: Sure. Good point. I'm going to skip this demo; it just shows how this works
and the prototype evaluation, and I'll skip it due to the time limitation. So I want to briefly talk
about my future research plan, which is to extend the current design and try to build a cloud
network operating system. So what this is, is essentially
a whole collection of software to help you to manage the network resources in the cloud
networking and also provide the right interfaces to the network operators, to help them manage
their network more easily and run the network more cost-efficiently while providing high
performance. Okay. So what this is telling you is, essentially, we're replacing this part to enrich
this part, the purple part, to build a network operating system that allows you to do network
resource management on this layer and then expose API to the network applications, users,
operators, so they don't have to worry about what's the underlying distributed system inside the
cloud. They will express the requirements they need -- for example, a high-level policy they want
to enforce in the network -- without worrying about details like how to manage the switch
memory or how to set up the forwarding paths. They don't have to worry about it.
Yes. So this is a big direction and a big vision, and I want to point out a few more specific
research directions I want to look into. One of them is to make resource management more
efficient. For example, consider where you place virtual machines. Today, people want agility in
the network, so they don't have to worry about where they place a VM. But with the SDT
architecture, you have great knowledge about the network -- the planned and current traffic, and
also the current resources -- so you are able to do a much better job by placing a VM where it
does not compete with other VMs' traffic, making the whole allocation more efficient. That's
something I am very interested to look at. Another thing is middle boxes. As people also pointed
out, there are firewalls or even WAN optimizers you want to place in the network. At one
extreme, you can place those middle boxes everywhere, and you pay a high cost. At the other
extreme, you place just one central firewall, but then you have to redirect all the traffic to go
through that firewall, to make sure that traffic that does not satisfy the policy gets blocked. So an
interesting idea is how to integrate this with our SDT architecture -- both the placement question
and how you steer traffic through the boxes -- as a joint design with our current architecture, to
make the middle box placement more efficient. Also, network expansion, that is, where to add
new capacity.
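As a toy illustration of the kind of network-aware placement being described, here is a hypothetical greedy heuristic that scores each candidate host by the residual bandwidth on the paths toward the VMs the new one will talk to; it is only a sketch of the idea, not something proposed in the talk.

    # Toy network-aware VM placement (an illustrative heuristic, not an
    # algorithm from the talk). Prefer the host whose bottleneck residual
    # bandwidth toward the new VM's communication peers is largest, so its
    # traffic competes least with existing VMs' traffic.

    def place_vm(candidate_hosts, peer_hosts, residual_bw):
        """
        candidate_hosts: hosts with spare CPU/memory
        peer_hosts: hosts running the VMs this new VM will talk to
        residual_bw: dict (src_host, dst_host) -> spare bandwidth on that path
        """
        def bottleneck(host):
            if not peer_hosts:
                return float("inf")       # no network preference
            return min(residual_bw.get((host, peer), 0.0) for peer in peer_hosts)
        return max(candidate_hosts, key=bottleneck)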
>>: Isn't there a dynamic placement?
>> Chi-Yao Hong: It could be dynamic, yes. Yes. It depends on the current workload, for
example. It depends on how you implement it. If you're implementing the software, it could be
very flexible.
>>: So it's similar to VM placement.
>> Chi-Yao Hong: Similarly, similarly, yes.
>>: Similar to VM placement, you [indiscernible] inside the VM.
>> Chi-Yao Hong: So this is more like you can change your traffic matrix as an input to the
network resource optimization, and this is more like you're adding a new constraint to the
network when you do the network optimization. So they are all related, and people are doing
these in isolation, so what I propose is to look into this and build one operating system that solves
all these questions together and provides the right interface to the users. And this also includes
managing switch resources. One interesting idea is to compress forwarding rules, because we
have very limited memory, especially with [indiscernible]. But compressing the rules, for
example by merging rules that cover similar IP blocks, also may complicate rule updates. Let me
come back to updates first: how to make things run very efficiently while still staying transparent
to the network operators who want to add network transport policies. That's the key thing I want
to look at.
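A minimal sketch of what compressing forwarding rules by merging similar IP blocks could look like, assuming simple prefix-match rules that share one action; the rule format is hypothetical, and it ignores rule priorities, wildcards, and the update complication just mentioned.

    # Illustrative prefix-merging rule compression (hypothetical rule format).
    # Two sibling prefixes with the same action collapse into one coarser rule.
    # Real TCAM/OpenFlow compression must also respect priorities and updates.

    import ipaddress

    def compress(rules):
        """rules: dict of ipaddress.IPv4Network -> action; merged in place and returned."""
        changed = True
        while changed:
            changed = False
            for net, action in list(rules.items()):
                if net not in rules or net.prefixlen == 0:
                    continue
                parent = net.supernet(prefixlen_diff=1)
                siblings = list(parent.subnets(prefixlen_diff=1))
                other = siblings[0] if siblings[1] == net else siblings[1]
                if rules.get(other) == action:      # same action on both halves
                    del rules[net], rules[other]
                    rules[parent] = action          # replace with one coarser rule
                    changed = True
        return rules

    # Example: 10.0.0.0/25 and 10.0.0.128/25 with the same action become 10.0.0.0/24.
    r = {ipaddress.ip_network("10.0.0.0/25"): "port1",
         ipaddress.ip_network("10.0.0.128/25"): "port1"}
    print(compress(r))   # {IPv4Network('10.0.0.0/24'): 'port1'}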
All right, in the last two minutes, I want to talk about another direction I want to look into, which
is network performance versus application performance. What this is saying is that, if you look at
networking today, many people try to optimize network-level performance metrics, such as
throughput, packet delay, fairness or network utilization. Those are useful for network operators,
but applications care about something different. They don't care how fair it is. They don't care
what the utilization is. What they care about more is, for certain types of jobs, when the job gets
finished, gets completed, so they can send the result back to the users. So I want to look into how
network-level improvements in current protocols translate to the application level. Do
applications actually benefit from having nice, high-performance network metrics? If we have a
better understanding of this, it would be great, because it will help us decide the right abstraction
to expose to the application -- for example, what we want from the application, what they really
need there. All right, and the last
thing I want to briefly mention is designing incentive-compatible resource scheduling to make the
whole network resource allocation more efficient. Essentially, users have incentives to game the
network: they can claim to have infinite demand, they can claim, I've got the highest priority, this
is very time critical, or they can create multiple connections and say each one is a short flow, for
example. They try to game the system to get more resources, and that makes the whole resource
allocation worse, because the input is heavily biased, and no matter how good your resource
allocation is, if the input is heavily biased, then you cannot make very efficient decisions. So
there are a couple of challenges here I want to look into, including how to do mechanism design
to provide the right incentives to the users, and also system design challenges, like how to do
real-time usage monitoring in a lightweight, passive way, as we mentioned, and also, for those
applications that try to game the system, how can we penalize them? How can you, say, discard
their traffic or put them in low-priority queues? How do you do enforcement there? And also, we
want to ensure there is a worst-case performance guarantee for the cooperative services. If this
can be done, it will be great, because essentially what this is doing is encouraging people to
declare their traffic demand early, and that's also good for our network operating system, because
we can make much better decisions based on it. Yes. All right, so if you don't mind,
I can take just another minute to talk about the other related work I have done so far. I know I
have gotten to meet many of you here, and if you're interested in any of these papers, we should
talk in person. So those are the two papers I mentioned in most of this talk, and there's another
paper where we show how to do preemptive, more fine-grained resource scheduling inside the
datacenter network to complete flows faster and meet flow deadlines. And there's some early
work I have done in the wireless domain, also a resource scheduling problem, where we look at
how to do link scheduling under interference constraints, especially to improve spectrum
efficiency. And there are also network defense systems, where we look at a couple of network
security projects to build defense systems that improve network security. The first one is called
[Bitminer], where we look at interesting IP addresses that have lots of users behind them. Some
of them are legitimate users behind a proxy, gateway or middle box, but some of them are created
by attackers who try to abuse the system. So here, we try to do classification, and we make the
case that, by looking at their network-level activities, we are actually able to catch lots of bad
users who create proxy-like behavior. Also, there's an anomaly detection project I have done,
looking at trouble tickets and customer care tickets and trying to do anomaly detection on that
traffic, using both hierarchical heavy-hitter techniques and time-series analysis. There's another
early project, BotGrep, where we look at network communication traces and try to identify
whether there are peer-to-peer bots in the network, using a random walk technique and clustering
algorithms. There is a network management project I have done about understanding the root
causes of Internet clock synchronization inaccuracy, where we identify that the largest source of
inaccuracy comes from the asymmetry of the paths packets take in the Internet, and we propose
several solutions to compensate for that and improve the accuracy. And also there's a datacenter
topology design project, where we propose that you just connect your datacenter network
randomly and you get much higher total throughput than today's structured topologies like
[indiscernible]. So I'm happy to take any questions, and this concludes my talk
here. Sorry, I ran two or three minutes late. If there are any questions, I'm happy to take them.
Thank you.
>>: I have one question.
>> Chi-Yao Hong: Yes.
>>: Yes, I'm just kind of trying to understand your kind of perspective on this. Everybody these
days talks about building network operating systems.
>> Chi-Yao Hong: Sure.
>>: So as you were describing that, I was trying to parse apart what you think is different in your
version of a network operating system versus how it's being talked about by tens of other people.
>> Chi-Yao Hong: Yes, yes. Many people talk about, for example, building virtual topologies,
building network isolation and performance guarantees, building nice abstractions exposed
mostly to the cloud computing side, where you have customers you want to serve. For example,
in Azure -- they have a very different focus, right? They look at how to make network
programming easier if someone wants to set up web servers in a parallel fashion, how to integrate
it with .NET, for example. Those are interesting components, but they are not my main focus,
which is network resource scheduling. Essentially, what I want to look into is a little bit lower in
the layers, where we try to make the whole network run more efficiently by having a nice
resource allocation, including how to set up the middle boxes, how to set up the gateways, how to
set up the right network forwarding paths and how to set up the end-host rate control. Those are
the resource-related projects I have found especially interesting to look into. Does that answer the
question? If you want a more specific question, a more specific project I want to look into, then
those are the good cases, like where to place the middle boxes, where to add new capacity in the
network.
>>: Such as the [indiscernible] placement project, there's work coming out of Cambridge that
also talks about embedding VMs such that their requirements, network requirements, are met by
an underlying topology. How would what you're suggesting be different?
>> Chi-Yao Hong: The suggestion is that you should integrate that with the other interesting,
important factors in network resource allocation. For example, you are able to change the
network forwarding paths and the load balancing across them; that changes the solution quality a
lot. If you look only at where to place your VM to better match the current topology, you can do
very limited things, right? So the proposal here is to try to solve the problem jointly, make wise
decisions based on the multiple resources in the network, and provide a more globally efficient
network.
>>: On this issue of joint optimization -- it sounds appealing, but if you just look at operating
systems, would you say the operating system scheduler is aware of application performance? Or
is it an independent layer, a substrate?
>> Chi-Yao Hong: So there are things you can decouple, right? And there are things you cannot
decouple, for performance and efficiency reasons. If you look at operating systems, they do, say,
job-level scheduling, and of course they also do other things like task scheduling, memory
scheduling and CPU scheduling, somewhat in isolation. Those are the things you can decouple a
little bit without losing too much efficiency, but the things we are looking into here are closely
related and closely coupled. If you are not very careful with, say, VM placement, then essentially
what you get as input is the traffic demand matrix to the network.
>>: But given the operating system [indiscernible], I could come up with scenarios where if I
don't do joint memory and disk allocation, bad things will happen to application performance.
>> Chi-Yao Hong: So people do just that. People do just that. If you look at real-time operating
systems, they care more about each job's service requirement, like deadlines, and they do more
careful scheduling there, but not on a commodity, normal PC, where jobs don't have that kind of
critical deadline to meet. So if you put in additional effort there -- of course at a cost, since you
have to gain knowledge about the different components -- you can potentially do a much better
job. Yes.
>>: Let's thank the speaker again.
>> Chi-Yao Hong: Thank you.