>> Ratul Mahajan: Well, good morning. Thank you, everybody, for coming for
what should be hopefully the last talk of this marathon talk week.
Today we have Ethan from University of Washington just across the lake. Ethan
is going to actually tell us about, like -- his work is very interesting. He's solved
some, like, longstanding problems with the internet, but in a very practical and
usable manner. And he's going to tell us about that.
But apart from being a really good networking researcher, I can testify he's also a really
good skier. He spent a few years being just a ski bum somewhere in Utah or
Montana. I forget. I've seen him ski. He's very good. So that's kind of another
reason for us to, you know, get him here. He can improve our average skiing
level.
>> Ethan Katz-Bassett: I'm also happy to take questions about skiing if there's
anything particular that you're working on. I was a ski instructor so I can pass
some of that on.
Thanks Ratul. Thanks, everybody, for coming.
So I'm going to try to convince you that the internet is much less reliable than it
needs to be but that we can make substantial progress both in improving its
availability and performance without requiring any architectural changes, and I'm
just going to talk about a set of distributed systems that I've built to get us
towards that.
So I want to start out just with a simple survey. Can you raise your hand if you've
used email or used the internet since you got to this room?
All right. So many people.
How about in the last hour?
More people.
Today?
Probably just about everyone. Anyone who didn't raise their hand is probably
just busy checking right now so they didn't actually notice that I was asking.
So my point is that we rely on the internet. And I think even though we all know
that we depend on the internet, it's still astonishing when you see some of the
actual numbers. So over a quarter of the world's population is online today.
In the time that it's going to take me to deliver this talk, humankind collectively is
going to spend over a millennium on Facebook. And actually in the 10 minutes
that it took me to put together the statistics for this slide Kanye West sent three
new tweets.
There's also a lot of money involved. So companies like Google, like Microsoft
are making billions and billions of dollars in ad revenue. And because we
depend on the internet so much we really need it to be reliable.
So traditionally we've used the internet for applications like email, like web
browsing. E-commerce sites have depended on it being reliable in order to deliver
their business. Now we're accessing data and applications in the cloud, we're
using the internet to watch movies, to make phone calls, and we really want our
cloud data to be reliably available, we want our movies and phone calls to be
smooth. But, you know, I use Skype to try to do meetings with my advisors
sometimes and we've all experienced having to, like, hang up multiple times
before it works well.
And in the future we're moving towards a world where we're even more
dependent on the internet. We're going to have more of our applications in the
cloud, we're going to have most of our communication over the internet and we're
going to want to run critical services across the internet. For these applications
we really need to be able to rely on it. We need high availability and we need
good performance. So we're just asking, you know, are we ready for that?
I'm going to show you that the current answer is no but that we can take some
concrete steps towards improving it. So I'm going to start out just by talking
about the current situation.
So actually we all track internet availability every day. Seems like usually most
things work. Sometimes they don't work. But each of us is only checking from -- we're only seeing a small slice. We're only checking from work, we're checking
from home, and there's actually hundreds of thousands of networks that we could
be checking.
Places like Facebook and Microsoft have people working full time to try to maintain availability, but they still struggle with these problems sometimes, so we can really ask: what about everybody else? I decided to assess reachability on an internet scale continuously, and I built a system called Hubble that checked over 100,000 networks around the world and it checked them from a few dozen
computers around the world.
Each of these vantage points would periodically check to see if its route to all the
other networks still worked. All of these networks were available from all of the
vantage points at the beginning of the study. And as I started making these
checks I realized that there were problems happening all the time all around the
world.
At any point in time there are locations all around the world that are reachable
from some places and not reachable from others. So this is a snapshot that was
generated automatically by my system, Hubble. Each of the balloons here
represents a network that had previously been reachable from all of the Hubble
vantage points. It had become unreachable from some of the vantage points
while remaining reachable from others.
So we know, since it's still reachable from some vantage points, that it's not just that the network is offline. And these partial outages shouldn't really happen. There's
some router configuration that's keeping some of the vantage points from being
able to reach the destination.
So in this case the darker the balloon, the longer the problem had lasted. So the
lightest color is the yellow. Those have been going on for up to eight hours. The
darkest color is the maroon. Those have been going on for more than a day.
So there's these really long-term problems that aren't being fixed, and if you're
actually behind one of those problems you're basically stuck.
>>: How do you know it's a router configuration issue and not just not having
enough redundancy at the level of policy [inaudible] and you could just
disconnect it without any kind of configuration error?
>> Ethan Katz-Bassett: So we know that it's not just a hardware failure because
hardware failures can only result in this partial reachability if there's a network
partition, and there's not a partition because both the one that could reach and
the one that couldn't can talk to the University of Washington. It could be policy
and the policy is encoded in the router configuration. So in that case it's not
necessarily a misconfiguration but it's an issue with the configuration of routers.
In other words, if they reconfigured the routers, changed the policy, then it would
become reachable.
So I'm not getting super specific about whether it's a misconfiguration or not, but
it has something to do with configuration. It's not just a fiber cut.
>>: So are these all permanent nodes?
>> Ethan Katz-Bassett: We were checking from the Planet Lab nodes to every
network that we could find a pingable address in.
>>: So these [inaudible] are just to some network address in sort of matched
network?
>> Ethan Katz-Bassett: Yeah.
>>: And how do you check the [inaudible]?
>> Ethan Katz-Bassett: So we -- the vantage points each periodically ping to see if the destination is still responsive. If it isn't responsive from one of the vantage points for a couple pings in a row, then we trigger traceroutes from all of the vantage points and we see whether the traceroute reaches the network.
So we don't care if it reaches the particular end host. We just see if it reaches the network. So in all these cases there were some traceroutes from some vantage points that reached the network and in most cases also reached the end host. And some of the other traceroutes weren't even getting to that end host, so they weren't getting the whole way across the internet to the network.
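(As a rough illustration of the loop just described -- not the actual Hubble code -- here is a minimal sketch in Python. It assumes the standard ping and traceroute command-line tools and a hypothetical target address; Hubble itself runs this from dozens of vantage points and counts a path as working if it reaches the target's network, not the end host.)

```python
# Minimal sketch of the monitoring loop described above (not Hubble itself).
import subprocess

TARGETS = ["192.0.2.1"]    # hypothetical pingable address in a monitored network
CONSECUTIVE_FAILS = 2      # "a couple pings in a row"

def ping_ok(addr):
    """Send one ping; True if the target answered within 2 seconds."""
    return subprocess.run(["ping", "-c", "1", "-W", "2", addr],
                          capture_output=True).returncode == 0

def traceroute(addr):
    """Return traceroute output so we can inspect how far probes get."""
    return subprocess.run(["traceroute", "-w", "2", addr],
                          capture_output=True, text=True).stdout

for target in TARGETS:
    if all(not ping_ok(target) for _ in range(CONSECUTIVE_FAILS)):
        # Hubble triggers this step from *all* vantage points and compares
        # which of them can still reach the target's network.
        print(traceroute(target))
```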
>>: So are these problems close to the edge network or it's somewhere in the
middle of it?
>> Ethan Katz-Bassett: So many of them are fairly close to the edge, but they're
not right at the edge, because if it actually reaches the last network then we don't
consider it a problem anymore. Some were in the middle. I honestly don't have
a great characterization of how many were in the middle and how many at the edge,
but I think the instinct is that a lot of them were towards the edges, right?
>>: Yeah, I think once [inaudible] edge because the [inaudible].
>> Ethan Katz-Bassett: So we tried to eliminate as many of those as possible by
using every available technique for mapping to the correct AS, and it's certainly
possible that some of these cases were issues like that, but some of them when
you actually looked at them were obviously happening closer to the middle of the
internet.
>>: So one thing's not clicking. So --
>> Ethan Katz-Bassett: Can I just finish one more sentence on the next question quickly?
There's some statistics in the paper on how many looked like they were in the
last but one AS, and I don't remember the numbers off the top of my head, but it
wasn't all of them. And so those other ones, it couldn't be this boundary issue at
least at the edge AS because it wasn't even looking like it was in the previous
one.
Yes? Sorry.
>>: Okay. So there are two possibilities -- I'm trying to understand which of two cases we're in, which is: for these networks you can't reach, is it the case that at least one of your sample points can't reach them, or that all of your sample points can't?
>> Ethan Katz-Bassett: So it's the case that some can and some can't. And we
had a threshold of a couple not being able to, because if just one couldn't, then it could just be a problem with that particular vantage point.
>>: In every case at least someone else could -- someone in the sample set
could reach it?
>> Ethan Katz-Bassett: Yes. We tracked the ones that are completely
unreachable. But for the ones on this map, they're partially reachable, because
those are the more interesting cases.
>>: Okay
>> Ethan Katz-Bassett: So in fact it's not just these outages that matter.
Sometimes you can actually get to something, but it's just too slow to use. And
providers are aware and concerned about these slow routes.
So I have up here statistics. Google published a paper in 2009, and one of the
stats they had in there was that 40 percent of the connections from Google
clients had a round-trip network latency of over 400 milliseconds, and that's just
the network latency from the client to the Google data center and then from the
data center back to the client. It doesn't include any processing at the data
center and at the client.
And this is even though Google has data centers all around the world and tries to
direct each client to the one that gives it the lowest round-trip time.
So to give you some idea of 400 milliseconds, it's as if the traffic was leaving the
client here in Seattle, going around the equator to the data center and then
leaving the data center and going around the equator again before it goes to the
client.
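(The arithmetic behind that image, as a quick sanity check with rough figures -- light in fiber covers roughly 200,000 km per second and the equator is roughly 40,000 km around:)

```python
# Back-of-the-envelope check of the "twice around the equator" image.
# Rough figures; ignores routing, queuing, and processing delays.
EQUATOR_KM = 40_000         # approximate circumference of the Earth
FIBER_KM_PER_SEC = 200_000  # light in fiber travels at about 2/3 of c

lap_ms = EQUATOR_KM / FIBER_KM_PER_SEC * 1000
print(f"one lap ~{lap_ms:.0f} ms, two laps ~{2 * lap_ms:.0f} ms")  # ~200 / ~400
```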
These slow routes have a direct impact on business. I have there a statistic from
Yahoo. An extra 400 milliseconds of latency led to an additional 5 to 9 percent of
people clicking away from the page before it even loaded, and this leads to a
direct loss in business.
So these huge companies have slow routes even though they have a big
monetary incentive to have high performance. They need a way to try to
understand these problems, and they currently lack the means to guarantee
performance.
Yes?
>>: So 400 milliseconds is really big. Based on the [inaudible] we did from our
data center to the edge of the network, almost [inaudible]. I suspect most of
these are due to the latency of the last route rather than the internet. If the
latency of the last route [inaudible]
>> Ethan Katz-Bassett: Sure. So they had some further statistics, and that's not -- I'm not trying to say that that whole result is what I would call bad routing. It's just trying to point at the problem that they're concerned about. And they had some more statistics in the paper where they tried to break that down. It's a good paper. Worth the read.
They found that I think -- so that statistic was on client connections. They also
look at prefixes, and I think they found that 20 percent of prefixes had what they
considered to be an inflated latency that was addressable through routing
means.
So the current internet doesn't deliver reliable availability or performance. That
means that we can't really use it for critical services. And if you talk to operators
at major providers you'll learn that they end up spending a lot of their time trying
to explain and fix these problems.
There's been lots of interesting and important work done on how we could
redesign the internet protocols to improve this situation, and in fact I've worked
on some of that.
But we're also really stuck with the internet that we have now. We need to improve the internet that we've got because there are so many entrenched interests; we want to improve the services that exist on it now and then enable future services.
So I'm going to show you how we can actually do that without requiring any
changes to the protocols or the equipment. There are two components to my approach.
First, I build tools that provide better information, and operators can use these to troubleshoot the problems. Then, once you have these tools, I build systems that use the new tools as well as the existing ones to try to help move us towards an internet that can actually fix itself.
The problems that I address really arise from routing. They're difficult to detect, diagnose, and repair. I'm going to explain why later in more detail, but for now there's
two big issues.
First, the internet is one of the most complicated systems that humankind has
ever built, and it has emergent behavior. And, second, it's a huge collection of
these independent networks. They have to work together to enable routing, but
the operators don't necessarily share information.
So to improve this situation we have to address a range of research challenges.
We have to be able to understand topology and what routes are available, we
have to monitor availability and develop ways to improve it, and we have to
understand performance and troubleshoot problems.
I've worked on systems to address all of these challenges. Today I'm just going
to talk about a few of them, just for time reasons. I'm going to talk about my
reverse traceroute system, which lets you measure the path that anybody is
using to get to you, and I'm going to give a brief example of how you can use it to
troubleshoot performance problems, and then I'm going to talk about a set of
techniques that I'm working on to address availability problems.
So that gives the basic outline of the talk. In order to understand the work, I'm
going to first have to walk through some background on how routing works and
what tools are currently available just so you can understand how the problems
arise, and it's going to be very high level. Most of you are already going to know
it.
Then I'm going to talk about reverse traceroute. It's a measurement system that
I built that provides information that you really need to troubleshoot the types of
problems that I'm talking about. I'm going to show how operators can start using
it to debug performance problems.
And then the final part, as I said, I'm going to turn to availability problems. I'm
going to talk about some systems that I'm working on now that use reverse traceroute to make steps towards automatic remediation of outages.
So I want to make sure we have the same basic view of what the internet looks
like. It's a federation of autonomous networks. Each of these clouds is one
network.
For a client up at the University of Washington to talk to the web server down at
the bottom there traffic is going to traverse some set of these networks. Within
each network it's going to traverse some set of routers.
Now, many of these routes tend to be stable for pretty long periods of time, even
days and days is fairly common. And I'm going to give an overview of how these
routes are established.
So BGP is the Border Gateway Protocol. It's the internet protocol used to establish these inter-network routes. The way it works is the web server's ISP will advertise the set of addresses that it has available, and it will tell this to its neighbors.
So now AT&T has learned the route, and now AT&T has a direct route to the web server's ISP. It will advertise that to its neighbors.
Now Level 3 and Sprint have routes through AT&T to the web server's ISP, to the
web server. They're going to advertise those on. So now the University of Washington has learned two routes.
Now, one key with BGP is that routes generally aren't chosen for performance
reasons. BGP lets each network use its own arbitrary policy to make decisions,
and it's an opaque policy so it's generally going to be based on business
concerns.
So in this case let's say University of Washington has a better deal from Level 3,
so it chooses that route. So now traffic from the client at University of
Washington to the web server is going to follow some path like this.
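(A toy sketch of that selection step -- hypothetical names, and nothing like the real BGP decision process, which weighs local preference, path length, and other attributes; the point is only that the choice is an opaque local policy, not a performance metric:)

```python
# Toy model: UW has learned two AS-level paths to the web server's ISP
# and applies its own opaque, business-driven policy to pick one.
paths_learned = [
    ("Level3", "AT&T", "WebServerISP"),   # learned from Level 3
    ("Sprint", "AT&T", "WebServerISP"),   # learned from Sprint
]

def uw_policy(paths):
    # Not performance-based: say UW's contract makes Level 3 cheaper,
    # so prefer any path whose first hop is Level 3.
    return min(paths, key=lambda p: 0 if p[0] == "Level3" else 1)

print(uw_policy(paths_learned))   # ('Level3', 'AT&T', 'WebServerISP')
```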
>>: And the whole path is transmitted out? So UW can actually make the
decision and say it doesn't want any traffic to be through Sprint?
>> Ethan Katz-Bassett: It sees a network-level path, yes.
>>: [inaudible]
>> Ethan Katz-Bassett: It receives the message that essentially it looks like that
and it has which addresses are at the end of that path. Exactly.
>>: But at some point [inaudible]
>> Ethan Katz-Bassett: It's not necessarily enforced, but it will follow that path.
But at least it's seeing the route that it thinks it's selecting.
>>: I guess you don't have a good case here, but if you have a case where
depending on whether it went through L3 or Sprint that there was a third network,
then if it went through Sprint it wouldn't go through a third network, but if it went
through L3 it would go through the third network, it sees enough information that
it could decide to send packets through Sprint --
>> Ethan Katz-Bassett: Yeah. And that's a great example of the opaque policy.
It's allowed to use whatever policy it wants to make its decisions.
>>: [inaudible]
>> Ethan Katz-Bassett: It could make decisions based on what appeared in the
path. Generally not, it doesn't, but it could.
>>: But there is sufficient information in the protocol that it could, right?
>> Ethan Katz-Bassett: Yes.
>>: But it can't actually specify -- the only [inaudible]
>> Ethan Katz-Bassett: It can't enforce the path that it takes, but it could select
the path based on where it thought it was going to go and try to avoid these
paths if it wanted to.
So another consequence of this policy-based routing is that paths are often
asymmetric. The way that that can arise is the University of Washington will
advertise what addresses it has available with a path that just says, hey, send it
to the University of Washington. It tells that to its neighbors, so now Level 3 and
Sprint have direct paths. They're going to advertise that onto AT&T, and, again,
AT&T can use whatever policy it wants.
Let's say it prefers the path through Sprint. It's going to select that path, advertise
that on to the web server's ISP, and the web server's ISP has a path through AT&T to
Sprint to the University of Washington. So now traffic back to the client will go
this way and we have asymmetric paths.
So portions of the paths may overlap, but generally they're often at least partly
asymmetric.
Now, each of those clouds that I had represents a network, and in the case of big
ones like AT&T or Sprint, they actually span multiple continents. Each of those
networks is made up of routers, so we have to go from this internetwork path to a
router-level path.
Here we have the University of Washington's path that it selected which went
through Level 3 and AT&T to the web server's ISP. I have a representation of the
Level 3 network shown in red, AT&T shown in blue.
One possible path through this network looks like this. So Level 3 carries it
across the country and then gives it to AT&T. That might be the shortest path
through the network. But, again, the individual networks aren't going to
necessarily be optimizing performance across the internet so it actually might be
more likely that the traffic goes like this. Level 3 is going to give it to AT&T as
soon as it can because it incurs costs for carrying it. This way AT&T will incur
the cost of carrying it across the country.
So the point is that the end-to-end performance that you get, and also the availability you get, depend on the interdomain routing decisions and also the intranetwork routing decisions.
This means that the types of problems that I'm talking about are problems with
routing. So a performance problem might be a geographically circuitous route
like this, and an availability problem might be when you have a route advertised in
BGP but when you actually send traffic along it, it doesn't reach the destination.
So we're going to need to drastically improve the performance and availability to
enable these future applications we want so we really need a way to understand
routing to try to troubleshoot the problems that arise.
One attribute of the internet's protocols is that they don't naturally expose that
much information. So you might have good visibility into your own local network,
but it's hard to tell what's happening in other networks.
Furthermore, those other networks don't really have an incentive to tell you
what's going on. So if Sprint has a problem, they don't really have an incentive to
inform AT&T exactly what the problem is.
That means that we need tools that can measure the routes given the restrictions
of what the protocols and other networks are going to make available.
So traceroute is one such tool. Probably many of you have used it. Traceroute
lets you measure the path from the computer that's running it to any network.
And it's widely used by operators and researchers.
I'm going to give a basic overview of how traceroute works. So all internet
packets have a time-to-live, or TTL, field. The source sets the value. Each
router along the path decrements that by one. If it hits zero the router is going to
throw out the packet and it's going to source an error message and send it back
to its original source. So traceroute is going to manipulate these TTL values to
build up the path.
Here we're trying to measure the path from S to D, and the first thing traceroute
is going to do is send out a packet and set the TTL to 1. It's going to get to some
first router, say F1, F1 decrements the TTL value to 0, it's going to discard the
packet and generate an error message.
It's going to send this error message back to S and when S gets the error
message it can look at it, see that it came from F1, and now it knows that F1 is
the first hop on the path.
Then traceroute's just going to continue this process. It's going to send out a packet, now with TTL 2. It gets to F1, F1 decrements it to 1, sends it on; it gets to some F2, F2 decrements it to 0, throws out the packet, sources an error message, and sends it back to S. Now S knows the path goes through F2. And we can just continue this until we've built up the whole path.
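(For concreteness, here is a minimal version of that loop sketched with scapy -- it needs root privileges, and real traceroute adds retries, several probes per TTL, and UDP/TCP variants:)

```python
# Minimal traceroute sketch using scapy (run as root).
from scapy.all import IP, ICMP, sr1

def simple_traceroute(dst, max_hops=30):
    path = []
    for ttl in range(1, max_hops + 1):
        # The TTL expires at the ttl-th router, which discards the packet
        # and sources an ICMP time-exceeded error back to us.
        reply = sr1(IP(dst=dst, ttl=ttl) / ICMP(), timeout=2, verbose=0)
        if reply is None:
            path.append("*")           # this hop didn't answer
        else:
            path.append(reply.src)
            if reply.src == dst:       # echo reply from the destination
                break
    return path
```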
So if operators at the web server think that they're having problems with some
client they can use traceroute to measure the path to the client. But as we said,
routes on the internet are generally asymmetric, and the problem could be on the
path from the client back to the web server, and traceroute doesn't give any
visibility into this path.
First of all, the client's computer is going to be the one setting the TTL value and
it's going to set it to a normal value. So there's not going to be these error
messages generated, and the web server operators don't have any way to
control that.
And then even if there were error messages generated, they're gonna go back to
the client, and so the web server isn't going to observe them.
Now I'm going to give a real example of how that limitation affects operations.
I'm then going to show how we can address this limitation with my reverse
traceroute system without requiring any modifications to the existing protocols or
equipment.
I started out with some statistics about how many slow routes Google is
observing, and they built a system to try to address this. So this is an example taken from their paper about that system. In this example they have clients in a set of networks in Taiwan, shown in red there, and they were experiencing
really slow network latencies of 500 milliseconds.
So the main thing that Google does to try to have good performance to their
clients is replicate their content at data centers around the world and send the
client to a nearby one.
So the first problem that their system looks for is maybe actually the client's being
served by a distant data center. And they can actually look at the logs at the
data center. In this case they verified in fact the client was being served by a
nearby data center in Asia.
Another potential problem is that the route from the data center to the client could
be really indirect. But their system was able to verify with traceroute that actually
it was taking a really direct path from the data center to the client.
Now, the communication is two-way. Paths are likely asymmetric. So at this
point they assume that the problem is on the reverse path from the client back to
the data center, but they don't have any visibility into that path.
So what they concluded was that to more precisely troubleshoot these problems they
need the ability to gather information about the reverse path back from the clients
to Google.
In fact, there's been a widespread recognition of the need for this reverse path information. So I attended a network operators' troubleshooting tutorial, and I got this quote from that tutorial: the number one go-to tool for troubleshooting is traceroute, but asymmetric paths are the number one plague of traceroute, and that's because the reverse path itself is completely invisible.
So you can get a good idea about a problem when it's on the forward path, but
you're really flying blind if it's on the reverse path.
That gives the motivation for our reverse traceroute system. We want a way to
measure the path back to us from the client without requiring control of the client
and we want it to require no new functionality from routers, and we want it to use only
existing internet protocols.
So this is my reverse traceroute system that I built, and now I'm going to walk
through the basics of how it works.
So here's the setting. We want to measure the path from D back to S. We
control S, and we can install software there, but we don't control D.
So the first thing that you might ask is why not just install software at D. And
actually a few years ago we had extensive conversations both with Microsoft and
with Google to try to convince them to do something like this and they really
weren't willing to do that to get the information that way. They wanted a way to
work with what was available and enable operators to fix the problems without
requiring the participation of the client at all.
So, again, what can we do in this setting? Well, we can measure the forward
traceroute, but the path is likely asymmetric so it's not clear that that gives us that
much information about the reverse path.
The next thing that you might think to do is use other computers around the world
that we actually do control, and that's what we're going to do. We do have
access to other vantage points. In my case I use Planet Lab, which is a test bed
that gives you access to computers at a couple hundred universities around the
world.
These aren't going to directly give us the path from D, but they're going to give us
a view that's unavailable from S and also we can combine the view of the
multiple vantage points to get a better view than is available from any one of
them.
There's only a limited number of them, so what can we do with them? Well, we
can start issuing Traceroutes to them to destinations around the world and we
can build up like an atlas of what routing looks like. And one set of paths that we
can measure are the paths from our vantage points back to S.
Our idea is to use the fact that internet routing is generally destination-based.
What this means is the path from anyplace in the network depends only on
where the packet is in the network and where it's going. It doesn't depend on
where it came from.
So this means that if we intersect one of these blue paths that we know about,
we can assume that we follow it the rest of the way back to S. So these paths
aren't going to give us the full path from D, but they're going to give us a baseline
of paths that we can use to bootstrap off of.
>>: So how do you deal with [inaudible]
>> Ethan Katz-Bassett: So there are a couple options there. We can use or we
do use [inaudible] techniques to try to expose the multiple options that are
available, and then when we return a path you can return the multiple options.
The other thing is that people have relied on traceroute even given these sorts of
multipath limitations now, and so I think just building a reverse traceroute system
that gives you a view into at least one of the paths is a great starting point,
and now we're starting to look into techniques to expose the multiple paths.
>>: [inaudible]
>> Ethan Katz-Bassett: They don't make that available. And most -- Akamai has pretty good visibility, but even then, they can't necessarily measure the path from arbitrary routers where they don't have a presence, and anyone else who isn't Akamai is going to have many fewer points of presence than that. So they're not even going to have that view that Akamai does.
>>: I guess my assertion is Akamai probably has better edge coverage than
Planet Lab.
>> Ethan Katz-Bassett: Akamai certainly has better edge coverage than Planet
Lab. But I'm going to show you how we can use Planet Lab's vantage points, the
limited number of them, and actually bootstrap them to get much better coverage
than you'd think we might be able to.
So now we need a way to build from D until we hit one of these paths that we
know about, and actually destination-based routing is going to help us do that
also. If we're able to learn that the path from D goes through some router R1,
because of destination-based routing, we only need to measure the path from R1
back to S. We can ignore D.
So this means that we can build the path back incrementally. If we learn that it
goes from R1 through R2 and R3, now we just have to measure from R3 back to
S. If we learn it goes from R3 to R4, now we've hit a path we know about, we
can assume that we follow V1's path back to S and we've successfully stitched
together the path incrementally.
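(A sketch of that stitching logic under the destination-based-routing assumption; the atlas entry is hypothetical, and measure_next_hops stands in for the option-based probes described next:)

```python
# Stitch the path from d back to s, segment by segment, stopping as soon
# as we intersect a path already in the traceroute atlas.
atlas = {"R4": ["R4", "R5", "S"]}   # hypothetical known path back to S via V1

def reverse_path(d, s, measure_next_hops):
    path, current = [d], d
    while current != s:
        if current in atlas:
            # Destination-based routing: once on a known path to s, we
            # assume the traffic follows it the rest of the way.
            return path + atlas[current][1:]
        segment = measure_next_hops(current, s)   # e.g. ["R1"] or ["R2", "R3"]
        path += segment
        current = segment[-1]
    return path
```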
So I left out one key piece, right? How do we get each of these segments
that we're stitching together? We need something in the internet that's going to
let us measure a path a hop at a time. And IP options are going to give us that.
IP options are a built-in part of the internet protocol. You can enable them on
any packet, but we're the first to recognize that you can use them to build a
reverse traceroute system.
The first IP option that we're gonna use is the record route option. Record
route allocates space in the header of the packet for up to 9 IP addresses. The
first 9 routers along the path will record their IP address.
Now, the key with IP options is that they're going to be copied over into the
destination's response. So if we reach D within 8 slots, the remaining slots will
fill up on the reverse path.
So to give an example of that, let's say that the paths look like this. With record
route we can get these yellow hops. We get five hops on the forward path, the
destination for the sixth hop, and then the remaining three slots fill up on their
reverse path.
So this is great. So if we're near enough to use record route we can get a few
hops at a time and stitch together the path.
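(A sketch of such a probe, assuming scapy's IPOption_RR class with all nine slots pre-allocated -- needs root, and as discussed below some routers drop or ignore optioned packets:)

```python
# Record-route probe sketch (run as root; assumes scapy's IPOption_RR).
# Routers on the way out fill slots, the destination copies the option
# into its reply, and routers on the way back fill the remaining slots.
from scapy.all import IP, ICMP, IPOption_RR, sr1

def record_route_probe(dst):
    opt = IPOption_RR(routers=["0.0.0.0"] * 9)   # reserve all nine slots
    reply = sr1(IP(dst=dst, options=[opt]) / ICMP(), timeout=2, verbose=0)
    if reply is not None and reply[IP].options:
        return reply[IP].options[0].routers      # forward hops, then reverse
    return []
```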
The problem is that the average path on the internet is about 15 hops in each
direction, 30 round-trip. And so what are we going to do if we're too far away to
use this?
>>: [inaudible]
>> Ethan Katz-Bassett: So the best place to look for statistics on that is a SIGCOMM 2008 paper called DisCarte by Rob Sherwood. I don't know the exact number off the top of my head, but essentially they found that there was widespread enough coverage that we started thinking about what we could use record route for to enable some of these techniques.
There certainly might be some networks where we don't have good visibility into
them. The key is that because we're just trying to measure back until we hit one
of these traceroute paths that we already know about, we often don't have to
measure that many hops.
>>: So in your study did you find that the majority of the paths support this
option?
>> Ethan Katz-Bassett: There are some networks that do, there are some that
don't. Lots and lots do. I think that -- I believe that the coverage just in terms of
networks that support it is over 50 percent. But I'm not -- I don't know the
exact number off the top of my head. I'm going to have some results later on in
terms of accuracy, and you can sort of infer from those that we're getting good
coverage out of our techniques because we are able to measure the paths using
the combination of techniques.
>>: [inaudible] does the packet just get copied with the existing recorded routes or do you lose all the information --
>> Ethan Katz-Bassett: So there are different --
>>: [inaudible] failure mode is that you get nothing or you get a silent [inaudible]
>> Ethan Katz-Bassett: Sure. So there are potentially both behaviors. There
are some places where the packet will be filtered and dropped, there are other
places where it might be forwarded and just you won't see that particular hop.
Similarly, with traceroute there are some routers that don't source these TTL
expired packets and so you get some blind hops there.
So in the DisCarte paper they have some statistics quantifying how many routers
have each of these different behaviors, and that's sort of what we built off of for
that.
In our paper we have some statistics on these reasons, like how many routes
might be hidden from -- how many hops might be hidden from traceroute, how
many might be hidden from record route.
>>: The destination operating system network stack has to respect record route
and understand that, on the response, it has to copy over what it previously received?
>> Ethan Katz-Bassett: Yes.
>>: Okay. And that's why [inaudible] does Windows support it?
>> Ethan Katz-Bassett: I believe so. We don't tend to probe to end hosts. We
probe to the router right before the end host because that last hop will be
symmetric anyway. But it's a built-in part of the IP protocol so in general
everyone has -- our machines we tend to test on Linux, and it's certainly
supported there.
So this is going to work great if we're close. In the case when we are not close,
that's where we use the fact that we have distributed vantage points.
What we can do is find a vantage point that's close to the destination. So let's
say in this case that V3 is close to the destination. It's going to send out the
record route probe, but it's going to spoof, which means that it's going to set the
source address to S's address.
So we're the first to use this source spoofing to sort of separate out the forward
and reverse path and measure reverse path information.
So what's going to happen here, V3 is going to send the record route probe to
the destination. When it gets there some number of the slots will be filled out,
let's say seven, D will add itself in, and then it's going to reply, but it thinks that
the packet came from S so it's going to reply back to S.
When the packet gets back to S, S looks at it, it's now learned that R1 is on the
reverse path.
So this source spoofing is going to let us use any vantage point that's close to the
destination regardless of whether or not it's the particular source that we're trying
to measure to.
And now because of destination-based routing we can just repeat this process.
Now we need to find a vantage point that's close to R1. Let's say that V2 is close
to R1. It's going to send the packet, spoof, and claim that it came from S. When
it gets back to S we've now learned that R2 and R3 are on the reverse path.
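(The same probe as the vantage point would send it, sketched below -- addresses are hypothetical, and of course many networks rightly filter spoofed traffic, which comes up a bit later:)

```python
# Spoofed record-route probe sketch, sent from a vantage point near D.
# The source field is set to S, so D's reply -- carrying the partially
# filled record-route option -- travels the reverse path toward S and
# fills its remaining slots along the way. S, not the sender, reads it.
from scapy.all import IP, ICMP, IPOption_RR, send

S = "198.51.100.10"   # hypothetical source we're measuring back to
D = "203.0.113.5"     # hypothetical destination near this vantage point

probe = IP(src=S, dst=D, options=[IPOption_RR(routers=["0.0.0.0"] * 9)]) / ICMP()
send(probe, verbose=0)   # the reply goes to S, which extracts the hops
```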
Now we have to measure from R3 back. And suppose we don't have a vantage
point that's close to R3. We might still have some idea about what the set of
possible next hops are, and that's because, as I said before, we're measuring
paths from our vantage points to destinations around the world to build up an
atlas of routing.
So we can look at this atlas, and let's say previously we've observed a path like
this, we've observed another path like this. This means that R4 and R5 are good
candidates for where this path might go next.
So now we need a way, given these likely candidates, to verify which one
is right. And we're going to use another IP option to do that. We're going to use
the time stamp option.
The time stamp option lets you specify up to four IP addresses in the packet, and
if routers with those IP addresses are traversed in the sequence that you specify,
then they should record their time stamps.
The key here is that it's ordered. So the way we're going to use this is S is going
to send out a time stamp probe to R3 and it's going to ask first for R3's time
stamp and then for R4's time stamp. Because they're ordered, if R4 is traversed
on the path going to R3 it won't record a time stamp because R3 isn't recorded
yet.
So this packet is going to get to R3, R3 is going to record a time stamp, and then
it's going to reply back to S. When it gets back to S, in this case let's say that R4
has recorded a time stamp, we now know that R4 was on the reverse path.
So we don't care about the particular value of the time stamp. We're just using it
as a check of whether or not that router was traversed.
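(To make the ordered check concrete, here is a sketch that builds the raw timestamp option bytes per RFC 791 -- option type 68, flag 3 for "prespecified addresses" -- with hypothetical router addresses:)

```python
# Build raw IP timestamp option bytes (RFC 791): type 68, length, pointer
# to the first free slot, then an overflow/flag byte where flag 3 means
# "only the prespecified addresses record timestamps, in order."
import socket, struct

def timestamp_option(prespecified_addrs):
    # Each entry is a 4-byte address followed by a 4-byte timestamp slot.
    entries = b"".join(socket.inet_aton(a) + b"\x00" * 4
                       for a in prespecified_addrs)
    return struct.pack("BBBB", 68, 4 + len(entries), 5, 3) + entries

# Ask first for R3's stamp, then R4's (hypothetical addresses): R4 only
# stamps if it is crossed *after* R3, i.e., on the reverse path toward S.
opt = timestamp_option(["192.0.2.3", "192.0.2.4"])
```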
So we now know the path goes through R4, we've intersected a path that we
know about. At this point we're going to assume, with destination-based routing,
that we follow V1's path back and we successfully stitched together the reverse
path.
So I've now shown you how we can address this key limitation of traceroute and
measure the reverse path even though we don't have control of the destination.
There are a number of keys -- yes?
>>: A lot of access networks won't allow packets out if they have source
addresses that are not on the access network to prevent that kind of spoofing.
Do you find that that's a problem with a lot of the, for example, Planet Lab
networks?
>> Ethan Katz-Bassett: So it's certainly the case that spoofed probes are filtered some places. The most complete study of that is from MIT; they have this spoofer project where you can check from your machine whether or not you can spoof. They found about 25 to maybe a third of -- 25 percent to 33 percent of sources can spoof, and we found similarly with Planet Lab.
The key here is that you just need -- if a particular vantage point can spoof, it can
generally spoof as everybody. And so you just need a set that can spoof that are
well distributed and you can bootstrap the whole system. You don't need any
particular vantage point to be able to spoof.
>>: So you just throw out immediately any vantage point that happens to not be able to --
>> Ethan Katz-Bassett: Yeah. We check which ones can and retain the ones that can, and I think, it's on the next slide, we use about 90 right now.
So we're using these multiple vantage points to get a view that's not available
from any particular one, and we're using spoofing to let us select the vantage
point that's in the best position to make any particular measurements.
You need to address a number of things to actually make this technique work well. So, for instance, some routers don't process these IP options correctly. But we're able to account for this incorrect processing and still get useful information in many cases. We have techniques for doing that.
Similarly, some ISPs filter the probe packets, and we have techniques to avoid
some of these filters and improve the coverage of the system.
Finally, you want to be able to measure without having to issue too many probes,
and we have techniques that let us intelligently select particular vantage points
so that we can sort of maximize the return on any particular probe that we send.
So reverse traceroute is a real deployed system. I think as of this week we're
issuing around 100,000 reverse traceroutes a day. In our current deployment
we're using about 90 Planet Lab and Measurement Lab sites as our vantage
points. We're using Planet Lab's sites as the sources, so this means that you
can now measure paths back from anywhere to Planet Lab. You can actually try
out the system with a few resources at revtr.cs.washington.edu.
Now, when none of our vantage points is able to measure a hop from a particular
destination using any of our techniques we have to make an informed guess for
that particular hop and then measure back from that one hop. So this means that
the coverage that we get is actually tied to the set of vantage points that we
have.
What we've found in evaluating the system is that you actually need the full set of
techniques to get good coverage. So if you don't use time stamp or you don't
use record route or you don't use spoofing, the coverage will immediately drop
enough that you no longer have a useful system.
The overhead of the system is reasonable both in terms of the amount of time
that it takes to make a measurement and in terms of the number of probes. It's
about 10 times that of traceroute right now.
I'm going to go into accuracy in a little bit more detail. So operators -- yes?
>>: I was wondering, do you have a sense of the shelf life of reverse traceroute
with respect to where the trends are as to what the routers are doing, like
[inaudible] is there a longitudinal sense at all?
>> Ethan Katz-Bassett: So there was a study done a number of years ago where they concluded that IP options weren't supported enough to be useful. In recent years DisCarte used record route, we've used time stamp and record route, and we've seen better usage than they did previously. So I think that the
trends are in our favor.
Also, I've talked about this system to many operators and have presented at operator conferences, and lots of them are signing up to use the system, and I think that hopefully that should lead to sort of good will within the operations community, where best practices are to enable these options so that you can make these measurements, just like best practices now are to support traceroute in your network so that other people can debug problems that they're having.
So because operators are used to using traceroute, we'd like our system to return results equivalent to what you would get if you actually had access to the destination and were able to issue a direct traceroute.
Now, we don't have access to ground truth information for the internet as a
whole, but we can actually evaluate this in a controlled setting. So we're going to
look at paths between Planet Lab nodes, we're going to compare a reverse
traceroute to a direct traceroute.
For the reverse traceroute we're going to assume that we don't have control of
the destination and use the technique that I just outlined. For the direct
traceroute we're just going to log into the destination and issue a traceroute
and compare those two paths.
This graph shows the result of that comparison. On the X axis we have the
fraction of the direct traceroute hops that you see. And the graph is a CCDF, so
any particular point on the graph says that -- it gives you the fraction of the paths
that had at least that fraction of accuracy.
Right now without access to reverse traceroute information operators fall back to
assuming that the forward path and the reverse path are the same, so the black
line gives how well you do if you just assume the path is symmetric. In the
median case you get 38 percent of the hops right. So you're really not getting
that much information.
The red line is our system. In the median case you get 87 percent of the hops
correct. And so this whole shaded region is the benefit of our system.
>>: [inaudible]
>> Ethan Katz-Bassett: We use traceroute servers as part of our system. You
can't issue the options probes from them, but we actually use them to build up
our traceroute atlas.
>>: [inaudible]
>> Ethan Katz-Bassett: Yeah, we have results on that too in the paper. So the
line -- if you sort of find the traceroutes or the path that's the closest to one that
you're trying to measure, the line is somewhere in the middle there. And we also
use those in evaluating accuracy. Just as we did Planet Lab to Planet Lab paths,
we did traceroute server to Planet Lab paths and it was almost the same. The
line is a couple percentage points different.
>>: What are the sources of [inaudible]?
>> Ethan Katz-Bassett: So the two main sources are that -- so an ideal line, just
to make sure everyone's on the same page here on the graph, is a vertical line
at X equals 1. So what this graph shows is that in most cases we get most of the
hops right.
There are two main reasons we don't get that vertical line at X equals 1. The first
is that routers have multiple IP addresses and it can be hard to identify if two IP
addresses are in the same router. So one technique might be observing one IP
address. Another technique might be observing another IP address. So in those
cases we're essentially undercounting. We're getting it right, but we just don't
realize it.
The second main source is that when we're not able to measure, when we don't
have coverage of a particular network, we have to assume that that particular
hop is symmetric and measure from that one hop back. So this means that the
accuracy of our system will improve as the coverage grows. And in fact I've
helped Google deploy a version of reverse traceroute now and they have access
to more vantage points than I do, and so as you would expect, they get better
coverage than I do on my system.
>>: So what about the AS level, if this is the IP level?
>> Ethan Katz-Bassett: So I -- in the paper we evaluate pop-level accuracy, so sort of in between those two. In the pop-level accuracy I believe in the median case we get 95 percent of the pops right. Maybe -- actually it might even be 100 percent of the pops right. It's in that range.
I helped Google do an evaluation on AS level accuracy and essentially what you
see there is that in cases where you have the coverage and are able to make
measurements from every hop, I think it's like 98 percent of the time you get the
AS path right, and most of the time when you don't it's sort of off by 1, so it might
be the sort of cases that you're talking about where you're unclear on the
boundary which AS you're actually in.
So I've shown you now how we can build a working reverse traceroute system. I
started out with this problem that Google was having with their clients in Taiwan,
and now I'm going to walk through an example of how you can use reverse
traceroute to address problems like that.
So in this case I measured a round-trip latency of 150 milliseconds from a Planet
Lab node in Florida to a network in Seattle, and this is about two or three times
what you might expect. Now, the network is in Seattle, but it wasn't affiliated with
the University of Washington. It was just sort of random that it happened to be
there.
With traditional tools, what an operator would do is issue a forward traceroute
and see if the path is indirect. So that's the first thing that I did.
So if you look at this path, you see that it starts in Florida, goes up to D.C.,
comes back down from D.C. to Florida and then continues across the country.
So this detour up to D.C. and then back down to Florida is explaining about half
of the inflation. But with traditional tools you only get a partial explanation. It
seems like the rest of the problem is on the reverse path, but you really have no
idea exactly what's going on or who to call to get it fixed. With reverse traceroute
we can actually measure that path. So that's what I did.
So if you look at the reverse path you see it starts in Seattle, goes down from
Seattle to L.A., comes back up from L.A. to Seattle and then it continues across
the country via Chicago.
Now, if you look closer at this path -- so this detour down to L.A. explains the rest of the inflation. If you look closer at the path, you see what's happening. It's starting out in Seattle in Internap, it's going down on Internap to L.A. In L.A. it's switching over to TransitRail, coming back up to Seattle on TransitRail.
So it's not necessarily a misconfiguration. It could actually be that TransitRail and Internap only connect down in L.A. and so the traffic has to go down there. But I was able to verify with a traceroute from the University of Washington that actually they peer here in Seattle as well.
And I was able to talk to operators at these ISPs. They verified that this was an
unintentional configuration. So it might be that maybe it was out-of-date
information or something like that.
This type of misconfiguration is a common cause of routing problems, and it's
effectively operator error. There's this manual reconfiguration going on all the
time, it's easy to get it wrong, it's hard to understand the consequences of any
change you make, it's hard to understand the interactions of your change with the
rest of the network.
Without access to both directions of the path you really have no way to
understand what's going on or who to contact to fix it. With access to reverse
traceroute even I, as a grad student, was able to talk to people who could fix this
path.
>>: Talking to the people seems to be the problem, right? How do you call up
the ISP and get the right person on the phone? That's a much harder problem.
>> Ethan Katz-Bassett: It's a different type of problem, right? It would be hard
for me as a grad student to go around debugging all the problems of the internet
using this tool. But actually in the Google system that I used to sort of motivate
this section of the talk, that's actually what they're trying to get is they want
something to point at the problem, and their assumption is once they can point at
the problem they can do something about it. They have the clout and the
connections to call them up, or they can change the routes, or they can install
more capacity or whatever it was. So the main thing that they wanted was a way
of classifying what the problem was and where it is.
>>: It seems like you have to [inaudible] showing one network to the entire rest
of the world every place where [inaudible].
>> Ethan Katz-Bassett: So that's essentially --
>>: -- generate the list of, you know, here's all the networks that have got to be fixed.
>> Ethan Katz-Bassett: Sure. Sure. So this was me debugging one particular
path. The paper that they have at IMC 2009 is essentially that system, and the
problem is that that system needs reverse path information as part of its input and it didn't
have it. So that's how they do it. And presumably people at Microsoft do
something similar where you want to prioritize where the problems are and come
up with solutions that can fix multiple problems and things like that.
So the lack of reverse path information has been identified as a key limitation by
Google, by operators, by researchers. You really had no way to get this
information before, so anyone who cared about internet routing was really only
seeing half of the information.
I've now built this reverse traceroute system that lets you measure the path to
you without controlling the destination, and I gave an example of how you can
use it to debug performance problems. Google is now starting to use it in that
context.
Reverse traceroute is also useful in a range of other settings, including
availability problems. And I'm now going to talk about how we can --
>>: Before you move on, I'm curious, this whole tool was developed from a sort of end user's point of view, right? Why don't the network operators use something -- because they have access to [inaudible] if they really want
to inject something there, it's a lot easier for them to do. Is there no incentive for
them to do it?
>> Ethan Katz-Bassett: They have access to routers in their own network, but
they don't have access to routers in other people's networks, and people in the
other networks don't have much incentive to give them much information about
what's going on in their router.
>>: If this is a big problem, why don't the network operators join together and
solve it?
>> Ethan Katz-Bassett: So public traceroute servers which [inaudible] talked
about are sort of a step towards that. If networks make available a website, you
can go and issue traceroutes from their vantage points, but there's only about
1,000 of those, and I think it's just hard to get to the type of coverage you really
need to debug problems across the internet.
One direction of my work that I'm not really going to talk about today is we're
starting to look at what are small tweaks you might make to the IP protocols, so
maybe a new IP option that we can add or something like that that would make
some of these measurements easier and expose some of that information. And
we're hoping that we can eventually get buy in. But it's a long, slow process to
get any of that -- to have that work at the scale that they need, you need to get
standards changed, you need to get router vendors to buy in, you need to get
everybody to upgrade their routers to support it. So it's not just going to happen immediately, and the stuff I'm talking about today is really how we can start
solving these problems right now using what's available right now.
Did you have another?
>>: Just a quick followup. You talk about the incentives, right? It seems there's
already enough incentive for the different ISPs to support, for example,
traceroute, like the standard traceroute.
>> Ethan Katz-Bassett: Right.
>>: Obviously the reverse one would be harder, right? But in some sense -- I
mean, if there was no incentive whatsoever to collaborate in some sense --
>> Ethan Katz-Bassett: There's sort of this tension where I think the operators are often willing to cooperate, but then there's also parts of the company that --
>>: [inaudible]
>> Ethan Katz-Bassett: Yeah. So, I mean, I presented all of this work at
operators conferences before I even wrote the paper to try to get buy-in, and I
think there's been dozens of operators signed up to try to use the system, things
like that. So I think that once they see that you can start making these
measurements -- you know, it's sort of like that quote said. People viewed it as a
fundamental limitation that just this information was invisible. Once you start
showing them how you can actually see it, the hope is that we can get better
support, get people buying in, and then maybe eventually they will move towards
easier solutions for providing this type of information.
So now I'm going to talk about how we can start to build a system around reverse
traceroute that's going to automatically identify, locate, and avoid failures. And
it's all work in progress. So I'm going to sketch out where we are now, where
we're going with it.
I'm going to start out with an example of what operators do now. So this is a
quote from an email to the network operators' Outages mailing list that I subscribe
to. This operator thought that he was seeing problems in Level 3's network in
D.C., didn't really know what was going on, and so sent an email to the mailing
list to ask people for help. And in the email he included a traceroute. So this is
essentially his traceroute.
It started out at his home in Verizon in Baltimore, it went down to D.C., switched
over to Level 3's network and then trailed off. So when the traceroute trails off, it
means that he didn't get a response back from the destination, so it looks like
maybe there's a problem in Level 3 in D.C.
But assuming that the destination is configured to respond, it could be that the
destination isn't receiving the probe. It could also be the destination is receiving
the probe, but the operator isn't receiving back the response. And with
traceroute alone you can't tell the difference between these cases.
So what do the people on the mailing list do to help this guy? Well, they start
issuing their own traceroutes and sending them to the mailing list. So here's
another one of those.
So this traceroute starts out again in Verizon but in D.C. in this case, switches
over to Level 3, goes from Level 3 in D.C. to Level 3 in Chicago to Level 3 in
Denver and then it trails off.
So, again, first it looked like the problem was in D.C., now it looks like the
problem is in Denver. Did the problem move? Are they two separate problems?
Are the problems on the reverse path?
With traceroute alone you really don't know which one of these cases you're in.
And actually nobody on the mailing list ever indicated that they had any idea
what was going on. They just sort of each sent out their traceroutes for a while
and then, you know, a couple hours later the problem resolved and nobody ever
really gave an explanation. So we'd really like to do better there.
So in ongoing work, I'm taking a three-step approach to that. First, you need to
identify that a problem is going on. So that's monitoring and figuring out the
problems.
Once you know that there is a problem you want to be able to locate where in the
network the failure is. Once you've located the failure, you'd like to be able to
reroute traffic around the failure even if the failure is in a network that you don't
control.
So as we saw on that Hubble outages map earlier, many of these problems are
lasting for hours, sometimes even days. So if we give operators better tools they
could likely fix those problems much faster. But the outage is still going to persist
until it's fixed. And so even with better tools, operators are going to be fixing
these problems on a human time scale. We'd like to be fixing the problems
within minutes or seconds instead.
So once we have these pieces for monitoring, location, and remediation, we can
start putting them together as building blocks for a system that's automatically
going to repair failures, and that's what I'm going to talk about.
First I want to give a characterization of the duration of outages. So these are
results from a two-month study monitoring a diverse set of targets around the
internet. This graph shows the duration of the problems that we observed. So
on the X axis we have duration in seconds, and it's on the log scale. It's a PDF.
So any particular point on the graph shows the probability of an outage of a
particular duration.
There are two points that I'd like you to get from this. First, most of the outages
are short. So 90 percent of the problems last less than 700 seconds.
Second point: The distribution has a long heavy tail. The remaining 10 percent
of outages that lasted over 700 seconds accounted for 40 percent of the
unavailability.
So this means that to make a substantial impact on improving availability we
really need to address problems across time scales.
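To make that tail arithmetic concrete, here's a minimal Python sketch of the computation; the durations and threshold here are made-up illustrations, not the study's actual data:

```python
# Hypothetical illustration of the heavy-tail computation: what share of
# total unavailability comes from the outages lasting over 700 seconds?

def tail_share(durations_s, threshold_s=700):
    """Return (fraction of outages over threshold, their share of downtime)."""
    long_ones = [d for d in durations_s if d > threshold_s]
    return (len(long_ones) / len(durations_s),
            sum(long_ones) / sum(durations_s))

# Made-up sample: many short outages plus a small number of long ones.
durations = [120] * 900 + [3600] * 100
frac_long, downtime_share = tail_share(durations)
print(f"{frac_long:.0%} of outages exceed 700 s; "
      f"they cause {downtime_share:.0%} of the unavailability")
```

The point the sketch makes is the same as the measured one: even when long outages are a small fraction of events, they can dominate total downtime.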
>>: Do you have a sense of why the graph, the shape is so weird? Why does it
go up and then go down?
>> Ethan Katz-Bassett: As opposed to continuing to go up?
>>: As opposed to [inaudible] [laughter]
>> Ethan Katz-Bassett: Oh, you mean why don't you see more short -- like much
shorter problems?
>>: Well, it seems there's a preference for a certain duration window.
>> Ethan Katz-Bassett: Yeah, so it could be that -- the short-term outages could
have something to do with how long routing protocols take to converge. It could
also potentially be an artifact of the particular
way we measured. I don't know. The only point that I'm trying to get on the
graph is that there is a long heavy tail. I'm not going to talk about the short-term
problems today.
So basically we take a two-pronged approach to addressing these outages, the
short ones versus the long ones, and it's partly because you can use different
solutions. You can imagine on long-term outages you can actually get a human
involved to try to repair them. That's not going to work on short-term outages
because people aren't going to be able to react quickly enough. And it's also partly because the
underlying cause is different, so it's already known that there are these
short-term outages after failures during routing protocol convergence.
And we actually worked on a system called Consensus Routing to address
those. So in that case it was pretty well understood that it was a protocol
problem. People had already done the measurements, and we just built the
system to address it. And today I'm going to focus just on the long-term
problems. In that case they're less understood so the first thing that we had to do
is come up with measurement techniques to identify characteristics of them.
So identifying the characteristics of those long-term problems is the first goal here.
We want to monitor problems across the internet on a global scale continuously.
And I'm going to refer to these long-term problems as black holes: in these cases
there's a BGP path available, but when you send traffic, it persistently doesn't
reach the destination.
So this is the system Hubble that I showed the snapshot from earlier. Hubble
monitors networks around the world continuously. It's the system that I built, but
just to save time, I'm not gonna go into any details of how it works. I'm just going
to talk about how many of these long-lasting black holes we saw over time.
So we did a two-year measurement study with Hubble. We monitored 92 percent
of the edge ISPs in the internet. So what I mean by an edge ISP is one that
hosts clients, hosts services, but isn't acting as a provider for any other ISPs.
During that time we saw over one and a half million black holes. So each of
these cases is a network that had previously been reachable from all of the
Hubble vantage points, became unreachable from some for a period of time and
then became reachable from all of them again.
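As a rough sketch of that definition, the classification logic looks something like this, assuming hypothetical `probe` and `has_bgp_route` helpers; this is illustrative, not Hubble's actual code:

```python
# Hypothetical sketch: classify a monitored network by reachability from
# the vantage points. A "black hole" is the case where a BGP route is
# still advertised but probes persistently fail from some vantage points.

def classify(target, vantage_points, probe, has_bgp_route):
    """probe(vp, target) -> True if repeated probes from vp reach target."""
    reaching = {vp for vp in vantage_points if probe(vp, target)}
    if len(reaching) == len(vantage_points):
        return "reachable"
    if not has_bgp_route(target):
        return "withdrawn"           # no route at all: a different failure mode
    if reaching:
        return "partial black hole"  # some vantage points in, some out
    return "complete black hole"
```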
This graph shows the duration of these problems. So the X axis is duration in
hours on a log scale. It's a CCDF, so a particular point shows what fraction of
the problems lasted for at least some number of hours.
So the purple circle there shows that 20 percent of the problems lasted for over
10 hours. So we were really astounded at how many of these problems there
were, how long they lasted, how many of the networks they affected. So over
two-thirds of the networks that we monitored experienced at least one of these
problems.
Dealing with them requires someone first noticing that the problem exists, then
they have to figure out what's going on, then they have to do something about it.
All of these steps take a long time with current tools.
So, for example, last month one of CNN's websites was out for over a week.
We'd really like to do better.
The first thing we need to do is try to locate one of these problems. So we saw in
the Level 3 example from the outages mailing list that it's hard to understand
failures with current tools because it's hard to understand the output of those
tools.
I'm going to walk through what we can start to do to get better information there.
So up here we have a destination D. It's at USC's Information Sciences Institute.
X, y, and z are Hubble vantage points around the world. Previously they'd all
been able to reach D.
The system detected that there might be a problem, all of the vantage points tried
to reach it, and what we see is X can still reach the destination, Y and Z no
longer can.
So what can we do to start understanding the failure? Well, the first thing we can
do is just group together the paths. So in this case we see that paths through the
Cal State University network are still reaching the destination. Paths through
Cox Communication are failing with their last hop in Cox.
So it looks like maybe there's a problem with Cox Communication. But actually it
could be that the traffic is reaching the destination and that it's failing on the
reverse path back. Traceroute alone doesn't tell us which direction the failure is
in even if we have symmetric paths. Plus the paths might be asymmetric, so it's
possible that the reverse path isn't even going through Cox.
So with traditional tools we couldn't differentiate these cases. With reverse
traceroute we can actually tell which of these cases we're in.
The key is that reverse traceroute doesn't actually require the source to send any
of the packets. We can send the packets instead from X. We know that X has a
working path to D, so it's going to send a packet spoofed as Z. Because it has a
working path, the packet's going to reach D, and D is going to reply back to Z. So if D
has a working path to Z, then Z should receive the response, and that's what we
saw in this case.
So now we can use reverse traceroute to measure that complete working path.
So now we know the failure is on the forward path: Cox isn't forwarding traffic on
to the destination.
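Mechanically, the spoofed probe looks roughly like this scapy sketch. The addresses are placeholders, it needs raw-socket privileges, and the deployed system only spoofs as other consenting measurement hosts:

```python
# Hypothetical sketch of the spoofed probe with scapy. X, which still has
# a working forward path to D, sends an echo request whose source address
# is Z; D's reply therefore goes to Z, testing the reverse path D -> Z.

from scapy.all import IP, ICMP, send, sniff

D = "192.0.2.10"    # destination (placeholder documentation address)
Z = "198.51.100.7"  # vantage point whose reverse path from D we test

# Run at X:
send(IP(src=Z, dst=D) / ICMP(type="echo-request"))

# Run at Z: if the reply shows up, the path D -> Z works, so the failure
# must be on the forward direction toward D.
replies = sniff(filter=f"icmp and src host {D}", timeout=10)
print("reverse path D -> Z works" if replies else "no reply seen at Z")
```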
Now, we'd like to understand what happened to cause this failure. If we only
make measurements after the failure starts, like the operators on the mailing list,
we don't know what change there was that triggered this.
So what we do is build up a historical view of what routing looks like, so then
when there is a failure, we can look at what routes looked like before the failure.
So one possibility is the previous route didn't even go through Cox, the route
changed over to go through Cox, and the Cox path never worked. In this case
what we observed was actually the previous route did go through Cox and it went
through a router R. Now it's failing right before R.
Y observed a similar path: previously it went through router R, and now it's failing
right before R.
So it seems like there's a problem at router R. It's still advertising the path, but
then the path's not working.
We only know which router to blame because we're measuring this historical
information. And in fact in most cases we found that you need to have access to
this historical information to understand one of these problems.
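A minimal sketch of that comparison against a stored pre-failure path, with hypothetical hop names:

```python
# Hypothetical sketch: compare the historical (working) path with the
# current (failing) one to see where the traffic now stops.

def localize(historical_hops, current_hops):
    last = current_hops[-1] if current_hops else None
    if last in historical_hops:
        i = historical_hops.index(last)
        if i + 1 < len(historical_hops):
            return f"stops at {last}; used to continue to {historical_hops[i+1]}"
    return "current path diverges from the historical route"

# Both Y and Z used to route through R and now stop one hop short of it:
print(localize(["cox-a", "cox-b", "R", "isi-dest"], ["cox-a", "cox-b"]))
# -> stops at cox-b; used to continue to R
```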
So this is ongoing work; we're running the system continuously. Now I'm just
going to give an overview of some of the preliminary results.
>>: In the last example you can't really blame R, right? It could be some other
router that [inaudible] the route that R had. If you're trying to do router-level
localization, that seems not correct in a lot of cases.
>> Ethan Katz-Bassett: So a few things. First of all, we know that R is still
advertising some path because the traffic is still going to R. Second, we're not
necessarily trying to say that R is responsible for the failure. What we're really
trying to do is show paths that aren't working so we can start reasoning about other
paths that are working, so we can route around the failure. We don't control Cox
Communications, so maybe we can tell them, hey, it looks like there's a problem
with R. But really what we want to know is that paths through Cox
Communication aren't working now so that we can start exploring other
alternatives that are working that maybe we can use even though we don't
control Cox. And I'll get into how we do that in a second.
>>: I guess the question was, like, yeah, so I realize [inaudible] Cox has a
problem, but what I'm trying to get at -- I'm just trying to understand is telling Cox
that R is the problem, does it buy you anything more than telling Cox that this is
the source and destination pair that's not working?
>> Ethan Katz-Bassett: You can imagine that we might observe other paths
through Cox that do work. In this particular problem we didn't, but there might be
other problems where we did. And so it seems like there's some value in saying
the sets of paths that used to go through this particular router aren't working to
help narrow down where the problem is and also to let us reason about other
paths through Cox that might be working if we have an alternative of using a
path, say, from a different peering point that isn't going to go through R.
We're not trying to say anything definitive about R. We're just trying to come up
with some cause that seems like maybe it explains the problem so at least we
can associate Y and Z with each other. If we didn't know that those previously
went through R then we wouldn't even have a way of associating Y's problem
with Z's problem. I mean, it would look like they were in different parts of the
network because they aren't overlapping at this point.
>>: But doesn't your traceroute information tell you that -- those yellow
paths that you see there, you can follow those with a traceroute up to a point,
but not up to R, right?
>> Ethan Katz-Bassett: Exactly. Those are the pieces of information that you
don't get if you're just using traceroute, if you haven't measured that --
>>: [inaudible]
>> Ethan Katz-Bassett: And then you also don't know that the problem isn't on
the reverse path. If you're just using traceroute, it could be that --
>>: [inaudible]
>> Ethan Katz-Bassett: Yes.
>>: I'm really questioning whether you need historical information. It seems like
if you go to Cox and you say, look, here are two paths, I can traceroute them
each up to this point and no further, so these routers, wherever they're pointing
to next on this path, they're not getting there -- that's a lot of information for Cox
to be able to identify the next router.
>> Ethan Katz-Bassett: Sure. Let me give a slightly different example that
maybe is more compelling that we also observed. We have statistics in the
paper.
Imagine instead that these paths are going through different networks and both
dying right before Cox. If we don't have access to information about what the
routes looked like historically or what routes are being advertised, we don't know
that those two are converging on the same point, and so there we won't have this
common information about Cox. And so we might even try to select a path that
bypasses those networks but goes into Cox, and we might not suspect in those
cases that it wouldn't work.
>>: Okay
>> Ethan Katz-Bassett: It's less about that -- we're not claiming that we're pinning
down the exact cause. It's more about gathering as many types of
information as we can to try to reason as best we can about what's going on.
Because you really have such a limited view from externally.
So we're running the system continuously now. Just to highlight one point that
we've observed: more than two-thirds of these outages are partial. So we have
some vantage points that can reach the destination, some that can't.
And as I was saying before, these partial outages aren't just hardware failures.
It's not just a backhoe cutting a path. There's some configuration that's keeping
the sources that can't reach the destination from finding the paths that do work.
So in these cases, identifying where the failure is and identifying these alternative
paths that exist is a big win. If we actually can give this information to operators,
they can get in touch with the networks that seem to be causing the problems and
they can essentially yell at these people until they fix it.
But we'd really like to be able to make these repairs faster without requiring the
involvement of these other networks that you have to call up and yell at. Working
paths seem to exist. They're not being found. It's a big opportunity. How can we
find them automatically?
In this case let's say that operators at the web server realize that they were
having problems with some of their clients. We're going to look at the easy case
first.
Suppose the failure is on the forward path. In this case what the web server can
do is just choose an alternate path that doesn't go through Level 3. So maybe it
will route through Qwest instead. That avoids the failure. They can also direct
the client to a different data center that has working paths.
The harder case is when the failure is on the reverse path back to the web
server.
>>: Did you say web server chooses the path or the web server's ISP?
>> Ethan Katz-Bassett: Yeah, the web server's ISP. Sorry, I'm sort of --
>>: [inaudible]
>> Ethan Katz-Bassett: Yeah. It's not the web server itself. And maybe I should
have named the web server's ISP differently from the web server.
So this is the harder case. The operators in the web server's ISP don't directly
control which path the University of Washington is choosing, but they want a way
to signal to the University of Washington to choose a different path that doesn't
go through Level 3.
They want to send something like a "don't use Level 3" message. You can imagine
this message gets to Qwest; Qwest already isn't using Level 3, so it's not going to
change anything for them. The advertisement gets up to the University of
Washington. The University of Washington is using Level 3, so they'll see this,
they'll decide not to use it anymore, and they'll route a different way through Sprint.
Of course, in BGP there's no "don't use this" message. So we need a way that's
going to work in the protocol today.
Let me show you how we can do that.
All the web server's ISP's operators have control over is which paths they
announce. And so they want a way to change their announcement that's going
to force other people to choose different paths.
So what do the baseline announcements look like? Well, in the base case
they're just going to announce that the path to the web server is just through
the web server's ISP. They announce that to their neighbors. Now AT&T and
Qwest have direct paths, they announce those on, the paths propagate through,
and then the University of Washington chooses this path through Level 3 to AT&T
to the web server's ISP.
Here's what the web server's operators can do to change the University of
Washington's path. They're going to announce instead that the path goes from
the web server's ISP to Level 3. They'll announce this to Qwest. Qwest doesn't
really care. They're just going to continue routing to the web server's ISP,
announce that on, and so on.
Similarly, AT&T doesn't really care. They're just going to keep routing to their
neighbor the web server's ISP.
Here's the key. BGP has built-in loop prevention. When you get an
announcement, you look. If you're already in the path, you reject that route. So
when AT&T makes this announcement to Level 3, Level 3 is going to inspect the
route, see that they're in it, and reject it to avoid causing a loop. So suddenly
Level 3 doesn't have this route anymore.
This means that the University of Washington isn't going to be getting this route
from Level 3 anymore so they're going to have to look for other routes. They're
given another route from their neighbor, Sprint, so they're going to choose the
route through Sprint instead.
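Here's a toy model of that loop-prevention trick, sometimes called BGP poisoning; the AS names and flat path lists are simplifications for illustration, not a real BGP implementation:

```python
# Toy model of BGP loop prevention: an AS rejects any route whose AS path
# already contains its own name.

def accepts(receiving_as, as_path):
    return receiving_as not in as_path

# Baseline announcement of the web server's prefix, as Level 3 hears it
# from AT&T:
print(accepts("Level3", ["ATT", "WebISP"]))  # True -- route accepted

# Poisoned announcement: the origin inserts Level3 into its own path
# ("WebISP Level3 WebISP"), so after AT&T prepends itself, Level 3 sees
# itself in the path and drops the route:
print(accepts("Level3", ["ATT", "WebISP", "Level3", "WebISP"]))  # False

# With no route via Level 3, the University of Washington falls back to
# the announcement it heard from Sprint.
```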
>>: [inaudible]
>> Ethan Katz-Bassett: So that's one strand of research I'm doing [laughter].
>>: Okay
>> Ethan Katz-Bassett: So the -- I mean, it's similar to the argument about
spoofing. We're doing this in a controlled manner. In ongoing work I'm
working -- I might as well cut over to the next slide since that's what the next
slides say -- I'm working to show how you can do this safely.
You can imagine that as we redesign BGP to maybe give you some of these
security mechanisms, you won't be able to do things like this. We could also think
about how to redesign it to let you make these "don't use this" announcements
better. This particular strand of research is about what we can do today to
simulate those "don't use this" messages, and this is a technique for doing that.
>>: Go back one slide. So isn't, for example, Qwest -- let's see. Maybe not
Qwest. If Qwest has extensive links with AT&T and Sprint, they might say, oh,
I'll send over traffic [inaudible].
>>: [inaudible]
>> Ethan Katz-Bassett: So these routes -- I maybe should have used a slightly
different notation. Each of the routes that you announce is tied to the particular
prefix that it's going for. So it's just going to announce this prefix for the address
block that includes the web server, basically its own address block.
So it won't be saying to get to any Level 3 address, send it here. Each of these
paths is associated with addresses. So the Level 3 addresses will still be
associated with the announcements that Level 3 is making. These particular
announcements are just for this block of addresses that web server's ISP owns.
>>: But L3 can still service the link through GBLX, right, in your example? I
mean, you're taking money away from L3 and giving it to Sprint, effectively, on
the end. There are peer wars where people could end up trying to harm each
other's businesses by sending out these loops saying, oh, I thought your router
was down so I just -- you know, it was only a few hours [inaudible] sorry about
that. Because now you have to have proof that the failure actually exists.
>> Ethan Katz-Bassett: Sure. So that's one of the properties that we're trying to
demonstrate with this system, that you can do this in an understandable way
where you can show that there's a failure. You can imagine we can use our failure
location system to pin down a failure, and we also have techniques to identify
when failure no longer exists so that you can route back.
In this case Level 3 wouldn't be able to route through GBLX. Given this
announcement, they wouldn't be able to because they would be getting the same
announcement with the Level 3 loop from GBLX, but --
>>: [inaudible] route around AT&T through L3. Right?
>> Ethan Katz-Bassett: Sure. So our assumption is that in this case Level 3 isn't
doing anything to repair the problem because the problem is persisting. If they
had already selected that alternate route then we wouldn't have had to do this in
the first place.
And there are certainly issues around how long do you wait before doing this,
and we're sort of evaluating this in the ongoing work. You can imagine that the
distribution of failures says that if the failure is short because they're
reconverging to a new path that works through GBLX, then we'll let it go. If the
problem is persisting longer, you're not getting your traffic, you want to do this.
>>: What if the link is fixed? What do you do at that point in time, in the
scenario where L3 fixed the link?
>> Ethan Katz-Bassett: So there are these properties that you'd like if you're
using a system like this. So we're using this BGP loop prevention as our basic
mechanism, and we're starting to evaluate it now. But you want to get properties
out of it. Like, you want it to be predictable, you want to know when you can
revert back to your old path. And we do have techniques for doing that.
In that case what you can do is announce a less specific prefix without Level 3
inserted into it, and then I can continue to monitor over that less specific prefix
until the problem is resolved. Once it's resolved I can revert my announcement.
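That monitor-and-revert logic can be sketched like this, with hypothetical announce/withdraw/probe helpers and placeholder documentation prefixes:

```python
# Hypothetical sketch: announce an unpoisoned, less-specific covering
# prefix alongside the poisoned more-specific one, keep probing over the
# covering prefix, and withdraw the poison once Level 3 paths work again.

import time

LESS_SPECIFIC = "203.0.113.0/24"  # normal announcement, no poison
MORE_SPECIFIC = "203.0.113.0/25"  # poisoned; steers traffic off Level 3

def avoid_until_healed(announce, withdraw, probe_via_level3):
    announce(LESS_SPECIFIC, poison=None)
    announce(MORE_SPECIFIC, poison="Level3")
    while not probe_via_level3(LESS_SPECIFIC):  # paths through Level 3 OK?
        time.sleep(60)
    withdraw(MORE_SPECIFIC)  # failure healed: revert to normal routing
```

The covering prefix serves two purposes at once: singly-homed customers behind Level 3 stay connected, and it gives a path to keep testing whether the failure has healed.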
>>: [inaudible]
>> Ethan Katz-Bassett: So I'm not claiming that what I just showed on the
previous slide is non-disruptive. What I'm claiming is that we have techniques
that we're evaluating that go along with this loop prevention. That's just the basic
mechanism. But you can do other things to make it less disruptive, and that's
what we're evaluating right now.
So in that particular example you don't want to disconnect people who are singly
homed, but if I announce a less specific prefix that doesn't have Level 3 inserted
in it, then those singly-homed customers are still going to be able to use that
less specific prefix. So if Level 3 starts working, or if some routes through
Level 3 work, they'll be able to use those, but people who aren't stuck behind
Level 3 and whose traffic is currently failing, like the University of Washington,
can explore other alternatives over the more specific route.
>>: So I see you're trying to use this mechanism to deal with traffic failure. I
mean, today many web servers already have different locations, and the purpose
is to fix traffic failure. So if I already have [inaudible] why do I still need to
introduce this kind of sophisticated mechanism that could potentially cause
more harm than benefit?
>> Ethan Katz-Bassett: Sure. So if you have alternate data centers that have
working paths, you should redirect there. This is for cases when you don't have
an alternate -- that's essentially the equivalent of choosing an alternate path out
through a different provider: just go to a different data center.
This is for cases when you can't do that. The cases when you can't do that are if
their routing to all of your data centers is similar -- you can imagine that the
University of Washington has these multiple providers. They're likely going to
route towards all of your data centers through the same provider. If that
provider is not working, you need a way to get them to go to a different provider,
or else their paths to all of your data centers are going to fail.
Similarly, there are some people who don't necessarily have access to these
multiple locations around the world, they only have a few, and we want ways to
enable them to get around the failures, too.
>>: Well, but people who have -- I think the space that people [inaudible] it's
people who are small enough that they don't have geographically distributed data
centers but large enough that they have control over their access network and
they're making these advertisements. Is there such a person anywhere?
>> Ethan Katz-Bassett: I can say that -- we've just recently started actually
experimenting with this, and the first people we talked to were Amazon because
they're local, and they thought that it seemed like a reasonable thing to do in
certain cases. There were problems that they've experienced where they can
imagine trying to do something like this. And they certainly have multiple data
centers.
>>: Yeah. There's something about this that sort of smells like router digital
[inaudible] or something, and I'm wondering if -- I mean, if you look at spam,
there's a group of people who I think jokingly proposed, oh, we should just write
our own computer viruses that go out and infect computers and fix the
vulnerabilities so the bad guys won't have control over them. That's one strand
of research. But there's another strand of research of, well, why don't we
generate blacklists of known bad hosts that ISPs can opt into. It's less sexy
because it doesn't have this sort of other aspect to it, but it seems like something
that would cause far less damage and would be more accepted by the operator
community is to say we're going to set up a service where we keep a realtime list
of routers that we believe to be bad and then you can subscribe to that and say,
oh, looks like L3 is down according to the service, I'll route around it, as opposed
to globally advertising this L3 route without their permission.
>> Ethan Katz-Bassett: So I believe that the approaches are complementary.
We're not going to have a world immediately where everybody subscribes to
such a service. There's not such a service available yet. Even once we do, it's
going to be hard to get everybody to opt into that. And the web server is still
going to want ways to service customers who maybe haven't subscribed to that
service.
What we're trying to do in this work is evaluate how effective the technique is. If
we can demonstrate that it's not that disruptive, which is what we're trying to
demonstrate, then maybe it's something that people will more likely use.
It's definitely always going to be like a fallback. This isn't your first line of how
you deal with problems. But there are problems right now that aren't being
addressed, that are lasting for hours, days. There are networks you can call up
who, even if you yell at them, are not going to fix it. There are going to be
networks who you don't know who to call, who are going to be clueless when you
do call them, and you want ways to be able to deal with these problems, and
that's what this is trying to address.
I don't think that it's the only solution if you're dealing with outages. You want to
direct to other data centers, you maybe want to offer a service like this where
the University of Washington can learn the information themselves. I think that all
these approaches complement each other.
>>: I buy that there are clueless ISPs and we have to be able to deal with the
fact that they're not going to fix the problem. But in this diagram imagine that it's
AT&T who was having the failure. L3 has an incentive to serve their customers
as best as possible by routing around AT&T if they're having a failure. So it
seems like the right incentive here is to have each ISP be proactively sort of
probing and recognizing that one of its -- one of the people that's advertising the
route isn't following through and actually routing as advertised. And so then
here's an alternative. And it seems like that sort of feels like the right way to fix
the problem, as opposed to the web server that's trying to police every ISP in the
world on behalf of its customers.
>> Ethan Katz-Bassett: Level 3, though, would have to police every route
from every one of its, you know, upstreams to everyone on the other side of it.
And it's not necessarily clear that it has a great way of knowing what traffic
should be going through at any point in time.
If it stops getting traffic one way, it doesn't know if that's because somebody
changed routes. If it is sending the traffic, it's not entirely clear. I mean, then
everybody would have to be continuously probing in each direction to detect
those problems. Otherwise, the web server's ISP is going to need a way to
signal to Level 3. That's not going to happen any time --
>>: But what you're proposing is probing anyway, just in a different place, which
is to say every web server.
>> Ethan Katz-Bassett: Right. Which already -- but that's the sort of probing that
already happens. You can already passively monitor your traffic, identify when
you have dropouts in clients.
I agree that there are other approaches that are maybe nicer ways of dealing
with the problems, but any of those is going to require additional infrastructure,
require additional people to buy in, and I just don't think that those are going to
be available tomorrow, whereas I've already experimented with this, I've been
able to do this and change paths like on the actual internet, and so this is sort of
a solution they can use.
I think that work towards proposing how we might want to redesign protocols or
layers we might want to add to address these problems in different ways is -- I
mean, it's definitely important to do that, and I've done a little bit of that as well.
I think that we also need to solve the problems today, and that's where this piece
fits in.
So I think through that process I ended up talking about the three main
properties that we want. But essentially the point is that the loop prevention is the
basic mechanism. We have other techniques to try to get some of these nice
properties that you want, like not cutting off people who already have working
paths, successfully rerouting people who don't, and understanding when to start
doing this and when to stop.
We had this statistic from Hubble before that 20 percent of outages last for over
10 hours. So that's 600 minutes of unavailability. If we were able to locate one
of these problems and generate an automatic response to it within a few minutes,
we've decreased the amount of unavailability by two orders of magnitude. And
that's really the vision that I want to leave you with.
Should I head to the conclusion slide?
>> Ratul Mahajan: Just, yeah, wrap up.
>> Ethan Katz-Bassett: Okay. The internet is hugely important, so we really
need it to be reliable. I started out by talking about the fact that the internet has
suboptimal performance and availability, and that we need a two-pronged
approach to addressing this.
I talked about two systems to illustrate that approach. First, we need better tools
for operators. Traceroute has been the most-used tool for troubleshooting these
problems, but there's this fundamental limitation of not giving you reverse path
information.
I built a reverse traceroute system to address that problem. It's really essential to
understanding the internet.
Second, you need systems that can use these tools to automatically fix
problems, and I illustrated how we can start to do that by identifying, isolating and
routing around failures.
All my systems are designed to work in today's internet, so it's really about
getting as much information as you can by using the protocols in novel ways.
Because they work in today's internet, operators are able to directly apply the
work, and I've presented both pieces of this work at both RIPE and NANOG,
which are the big European and North American network operator conferences.
RIPE and Google are now working to deploy these systems and to start using
them to improve the availability and performance of the internet.
Thanks. I'm happy to take questions.
[applause]
>>: So you focused around the network as the cause of [inaudible] do you have
any numbers about what portion of sluggishness or [inaudible] or do you just
see service failures?
>> Ethan Katz-Bassett: It's not something that I've looked at. The focus of my
work has been on these interdomain routing issues. Certainly there are
problems at both ends, and you need to be addressing both of them. I guess
because this has been the focus of my work, I'm not sure how many problems fall
into the other domain versus this one. There's certainly plenty of problems in this
domain for people to try to solve and plenty in the other.
>>: So one of the things that people argue [inaudible] is that the internet is sort
of shrinking.
>> Ethan Katz-Bassett: Yeah, definitely.
>>: So can you comment on this trend in terms of your work?
>> Ethan Katz-Bassett: So there's been some interesting work on that. So on
the one hand the internet is shrinking for people like Comcast and Google and
probably Microsoft as they try to get more direct connections, things like that.
It's also -- if you look at the median path length, I don't think that that has changed
that much, and that's a lot because the internet is also expanding into, say,
developing regions and things like that where they're not going to have this
really rich connectivity.
So any particular path may have gotten shorter, but paths distributed across the
internet haven't necessarily gotten that much shorter. I actually think that this
work fits pretty well into both those trends.
One issue with this sort of collapsing topology of the internet that you brought up
is that you sort of have less visibility into it. It's hard to measure these peering
links -- you can only measure these peering links if you have a vantage point
either down here or down here. If your vantage point is out here, you're not
going to be able to traverse that. Just because of policy, they're not going to
export that link.
One thing that we show in the papers is that you can start observing more of those
links if you actually have reverse traceroute information, because you can sort of
inject packets into the middle of the path using spoofing and observe more of
those links. So I think that this helps give visibility into parts of the internet that
maybe are hard to see with traditional techniques.
I think the other interesting implication of both these trends, the collapsing
topology and the expansion into developing regions, is that both of those tend in
certain ways to make the internet a little bit more brittle. If people are getting rid
of their tier 1 providers, or having fewer tier 1 providers because they're
connecting directly, suddenly they don't have this essentially backup path where
they can fail over to the tier 1, and it's much easier to start partitioning pieces of
the internet off. So I think that it becomes important to have these debugging tools
and tools that give visibility into those parts of the internet because even as they
improve performance, they don't necessarily improve the resilience of the
network.
>>: I'm curious just to hear, what are your next steps [inaudible] or something
else? What's your longer term plan?
>> Ethan Katz-Bassett: I missed one part of the question. What domain did you
mention?
>>: [inaudible]
>> Ethan Katz-Bassett: So I'm interested in looking at sort of both -- both
continuing in the space and then also expanding into new spaces. So I
talked today about how you can use reverse traceroute to get visibility into sort of
the round-trip path. So I only had visibility into half; now I can observe the
round-trip path and start --
>>: [inaudible]
>> Ethan Katz-Bassett: So all I was going to say is that we're starting to build up
these reverse path maps where you can observe suddenly the routing of the
entire internet, and that lets you -- it's not so much that it's reverse traceroute,
it's that it gives you a view into the complete routing of the internet. And so I
want to start using this new visibility where now I can see how
everybody routes to me. I can start seeing the routing decisions that they're
making and reason about what paths they weren't choosing that they might have
selected, and sort of use this to answer questions about what policies are available
on the internet, what policies are in use, and what the causes of particular routing
changes are.
I'm also interested -- I think I'm going to continue on in the internet measurement
space as well. I have a number of ideas along those lines. But I also want to use my
experience in interdomain routing to start addressing other questions.
So as we move towards having more services, more data in the cloud, we're
going to have this need for high availability and high performance, and one part
of that is these troubleshooting tools that I've talked about. But I'm also
interested in looking at other questions that arise in that space. What type of
network knowledge do cloud services need in order to optimize their use of the
internet, how can we provide that information to them. They're not necessarily
going to have the internet expertise because they're sort of outsourcing that to
the cloud infrastructure. How can we give them access to that information in
ways that they can reason about.
There are also a number of emerging challenges in the internet. So the
collapsing topology is one. Mobile -- I'm interested in sort of starting out from this
interdomain space and expanding into those other spaces using what I know
here to start -- mobility challenges, for instance, are going to require addressing
interdomain problems, but they're also going to require addressing problems at
other layers. I'm interested in moving into those spaces.
>> Ratul Mahajan: Thank you.
[applause]