>> Ratul Mahajan: Well, good morning. Thank you, everybody, for coming for what should be hopefully the last talk of this marathon talk week. Today we have Ethan from the University of Washington, just across the lake. Ethan is going to actually tell us about, like -- his work is very interesting. He's solved some, like, longstanding problems with the internet, but in a very practical and usable manner. And he's going to tell us about that. But apart from being a really good networking researcher, I can testify he's also a really good skier. He spent a few years being just a ski bum somewhere in Utah or Montana. I forget. I've seen him ski. He's very good. So that's kind of another reason for us to, you know, get him here. He can improve our average skiing level. >> Ethan Katz-Bassett: I'm also happy to take questions about skiing if there's anything particular that you're working on. I was a ski instructor so I can pass some of that on. Thanks, Ratul. Thanks, everybody, for coming. So I'm going to try to convince you that the internet is much less reliable than it needs to be but that we can make substantial progress in improving both its availability and performance without requiring any architectural changes, and I'm just going to talk about a set of distributed systems that I've built to get us towards that. So I want to start out just with a simple survey. Can you raise your hand if you've used email or used the internet since you got to this room? All right. So many people. How about in the last hour? More people. Today? Probably just about everyone. Anyone who didn't raise their hand is probably just busy checking right now so they didn't actually notice that I was asking. So my point is that we rely on the internet. And I think even though we all know that we depend on the internet, it's still astonishing when you see some of the actual numbers. So over a quarter of the world's population is online today. In the time that it's going to take me to deliver this talk, humankind collectively is going to spend over a millennium on Facebook. And actually in the 10 minutes that it took me to put together the statistics for this slide Kanye West sent three new tweets. There's also a lot of money involved. So companies like Google, like Microsoft are making billions and billions of dollars in ad revenue. And because we depend on the internet so much we really need it to be reliable. So traditionally we've used the internet for applications like email, like web browsing. E-commerce sites have depended on it being reliable in order to deliver their business. Now we're accessing data and applications in the cloud, we're using the internet to watch movies, to make phone calls, and we really want our cloud data to be reliably available, we want our movies and phone calls to be smooth. But, you know, I use Skype to try to do meetings with my advisors sometimes and we've all experienced having to, like, hang up multiple times before it works well. And in the future we're moving towards a world where we're even more dependent on the internet. We're going to have more of our applications in the cloud, we're going to have most of our communication over the internet and we're going to want to run critical services across the internet. For these applications we really need to be able to rely on it. We need high availability and we need good performance. So we're just asking, you know, are we ready for that?
I'm going to show you that the current answer is no but that we can take some concrete steps towards improving it. So I'm going to start out just by talking about the current situation. So actually we all track internet availability every day. Seems like usually most things work. Sometimes they don't work. But each of us is only checking from -- we're only seeing a small slice. We're only checking from work, we're checking from home, and there's actually hundreds of thousands of networks that we could be checking. Places like Facebook, Microsoft have people working full time to try to maintain availability, but they still struggle with these problems sometimes so we can really ask what about everybody else? I decided to assess reachability on an internet scale continuously and I built a system called Hubble that checked over 100,000 networks around the world and it checked them from a few dozen computers around the world. Each of these vantage points would periodically check to see if its route to all the other networks still worked. All of these networks were available from all of the vantage points at the beginning of the study. And as I started making these checks I realized that there were problems happening all the time all around the world. At any point in time there are locations all around the world that are reachable from some places and not reachable from others. So this is a snapshot that was generated automatically by my system, Hubble. Each of the balloons here represents a network that had previously been reachable from all of the Hubble vantage points. It had become unreachable from some of the vantage points while remaining reachable from others. So we know, since it's still reachable from some, that it's not just that the other network is offline. And these partial outages shouldn't really happen. There's some router configuration that's keeping some of the vantage points from being able to reach the destination. So in this case the darker the balloon, the longer the problem had lasted. So the lightest color is the yellow. Those have been going on for up to eight hours. The darkest color is the maroon. Those have been going on for more than a day. So there's these really long-term problems that aren't being fixed, and if you're actually behind one of those problems you're basically stuck. >>: How do you know it's a router configuration issue and not just not having enough redundancy at the level of policy [inaudible] and you could just disconnect it without any kind of configuration error? >> Ethan Katz-Bassett: So we know that it's not just a hardware failure because hardware failures can only result in this partial reachability if there's a network partition, and there's not a partition because both the one that could reach and the one that couldn't can talk to the University of Washington. It could be policy and the policy is encoded in the router configuration. So in that case it's not necessarily a misconfiguration but it's an issue with the configuration of routers. In other words, if they reconfigured the routers, changed the policy, then it would become reachable. So I'm not getting super specific about whether it's a misconfiguration or not, but it has something to do with configuration. It's not just a fiber cut. >>: So are these all permanent nodes? >> Ethan Katz-Bassett: We were checking from the Planet Lab nodes to every network that we could find a pingable address in. >>: So these [inaudible] are just to some network address in sort of matched network?
>> Ethan Katz-Bassett: Yeah. >>: And how do you check the [inaudible]? >> Ethan Katz-Bassett: So we -- the vantage points each periodically ping to see if the destination is still responsive. If it isn't responsive from one of the vantage points for a couple pings in a row then we trigger trace routes from all of the vantage points and we see whether the trace route reaches the network. So we don't care if it reaches the particular end host. We just see if it reaches the network. So in all these cases there were some trace routes from some vantage points that reach the network and in most cases also reached the end host. And some of the other trace routes weren't even getting to that edge host, so they weren't getting the whole way across the internet to the network. >>: So are these problems close to the edge network or it's somewhere in the middle of it? >> Ethan Katz-Bassett: So many of them are fairly close to the edge, but they're not right at the edge, because if it actually reaches the last network then we don't consider it a problem anymore. Some were in the middle. I honestly don't have a great characterization of how many were in the middle and how many in edge, but I think the instinct is that a lot of them were towards the edges, right? >>: Yeah, I think once [inaudible] edge because the [inaudible]. >> Ethan Katz-Bassett: So we tried to eliminate as many of those as possible by using every available technique for mapping to the correct AS, and it's certainly possible that some of these cases were issues like that, but some of them when you actually looked at them were obviously happening closer to the middle of the internet. >>: So one thing's not clicking. So ->> Ethan Katz-Bassett: Can I just finish one more sentence on the next question quickly? There's some statistics in the paper on how many looked like they were in the last but one AS, and I don't remember the numbers off the top of my head, but it wasn't all of them. And so those other ones, it couldn't be this boundary issue at least at the edge AS because it wasn't even looking like it was in the previous one. Yes? Sorry. >>: Okay. So there's two possibly -- I'm trying to understand which two cases here, which is these networks you can't reach, is it the case that some -- that at least one of your sample points can't reach them or that all of your sample points can't? >> Ethan Katz-Bassett: So it's the case that some can and some can't. And we had a threshold for a couple not being able to because if just one did, then it could just be a problem with that particular vantage point. >>: In every case at least someone else could -- someone in the sample set could reach it? >> Ethan Katz-Bassett: Yes. We tracked the ones that are completely unreachable. But for the ones on this map, they're partially reachable, because those are the more interesting cases. >>: Okay >> Ethan Katz-Bassett: So in fact it's not just these outages that matter. Sometimes you can actually get to something, but it's just too slow to use. And providers are aware and concerned about these slow routes. So I have up here statistics. Google published a paper in 2009, and one of the stats they had in there was that 40 percent of the connections from Google clients had a round-trip network latency of over 400 milliseconds, and that's just the network latency from the client to the Google data center and then from the data center back to the client. It doesn't include any processing at the data center and at the client. 
And this is even though Google has data centers all around the world and tries to direct each client to the one that gives it the lowest round-trip time. So to give you some idea of 400 milliseconds, it's as if the traffic was leaving the client here in Seattle, going around the equator to the data center and then leaving the data center and going around the equator again before it goes to the client. These slow routes have a direct impact on business. I have there a statistic from Yahoo. An extra 400 milliseconds of latency led to an additional 5 to 9 percent of people clicking away from the page before it even loaded, and this leads to a direct loss in business. So these huge companies have slow routes even though they have a big monetary incentive to have high performance. They need a way to try to understand these problems, and they currently lack the means to guarantee performance. Yes? >>: So 400 milliseconds is really big. Based on the [inaudible] we did from our data center to the edge of the network, almost [inaudible]. I suspect most of these are due to the latency of the last mile rather than the internet. If the latency of the last mile [inaudible] >> Ethan Katz-Bassett: Sure. So they had some further statistics on that, and that's not -- I'm not trying to say that that result is what I would say is bad routing. It's just trying to point at the problem that they're concerned about. And they had some more statistics in the paper when they tried to break that down. It's a good paper. Worth the read. They found that I think -- so that statistic was on client connections. They also look at prefixes, and I think they found that 20 percent of prefixes had what they considered to be an inflated latency that was addressable through routing means. So the current internet doesn't deliver reliable availability or performance. That means that we can't really use it for critical services. And if you talk to operators at major providers you'll learn that they end up spending a lot of their time trying to explain and fix these problems. There's been lots of interesting and important work done on how we could redesign the internet protocols to improve this situation, and in fact I've worked on some of that. But we're also really stuck with the internet that we have now. We need to improve the internet that we've got because there's so many entrenched interests, we want to improve the services that exist on it now and then enable future services. So I'm going to show you how we can actually do that without requiring any changes to the protocols or the equipment. There are two components to my approach. First, I build tools that provide better information, and operators can use these to troubleshoot the problems. Once you have these tools, I build systems that use these new tools as well as the existing ones to try to help move us towards an internet that can actually fix itself. The problems that I address really arise from routing. They're difficult to detect, diagnose, and repair. I'm going to explain why later in more detail, but for now there's two big issues. First, the internet is one of the most complicated systems that humankind has ever built, and it has emergent behavior. And, second, it's a huge collection of these independent networks. They have to work together to enable routing, but the operators don't necessarily share information. So to improve this situation we have to address a range of research challenges.
We have to be able to understand topology and what routes are available, we have to monitor availability and develop ways to improve it, and we have to understand performance and troubleshoot problems. I've worked on systems to address all of these challenges. Today I'm just going to talk about a few of them, just for time reasons. I'm going to talk about my reverse traceroute system which lets you measure the path that anybody is using to get to you, and I'm going to give a brief example of how you can use it to troubleshoot performance problems, and then I'm going to talk about a set of techniques that I'm working on to address availability problems. So that gives the basic outline of the talk. In order to understand the work, I'm going to first have to walk through some background on how routing works and what tools are currently available just so you can understand how the problems arise, and it's going to be very high level. Most of you are already going to know it. Then I'm going to talk about reverse traceroute. It's a measurement system that I built that provides information that you really need to troubleshoot the types of problems that I'm talking about. I'm going to show how operators can start using it to debug performance problems. And then the final part, as I said, I'm going to turn to availability problems. I'm going to talk about some systems that I'm working on now that use reverse traceroute to make steps towards automatic remediation of outages. So I want to make sure we have the same basic view of what the internet looks like. It's a federation of autonomous networks. Each of these clouds is one network. For a client up at the University of Washington to talk to the web server down at the bottom there, traffic is going to traverse some set of these networks. Within each network it's going to traverse some set of routers. Now, many of these routes tend to be stable for pretty long periods of time, even days and days is fairly common. And I'm going to give an overview of how these routes are established. So BGP is the border gateway protocol. It's the internet protocol to establish these internetwork routes. The way it works is the web server's ISP will advertise the set of addresses that it has available and it will tell this to its neighbors. So now AT&T has learned the route and now AT&T has a direct route to the web server's ISP. It will advertise that to its neighbors. Now Level 3 and Sprint have routes through AT&T to the web server's ISP, to the web server. They're going to advertise those on. So now the University of Washington has learned two routes. Now, one key with BGP is that routes generally aren't chosen for performance reasons. BGP lets each network use its own arbitrary policy to make decisions, and it's an opaque policy so it's generally going to be based on business concerns. So in this case let's say the University of Washington has a better deal from Level 3, so it chooses that route. So now traffic from the client at University of Washington to the web server is going to follow some path like this. >>: And the whole path is transmitted out? So UW can actually make the decision and say it doesn't want any traffic to be through Sprint? >> Ethan Katz-Bassett: It sees a network-level path, yes. >>: [inaudible] >> Ethan Katz-Bassett: It receives the message that essentially looks like that and it has which addresses are at the end of that path. Exactly.
>>: But at some point [inaudible] >> Ethan Katz-Bassett: It's not necessarily enforced, but it will follow that path. But at least it's seeing the route that it thinks it's selecting. >>: I guess you don't have a good case here, but if you have a case where depending on whether it went through L3 or Sprint that there was a third network, then if it went through Sprint it wouldn't go through a third network, but if it went through L3 it would go through the third network, it sees enough information that it could decide to send packets through Sprint ->> Ethan Katz-Bassett: Yeah. And that's a great example of the opaque policy. It's allowed to use whatever policy it wants to make its decisions. >>: [inaudible] >> Ethan Katz-Bassett: It could make decisions based on what appeared in the path. Generally not, it doesn't, but it could. >>: But there is sufficient information in the protocol that it could, right? >> Ethan Katz-Bassett: Yes. >>: But it can't actually specify -- the only [inaudible] >> Ethan Katz-Bassett: It can't enforce the path that it takes, but it could select the path based on where it thought it was going to go and try to avoid these paths if it wanted to. So another consequence of this policy-based routing is that paths are often asymmetric. The way that that can arise is the University of Washington will advertise what addresses it has available with a path that just says, hey, send it to the University of Washington. It tells that to its neighbors, so now Level 3 and Sprint have direct paths. They're going to advertise that onto AT&T, and, again, AT&T can use whatever policy it wants. Let's say it prefers the path through Sprint. It's going to select that path, advertise that onto the web server's ISP, and the web server's ISP has a path through AT&T and Sprint to the University of Washington. So now traffic back to the client will go this way and we have asymmetric paths. So portions of the paths may overlap, but generally they're often at least partly asymmetric. Now, each of those clouds that I had represents a network, and in the case of big ones like AT&T or Sprint, they actually span multiple continents. Each of those networks is made up of routers, so we have to go from this internetwork path to a router-level path. Here we have the University of Washington's path that it selected, which went through Level 3 and AT&T to the web server's ISP. I have a representation of the Level 3 network shown in red, AT&T shown in blue. One possible path through this network looks like this. So Level 3 carries it across the country and then gives it to AT&T. That might be the shortest path through the network. But, again, the individual networks aren't going to necessarily be optimizing performance across the internet so it actually might be more likely that the traffic goes like this. Level 3 is going to give it to AT&T as soon as it can because it incurs costs for carrying it. This way AT&T will incur the cost of carrying it across the country. So the point is that the end-to-end performance that you get and also the availability you get depend on the interdomain routing decisions and also the intranetwork routing decisions. This means that the types of problems that I'm talking about are problems with routing. So a performance problem might be a geographically circuitous route like this and an availability problem might be when you have a route advertised in BGP but when you actually send traffic along it, it doesn't reach the destination.
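To make the policy-driven, asymmetric behavior described above concrete, here is a small illustrative sketch in Python. It is a toy model only: the topology, the numeric preferences standing in for opaque business policy, and the function names are all invented for the example, and real BGP involves far more machinery (sessions, prefixes, route attributes, withdrawals).

    # Toy BGP-style propagation: each network learns AS paths from its neighbors and
    # picks one by its own local preference, not by length or latency.
    NEIGHBORS = {
        "UW":     ["Level3", "Sprint"],
        "Level3": ["UW", "ATT"],
        "Sprint": ["UW", "ATT"],
        "ATT":    ["Level3", "Sprint", "WebISP"],
        "WebISP": ["ATT"],
    }

    # Lower number = more preferred next hop; this stands in for opaque business policy.
    PREFERENCE = {
        "UW":  {"Level3": 0, "Sprint": 1},               # UW has a better deal with Level 3
        "ATT": {"Sprint": 0, "Level3": 1, "WebISP": 2},  # AT&T hands traffic to Sprint when it can
    }

    def best_paths(origin):
        """Flood an advertisement from `origin`; each AS keeps its preferred AS path to it."""
        chosen = {origin: [origin]}
        frontier = [origin]
        while frontier:
            nxt = []
            for asn in frontier:
                for nbr in NEIGHBORS[asn]:
                    candidate = [nbr] + chosen[asn]       # nbr would reach origin via asn
                    if nbr in candidate[1:]:              # loop prevention, as in real BGP
                        continue
                    prefs = PREFERENCE.get(nbr, {})
                    current = chosen.get(nbr)
                    if current is None or prefs.get(candidate[1], 9) < prefs.get(current[1], 9):
                        chosen[nbr] = candidate
                        if nbr not in nxt:
                            nxt.append(nbr)
            frontier = nxt
        return chosen

    forward = best_paths("WebISP")["UW"]     # UW's chosen path toward the web server's ISP
    reverse = best_paths("UW")["WebISP"]     # the web server ISP's chosen path back to UW
    print("UW -> web server:", " -> ".join(forward))   # UW -> Level3 -> ATT -> WebISP
    print("web server -> UW:", " -> ".join(reverse))   # WebISP -> ATT -> Sprint -> UW

Running it picks Level 3 on the way out and Sprint on the way back, reproducing the asymmetry in the slide: each network's private preference, not performance, determines the route.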
So we're going to need to drastically improve the performance and availability to enable these future applications we want, so we really need a way to understand routing to try to troubleshoot the problems that arise. One attribute of the internet's protocols is that they don't naturally expose that much information. So you might have good visibility into your own local network, but it's hard to tell what's happening in other networks. Furthermore, those other networks don't really have an incentive to tell you what's going on. So if Sprint has a problem, they don't really have an incentive to inform AT&T exactly what the problem is. That means that we need tools that can measure the routes given the restrictions of what the protocols and other networks are going to make available. So traceroute is one such tool. Probably many of you have used it. Traceroute lets you measure the path from the computer that's running it to any network. And it's widely used by operators and researchers. I'm going to give a basic overview of how traceroute works. So all internet packets have a time-to-live, or TTL, field. The source sets the value. Each router along the path decrements that by one. If it hits zero the router is going to throw out the packet and it's going to source an error message and send it back to its original source. So traceroute is going to manipulate these TTL values to build up the path. Here we're trying to measure the path from S to D, and the first thing traceroute is going to do is send out a packet and set the TTL to 1. It's going to get to some first router, say F1, F1 decrements the TTL value to 0, it's going to discard the packet and generate an error message. It's going to send this error message back to S and when S gets the error message it can look at it, see that it came from F1, and now it knows that F1 is the first hop on the path. Then traceroute's just going to continue this process. It's going to send out a packet now with TTL 2, it gets to F1, F1 decrements it to 1, sends it on, it gets to some F2, F2 decrements it to 0, throws out the packet, sources an error message, sends it back to S. Now S knows the path goes through F2. And we can just continue this until we've built up the whole path. So if operators at the web server think that they're having problems with some client they can use traceroute to measure the path to the client. But as we said, routes on the internet are generally asymmetric, and the problem could be on the path from the client back to the web server, and traceroute doesn't give any visibility into this path. First of all, the client's computer is going to be the one setting the TTL value and it's going to set it to a normal value. So there's not going to be these error messages generated, and the web server operators don't have any way to control that. And then even if there were error messages generated, they're gonna go back to the client, and so the web server isn't going to observe them. Now I'm going to give a real example of how that limitation affects operations. I'm then going to show how we can address this limitation with my reverse traceroute system without requiring any modifications to the existing protocols or equipment. I started out with some statistics about how many slow routes Google is observing, and they built a system to try to address this. So this is an example taken from their paper about that system.
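Before turning to the Google example, here is a bare-bones sketch, in Python, of the forward traceroute TTL loop just described. It is illustrative rather than production code: it needs raw-socket (root) privileges, the port and timeout values are arbitrary choices for the sketch, and it ignores per-hop retries and routers that never answer.

    import socket

    def traceroute(dest_name, max_hops=30, port=33434, timeout=2.0):
        """Minimal traceroute: raise the TTL one hop at a time and record who complains."""
        dest_addr = socket.gethostbyname(dest_name)
        hops = []
        for ttl in range(1, max_hops + 1):
            recv = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
            send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
            recv.settimeout(timeout)
            recv.bind(("", port))
            send.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)  # the router at hop `ttl` will expire it
            send.sendto(b"", (dest_addr, port))
            try:
                _, (addr, _) = recv.recvfrom(512)   # ICMP Time Exceeded (or Port Unreachable at the end)
            except socket.timeout:
                addr = None                         # this hop did not answer
            finally:
                send.close()
                recv.close()
            hops.append(addr)
            if addr == dest_addr:                   # the destination itself replied, so we're done
                break
        return hops

    # Example (requires root): print(traceroute("example.com"))

Note that every error message here comes back to the machine running the loop, which is exactly why this only ever reveals the forward direction.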
In this example they have clients in a set of networks in Taiwan, shown in red there, and they were experiencing really slow network latencies of 500 milliseconds. So the main thing that Google does to try to have good performance to their clients is replicate their content at data centers around the world and send the client to a nearby one. So the first problem that their system looks for is maybe actually the client's being served by a distant data center. And they can actually look at the logs at the data center. In this case they verified in fact the client was being served by a nearby data center in Asia. Another potential problem is that the route from the data center to the client could be really indirect. But their system was able to verify with traceroute that actually it was taking a really direct path from the data center to the client. Now, the communication is two-way. Paths are likely asymmetric. So at this point they assume that the problem is on the reverse path from the client back to the data center, but they don't have any visibility into that path. So what they concluded was to more precisely troubleshoot these problems they need the ability to gather information about the reverse path back from the clients to Google. In fact, there's been a widespread recognition of the need for this reverse path information. So I attended a network operators troubleshooting tutorial, and I got this quote from that tutorial. The number one go-to tool for troubleshooting is traceroute, but asymmetric paths are the No. 1 plague of traceroute, and that's because the reverse path itself is completely invisible. So you can get a good idea about a problem when it's on the forward path, but you're really flying blind if it's on the reverse path. That gives the motivation for our reverse traceroute system. We want a way to measure the path back to us from the client without requiring control of the client, we want it to require no new functionality from routers, and we want it to use only existing internet protocols. So this is my reverse traceroute system that I built, and now I'm going to walk through the basics of how it works. So here's the setting. We want to measure the path from D back to S. We control S, and we can install software there, but we don't control D. So the first thing that you might ask is why not just install software at D. And actually a few years ago we had extensive conversations both with Microsoft and with Google to try to convince them to do something like this and they really weren't willing to do that to get the information that way. They wanted a way to work with what was available and enable operators to fix the problems without requiring the participation of the client at all. So, again, what can we do in this setting? Well, we can measure the forward traceroute, but the path is likely asymmetric so it's not clear that that gives us that much information about the reverse path. The next thing that you might think to do is use other computers around the world that we actually do control, and that's what we're going to do. We do have access to other vantage points. In my case I use Planet Lab, which is a test bed that gives you access to computers at a couple hundred universities around the world. These aren't going to directly give us the path from D, but they're going to give us a view that's unavailable from S and also we can combine the view of the multiple vantage points to get a better view than is available from any one of them.
There's only a limited number of them, so what can we do with them? Well, we can start issuing traceroutes from them to destinations around the world and we can build up like an atlas of what routing looks like. And one set of paths that we can measure are the paths from our vantage points back to S. Our idea is to use the fact that internet routing is generally destination-based. What this means is the path from anyplace in the network depends only on where the packet is in the network and where it's going. It doesn't depend on where it came from. So this means that if we intersect one of these blue paths that we know about, we can assume that we follow it the rest of the way back to S. So these paths aren't going to give us the full path from D, but they're going to give us a baseline of paths that we can use to bootstrap off of. >>: So how do you deal with [inaudible] >> Ethan Katz-Bassett: So there are a couple options there. We can use or we do use [inaudible] techniques to try to expose the multiple options that are available, and then when we return a path you can return the multiple options. The other thing is that people have relied on traceroute even given these sorts of multipath limitations now, and so I think just building a reverse traceroute system that gives you a view into at least one of the paths is a great starting point, and now we're starting to look into techniques to expose the multiple paths. >>: [inaudible] >> Ethan Katz-Bassett: They don't make that available. And most -- Akamai has pretty good visibility, but even then, they can't necessarily measure the path from arbitrary routers where they don't have a presence, and anyone else who isn't Akamai is going to have many fewer points of presence than that. So they're not even going to have that view that Akamai does. >>: I guess my assertion is Akamai probably has better edge coverage than Planet Lab. >> Ethan Katz-Bassett: Akamai certainly has better edge coverage than Planet Lab. But I'm going to show you how we can use Planet Lab's vantage points, the limited number of them, and actually bootstrap them to get much better coverage than you'd think we might be able to. So now we need a way to build from D until we hit one of these paths that we know about, and actually destination-based routing is going to help us do that also. If we're able to learn that the path from D goes through some router R1, because of destination-based routing, we only need to measure the path from R1 back to S. We can ignore D. So this means that we can build the path back incrementally. If we learn that it goes from R1 through R2 and R3, now we just have to measure from R3 back to S. If we learn it goes from R3 to R4, now we've hit a path we know about, we can assume that we follow V1's path back to S and we've successfully stitched together the path incrementally. So I left out one key piece, right? How do we get each of these segments that we're stitching together? We need something in the internet that's going to let us measure a path a hop at a time. And IP options are going to give us that. IP options are a built-in part of the internet protocol. You can enable them on any packet, but we're the first to recognize that you can use them to build a reverse traceroute system. The first IP option that we're gonna use is the record route option. Record route allocates space in the header of the packet for up to 9 IP addresses. The first 9 routers along the path will record their IP address.
Now, the key with IP options is that they're going to be copied over into the destination's response. So if we reach D within 8 slots the remaining slots will fill up on the reverse path. So to give an example of that, let's say that the paths look like this. With record route we can get these yellow hops. We get five hops on the forward path, the destination as the sixth hop, and then the remaining three slots fill up on the reverse path. So this is great. So if we're near enough to use record route we can get a few hops at a time and stitch together the path. The problem is that the average path on the internet is about 15 hops in each direction, 30 round-trip. And so what are we going to do if we're too far away to use this? >>: [inaudible] >> Ethan Katz-Bassett: So the best place to look for statistics on that is there was a SIGCOMM 2008 paper called DisCarte by Rob Sherwood. I don't know the exact number off the top of my head, but essentially they found that there was widespread enough coverage that we started thinking about what we could use record route for to enable some of these techniques. There certainly might be some networks where we don't have good visibility into them. The key is that because we're just trying to measure back until we hit one of these traceroute paths that we already know about, we often don't have to measure that many hops. >>: So in your study did you find that the majority of the paths support this option? >> Ethan Katz-Bassett: There are some networks that do, there are some that don't. Lots and lots do. I think that -- I believe that the coverage just in terms of networks that support it is over 50 percent. But I'm not -- I don't know the exact number off the top of my head. I'm going to have some results later on in terms of accuracy, and you can sort of infer from those that we're getting good coverage out of our techniques because we are able to measure the paths using the combination of techniques. >>: [inaudible] does the packet just get copied with the existing recorded routes or do you lose all the information ->> Ethan Katz-Bassett: So there are different ->>: [inaudible] failure mode is that you get nothing or you get a silent [inaudible] >> Ethan Katz-Bassett: Sure. So there are potentially both behaviors. There are some places where the packet will be filtered and dropped, there are other places where it might be forwarded and just you won't see that particular hop. Similarly, with traceroute there are some routers that don't source these TTL expired packets and so you get some blind hops there. So in the DisCarte paper they have some statistics quantifying how many routers have each of these different behaviors, and that's sort of what we built off of for that. In our paper we have some statistics on these reasons, like how many routes might be hidden from -- how many hops might be hidden from traceroute, how many might be hidden from record route. >>: The destination operating system network stack has to respect record route and understand on the response that it has to fill in the previous receipt? >> Ethan Katz-Bassett: Yes. >>: Okay. And that's why [inaudible] does Windows support it? >> Ethan Katz-Bassett: I believe so. We don't tend to probe to end hosts. We probe to the router right before the end host because that last hop will be symmetric anyway. But it's a built-in part of the IP protocol so in general everyone has -- our machines we tend to test on Linux, and it's certainly supported there.
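The arithmetic behind the nine record route slots is worth spelling out: if the forward path consumes f slots and the destination stamps one more, only 9 - f - 1 reverse hops can be recorded, which is why the technique only helps when you are close. A tiny sketch of that accounting (assuming, as in the slide example, that the destination also fills a slot; some hosts and routers do not):

    RR_SLOTS = 9  # the IP record route option has room for at most nine addresses

    def reverse_hops_learned(forward_hops):
        """How many reverse-path routers a record route ping can reveal, assuming the
        destination also stamps one slot (as in the example: 5 forward + D + 3 reverse)."""
        if forward_hops + 1 > RR_SLOTS:
            return 0            # too far away: the slots fill up before the reply turns around
        return RR_SLOTS - forward_hops - 1

    for f in (5, 7, 8, 9, 15):
        print(f"{f} forward hops -> {reverse_hops_learned(f)} reverse hops recorded")

With a typical 15-hop forward path you learn nothing, which is what motivates the spoofed probes from nearby vantage points described next.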
So this is going to work great if we're close. In the case when we are not close, that's where we use the fact that we have distributed vantage points. What we can do is find a vantage point that's close to the destination. So let's say in this case that V3 is close to the destination. It's going to send out the record route probe, but it's going to spoof, which means that it's going to set the source address to S's address. So we're the first to use this source spoofing to sort of separate out the forward and reverse path and measure reverse path information. So what's going to happen here, V3 is going to send the record route probe to the destination. When it gets there some number of the slots will be filled out, let's say seven, D will add itself in, and then it's going to reply, but it thinks that the packet came from S so it's going to reply back to S. When the packet gets back to S, S looks at it, and it's now learned that R1 is on the reverse path. So this source spoofing is going to let us use any vantage point that's close to the destination regardless of whether or not it's the particular source that we're trying to measure to. And now because of destination-based routing we can just repeat this process. Now we need to find a vantage point that's close to R1. Let's say that V2 is close to R1. It's going to send the packet, spoof, and claim that it came from S. When it gets back to S we've now learned that R2 and R3 are on the reverse path. Now we have to measure from R3 back. And suppose we don't have a vantage point that's close to R3. We might still have some idea about what the set of possible next hops are, and that's because, as I said before, we're measuring paths from our vantage points to destinations around the world to build up an atlas of routing. So we can look at this atlas, and let's say previously we've observed a path like this, we've observed another path like this. This means that R4 and R5 are good candidates for where this path might go next. So now we need a way, given one of these likely candidates, to verify which one is right. And we're going to use another IP option to do that. We're going to use the time stamp option. The time stamp option lets you specify up to four IP addresses in the packet, and if routers with those IP addresses are traversed in the sequence that you specify, then they should record their time stamps. The key here is that it's ordered. So the way we're going to use this is S is going to send out a time stamp probe to R3 and it's going to ask first for R3's time stamp and then for R4's time stamp. Because they're ordered, if R4 is traversed on the path going to R3 it won't record a time stamp because R3 isn't recorded yet. So this packet is going to get to R3, R3 is going to record a time stamp, and then it's going to reply back to S. When it gets back to S, in this case let's say that R4 has recorded a time stamp, so we now know that R4 was on the reverse path. So we don't care about the particular value of the time stamp. We're just using it as a check of whether or not that router was traversed. So we now know the path goes through R4, we've intersected a path that we know about. At this point we're going to assume, with destination-based routing, that we follow V1's path back and we've successfully stitched together the reverse path. So I've now shown you how we can address this key limitation of traceroute and measure the reverse path even though we don't have control of the destination. There are a number of keys -- yes?
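Putting the pieces together, the stitching loop looks roughly like the following simulation in Python. Everything here is schematic: the topology, the atlas contents, and `spoofed_record_route_segment` are made-up stand-ins for the real probing machinery, and the timestamp adjacency check, the filtering workarounds, and the fallback of assuming a hop symmetric are omitted.

    # Simulated reverse traceroute stitching. REVERSE_NEXT_HOP plays the role of the
    # internet's destination-based routing toward S: each hop's next hop depends only
    # on where the packet is, not on where it came from.
    REVERSE_NEXT_HOP = {
        "D": "R1", "R1": "R2", "R2": "R3", "R3": "R4", "R4": "R5", "R5": "S",
    }

    # Paths we already measured with ordinary traceroute from vantage points back to S.
    ATLAS = {
        "V1": ["V1", "R4", "R5", "S"],
    }

    def spoofed_record_route_segment(hop, slots=2):
        """Pretend a vantage point near `hop` sent a record route probe spoofed as S:
        we learn the first few hops on the path from `hop` back toward S."""
        segment, current = [], hop
        while current != "S" and len(segment) < slots:
            current = REVERSE_NEXT_HOP[current]
            segment.append(current)
        return segment

    def stitch_reverse_path(dst="D", src="S"):
        path = [dst]
        while path[-1] != src:
            current = path[-1]
            # Destination-based routing: once we hit a hop on a known path to S,
            # we can assume it follows that path the rest of the way back.
            hit = next((p for p in ATLAS.values() if current in p), None)
            if hit is not None:
                path += hit[hit.index(current) + 1:]
                break
            # Otherwise have a nearby vantage point measure a few more hops.
            path += spoofed_record_route_segment(current)
        return path

    print(" <- ".join(reversed(stitch_reverse_path())))  # S <- R5 <- R4 <- R3 <- R2 <- R1 <- D

In the real system, where a record route probe cannot reach the hop, the ordered timestamp option is used to confirm one of the atlas's candidate next hops instead of measuring a fresh segment.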
>>: A lot of access networks won't allow packets out if they have source addresses that are not on the access network to prevent that kind of spoofing. Do you find that that's a problem with a lot of the, for example, Planet Lab networks? >> Ethan Katz-Bassett: So it's certainly the case that spoofed probes are filtered some places. The most complete study of that is from MIT; they have this Spoofer project where you can check from your machine whether or not you can spoof. They found about 25 to maybe a third of -- 25 percent to 33 percent of sources can spoof, and we found similarly with Planet Lab. The key here is that you just need -- if a particular vantage point can spoof, it can generally spoof as everybody. And so you just need a set that can spoof that are well distributed and you can bootstrap the whole system. You don't need any particular vantage point to be able to spoof. >>: So you just throw out immediately any vantage point that happens to not be able to ->> Ethan Katz-Bassett: Yeah. We check which ones can and retain the ones that can, and I think it's on the next slide, we use about 90 right now. So we're using these multiple vantage points to get a view that's not available from any particular one, and we're using spoofing to let us select the vantage point that's in the best position to make any particular measurement. You need to address a number of things to actually make this technique work well. So, for instance, some routers don't process these IP options correctly. But we're able to account for this incorrect processing and still get useful information in many cases. We have techniques for doing that. Similarly, some ISPs filter the probe packets, and we have techniques to avoid some of these filters and improve the coverage of the system. Finally, you want to be able to measure without having to issue too many probes, and we have techniques that let us intelligently select particular vantage points so that we can sort of maximize the return on any particular probe that we send. So reverse traceroute is a real deployed system. I think as of this week we're issuing around 100,000 reverse traceroutes a day. In our current deployment we're using about 90 Planet Lab and Measurement Lab sites as our vantage points. We're using Planet Lab sites as the sources, so this means that you can now measure paths back from anywhere to Planet Lab. You can actually try out the system with a few resources at revtr.cs.washington.edu. Now, when none of our vantage points is able to measure a hop from a particular destination using any of our techniques we have to make an informed guess for that particular hop and then measure back from that one hop. So this means that the coverage that we get is actually tied to the set of vantage points that we have. What we've found in evaluating the system is that you actually need the full set of techniques to get good coverage. So if you don't use time stamp or you don't use record route or you don't use spoofing, the coverage will immediately drop enough that you no longer have a useful system. The overhead of the system is reasonable both in terms of the amount of time that it takes to make a measurement and in terms of the number of probes. It's about 10 times that of traceroute right now. I'm going to go into accuracy in a little bit more detail. So operator -- yes?
>>: I was wondering, do you have a sense of the shelf life of reverse traceroute with respect to where the trends are as to what the routers are doing, like [inaudible] is there a longitudinal sense at all? >> Ethan Katz-Bassett: So there was a study done a number of years ago where they concluded that IP options weren't supported enough to be useful. In recent years DisCarte used record route, we've used time stamp and record route, and we've seen better usage than they did previously. So I think that the trends are in our favor. Also, I've talked about this system to many operators and have presented at operators conferences, and lots of them are signing up to use the system, and I think that hopefully that should lead to sort of good will within the operations community where best practices are to enable these options so that you can make these measurements, just like best practices now are to support traceroute in your network so that other people can debug problems that they're having. So because operators are used to using traceroute, we'd like our system to return results equivalent to what you'd get if you actually had access to the destination and were able to issue a direct traceroute. Now, we don't have access to ground truth information for the internet as a whole, but we can actually evaluate this in a controlled setting. So we're going to look at paths between Planet Lab nodes, and we're going to compare a reverse traceroute to a direct traceroute. For the reverse traceroute we're going to assume that we don't have control of the destination and use the technique that I just outlined. For the direct traceroute we're just going to log into the destination and issue a traceroute, and we compare those two paths. This graph shows the result of that comparison. On the X axis we have the fraction of the direct traceroute hops that you see. And the graph is a CCDF, so any particular point on the graph says that -- it gives you the fraction of the paths that had at least that fraction of accuracy. Right now without access to reverse traceroute information operators fall back to assuming that the forward path and the reverse path are the same, so the black line gives how well you do if you just assume the path is symmetric. In the median case you get 38 percent of the hops right. So you're really not getting that much information. The red line is our system. In the median case you get 87 percent of the hops correct. And so this whole shaded region is the benefit of our system. >>: [inaudible] >> Ethan Katz-Bassett: We use traceroute servers as part of our system. You can't issue the options probes from them, but we actually use them to build up our traceroute atlas. >>: [inaudible] >> Ethan Katz-Bassett: Yeah, we have results on that too in the paper. So the line -- if you sort of find the traceroute or the path that's the closest to the one that you're trying to measure, the line is somewhere in the middle there. And we also use those in evaluating accuracy. Just as we did Planet Lab to Planet Lab paths, we did traceroute server to Planet Lab paths and it was almost the same. The line is a couple percentage points different. >>: What are the sources of [inaudible]? >> Ethan Katz-Bassett: So the two main sources are that -- so an ideal line, just to make sure everyone's on the same page here on the graph, is a vertical line at X equals 1. So what this graph shows is that in most cases we get most of the hops right. There are two main reasons we don't get that vertical line at X equals 1.
The first is that routers have multiple IP addresses and it can be hard to identify if two IP addresses are on the same router. So one technique might be observing one IP address. Another technique might be observing another IP address. So in those cases we're essentially undercounting. We're getting it right, but we just don't realize it. The second main source is that when we're not able to measure, when we don't have coverage of a particular network, we have to assume that that particular hop is symmetric and measure from that one hop back. So this means that the accuracy of our system will improve as the coverage grows. And in fact I've helped Google deploy a version of reverse traceroute now and they have access to more vantage points than I do, and so as you would expect, they get better coverage than I do on my system. >>: So what about the AS level if this is the IP level? >> Ethan Katz-Bassett: So I -- in the paper we evaluate PoP-level accuracy, so sort of in between those two levels. In the PoP-level accuracy I believe in the median case we get 95 percent of the PoPs right. Maybe -- actually it might even be 100 percent of the PoPs right. It's in that range. I helped Google do an evaluation on AS-level accuracy and essentially what you see there is that in cases where you have the coverage and are able to make measurements from every hop, I think it's like 98 percent of the time you get the AS path right, and most of the time when you don't it's sort of off by 1, so it might be the sort of cases that you're talking about where you're unclear on the boundary which AS you're actually in. So I've shown you now how we can build a working reverse traceroute system. I started out with this problem that Google was having with their clients in Taiwan, and now I'm going to walk through an example of how you can use reverse traceroute to address problems like that. So in this case I measured a round-trip latency of 150 milliseconds from a Planet Lab node in Florida to a network in Seattle, and this is about two or three times what you might expect. Now, the network is in Seattle, but it wasn't affiliated with the University of Washington. It was just sort of random that it happened to be there. With traditional tools, what an operator would do is issue a forward traceroute and see if the path is indirect. So that's the first thing that I did. So if you look at this path, you see that it starts in Florida, goes up to D.C., comes back down from D.C. to Florida and then continues across the country. So this detour up to D.C. and then back down to Florida is explaining about half of the inflation. But with traditional tools you only get a partial explanation. It seems like there were also problems on the reverse path, but you really have no idea exactly what's going on or who to call to get it fixed. With reverse traceroute we can actually measure that path. So that's what I did. So if you look at the reverse path you see it starts in Seattle, goes down from Seattle to L.A., comes back up from L.A. to Seattle and then it continues across the country via Chicago. Now, if you look closer at this path -- so this detour down to L.A. explains the rest of the inflation. If you look closer at the path, you see what's happening. It's starting out in Seattle in Internap, it's going down on Internap to L.A. In L.A. it's switching over to TransitRail, coming back up to Seattle on TransitRail. So it's not necessarily a misconfiguration.
It could actually be that TransitRail and Internap only connect down in L.A. and so the traffic has to go down there. But I was able to verify with a traceroute from the University of Washington that actually they peer here in Seattle as well. And I was able to talk to operators at these ISPs. They verified that this was an unintentional configuration. So it might be that maybe it was out-of-date information or something like that. This type of misconfiguration is a common cause of routing problems, and it's effectively operator error. There's this manual reconfiguration going on all the time, it's easy to get it wrong, it's hard to understand the consequences of any change you make, it's hard to understand the interactions of your change with the rest of the network. Without access to both directions of the path you really have no way to understand what's going on or who to contact to fix it. With access to reverse traceroute even I, as a grad student, was able to talk to people who could fix this path. >>: Talking to the people seems to be the problem, right? How do you call up the ISP and get the right person on the phone? That's a much harder problem. >> Ethan Katz-Bassett: It's a different type of problem, right? It would be hard for me as a grad student to go around debugging all the problems of the internet using this tool. But actually in the Google system that I used to sort of motivate this section of the talk, that's actually what they're trying to get: they want something to point at the problem, and their assumption is once they can point at the problem they can do something about it. They have the clout and the connections to call them up, or they can change the routes, or they can install more capacity, or whatever it is. So the main thing that they wanted was a way of classifying what the problem was and where it is. >>: It seems like you have to [inaudible] showing one network to the entire rest of the world every place where [inaudible]. >> Ethan Katz-Bassett: So that's essentially ->>: -- generate the list of, you know, here's all the networks that have got to be fixed. >> Ethan Katz-Bassett: Sure. Sure. So this was me debugging one particular path. The paper that they have at IMC 2009 is essentially that system, and the problem is that that system needs reverse path information as part of its input and it didn't have it. So that's how they do it. And presumably people at Microsoft do something similar where you want to prioritize where the problems are and come up with solutions that can fix multiple problems and things like that. So the lack of reverse path information has been identified as a key limitation by Google, by operators, by researchers. You really had no way to get this information before, so anyone who cared about internet routing was really only seeing half of the information. I've now built this reverse traceroute system that lets you measure the path to you without controlling the destination, and I gave an example of how you can use it to debug performance problems. Google is now starting to use it in that context. Reverse traceroute is also useful in a range of other settings, including availability problems. And I'm now going to talk about how we can ->>: Before you move on, I'm curious, this whole tool was developed as a sort of end user, from an end user's point of view, right? Why don't the network operators use something -- because they have access to [inaudible] if they really want to inject something there, it's a lot easier for them to do.
Is there no incentive for them to do it? >> Ethan Katz-Bassett: They have access to routers in their own network, but they don't have access to routers in other people's networks, and people in the other networks don't have much incentive to give them much information about what's going on in their routers. >>: If this is a big problem, why don't the network operators join together and solve it? >> Ethan Katz-Bassett: So public traceroute servers, which [inaudible] talked about, are sort of a step towards that. Networks make available a website where you can go and issue traceroutes from their vantage points, but there are only about 1,000 of those, and I think it's just hard to get to the type of coverage you really need to debug problems across the internet. One direction of my work that I'm not really going to talk about today is we're starting to look at what are small tweaks you might make to the IP protocols, so maybe a new IP option that we can add or something like that that would make some of these measurements easier and expose some of that information. And we're hoping that we can eventually get buy-in. But it's a long, slow process to get any of that -- to have that work at the scale that they need, you need to get standards changed, you need to get router vendors to buy in, you need to get everybody to upgrade their routers to support it. So it's not just going to happen immediately, and this stuff I'm talking about today is really how can we start solving these problems right now using what's available right now. Did you have another? >>: Just a quick followup. You talk about the incentives, right? It seems there's already enough incentive for the different ISPs to support, for example, traceroute, like the standard traceroute. >> Ethan Katz-Bassett: Right. >>: Obviously the reverse one would be harder, right? But in some sense -- I mean, if there was no incentive whatsoever to collaborate in some sense ->> Ethan Katz-Bassett: There's sort of this tension where I think the operators are often willing to cooperate, but then there's also parts of the company that ->>: [inaudible] >> Ethan Katz-Bassett: Yeah. So, I mean, I presented all of this work at operators conferences before I even wrote the paper to try to get buy-in, and I think there's been dozens of operators signed up to try to use the system, things like that. So I think that once they see that you can start making these measurements -- you know, it's sort of like what that quote said. People viewed it as a fundamental limitation that just this information was invisible. Once you start showing them how you can actually see it, the hope is that we can get better support, get people buying in and then maybe eventually they will move towards easier solutions for providing this type of information. So now I'm going to talk about how we can start to build a system around reverse traceroute that's going to automatically identify, locate, and avoid failures. And it's all work in progress. So I'm going to sketch out where we are now and where we're going with it. I'm going to start out with an example of what operators do now. So this is a quote from an email to the Outages network operators mailing list that I subscribe to. This operator thought that he was seeing problems in Level 3's network in D.C., didn't really know what was going on, and so sent an email to the mailing list to ask people for help. And in the email he included a traceroute. So this is essentially his traceroute.
It started out at his home in Verizon in Baltimore, it went down to D.C., switched over to Level 3's network and then trailed off. So when the traceroute trails off, it means that he didn't get a response back from the destination, so it looks like maybe there's a problem in Level 3 in D.C. But assuming that the destination is configured to respond, it could be that the destination isn't receiving the probe. It could also be the destination is receiving the probe, but the operator isn't receiving back the response. And with traceroute alone you can't tell the difference between these cases. So what do the people on the mailing list do to help this guy? Well, they start issuing their own traceroutes and sending them to the mailing list. So here's another one of those. So this traceroute starts out again in Verizon but in D.C. in this case, switches over to Level 3, goes from Level 3 in D.C. to Level 3 in Chicago to Level 3 in Denver and then it trails off. So, again, first it looked like the problem was in D.C., now it looks like the problem is in Denver. Did the problem move? Are they two separate problems? Are the problems on the reverse path? With traceroute alone you really don't know which one of these cases you're in. And actually nobody on the mailing list ever indicated that they had any idea what was going on. They just sort of each sent out their traceroutes for a while and then, you know, a couple hours later the problem resolved and nobody ever really gave an explanation. So we'd really like to do better there. So in ongoing work, I'm taking a three-step approach to that. First, you need to identify that a problem is going on. So that's monitoring and figuring out the problems. Once you know that there is a problem you want to be able to locate where in the network the failure is. Once you've located the failure, you'd like to be able to reroute traffic around the failure even if the failure is in a network that you don't control. So as we saw on that Hubble outages map earlier, many of these problems are lasting for hours, sometimes even days. So if we give operators better tools they could likely fix those problems much faster. But the outage is still going to persist until it's fixed. And so even with better tools, operators are going to be fixing these problems on a human time scale. We'd like to be fixing the problems within minutes or seconds instead. So once we have these pieces for monitoring, location, and remediation we can start putting them together as building blocks for a system that's automatically going to repair failures, and that's what I'm going to talk about. First I want to give a characterization of the duration of outages. So these are results from a two-month study of monitoring a diverse set of targets around the internet. This graph shows the duration of the problems that we observed. So on the X axis we have duration in seconds, and it's on a log scale. It's a PDF. So any particular point on the graph shows the probability of an outage of a particular duration. There's two points that I'd like for you to get from this. First, most of the outages are short. So 90 percent of the problems last less than 700 seconds. Second point: The distribution has a long heavy tail. The remaining 10 percent of outages that lasted over 700 seconds accounted for 40 percent of the unavailability. So this means that to make a substantial impact on improving availability we really need to address problems across time scales.
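As a back-of-the-envelope illustration of why that tail matters, the synthetic numbers below (chosen only to roughly mimic the 90 percent / 700-second split quoted above, and not taken from the study) show how a small fraction of long outages can account for a large share of the total downtime:

    # Synthetic outage durations (seconds): many short events plus a small long tail.
    short = [100] * 450 + [300] * 450     # 90% of the events, all well under 700 s
    long_ = [900] * 80 + [2400] * 20      # the 10% tail of longer outages

    total_downtime = sum(short) + sum(long_)
    tail_share = sum(long_) / total_downtime
    tail_fraction = len(long_) / (len(long_) + len(short))
    print(f"{tail_fraction:.0%} of outages contribute {tail_share:.0%} of the total unavailability")

Under these made-up durations the 10 percent of long outages account for 40 percent of the downtime, which is the shape of the argument for attacking both the short convergence events and the long-lived black holes.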
>>: Do you have a sense of why the graph, the shape is so weird? Why does it go up and then go down? >> Ethan Katz-Bassett: As opposed to continuing to go up? >>: As opposed to [inaudible] [laughter] >> Ethan Katz-Bassett: Oh, you mean why don't you see more short -- like much shorter problems? >>: Well, it seems there's a preference for a certain duration window. >> Ethan Katz-Bassett: Yeah, so it could be that -- it could have something to do with how long routing protocols take to converge. It could also potentially be an artifact of the particular way we measured. I don't know. The only point that I'm trying to make with the graph is that there is a long heavy tail. I'm not going to talk about the short-term problems today. With the short-term problems -- so basically we take a two-pronged approach to addressing these outages, the short ones versus the long ones, and it's partly because you can use different solutions. You can imagine that on long-term outages you can actually get a human involved to try to repair them. That's not going to work on short-term outages because people aren't going to be able to react quickly enough. And it's also partly because the underlying cause is different: it's already known that there are these short-term outages after failures during routing protocol convergence. And we actually worked on a system called Consensus Routing to address those. So in that case it was pretty well understood that it was a protocol problem. People had already done the measurements, and we just built the system to address it. Today I'm going to focus just on the long-term problems. In that case they're less understood, so the first thing that we had to do was come up with measurement techniques to identify characteristics of them. So identifying the characteristics of those long-term problems is the first goal here. We want to monitor problems across the internet on a global scale continuously. I'm going to refer to these long-term problems as black holes: there are BGP paths available, but when you send traffic, it persistently doesn't reach the destination. So this is the system Hubble that I showed the snapshot from earlier. Hubble monitors networks around the world continuously. It's the system that I built, but just to save time, I'm not going to go into any details of how it works. I'm just going to talk about how many of these long-lasting black holes we saw over time. So we did a two-year measurement study with Hubble. We monitored 92 percent of the edge ISPs in the internet. What I mean by an edge ISP is one that hosts clients, hosts services, but isn't acting as a provider for any other ISPs. During that time we saw over one and a half million black holes. Each of these cases is a network that had previously been reachable from all of the Hubble vantage points, became unreachable from some for a period of time, and then became reachable from all of them again. This graph shows the duration of these problems. The X axis is duration in hours on a log scale. It's a CCDF, so a particular point shows what fraction of the problems lasted for at least some number of hours. So the purple circle there shows that 20 percent of the problems lasted for over 10 hours. We were really astounded at how many of these problems there were, how long they lasted, how many of the networks they affected.
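As a rough illustration of how a CCDF like that is computed and read, here is a minimal Python sketch; the durations in it are made-up placeholders, not Hubble's measurements, and the 10-hour threshold just mirrors the example from the talk.

    import numpy as np

    # Made-up outage durations in hours (placeholders, not Hubble data).
    durations_hours = np.array([0.2, 0.5, 1.0, 2.0, 3.0, 6.0, 12.0, 15.0, 30.0, 48.0])

    def ccdf(values):
        # For each sorted duration x, return the fraction of outages that
        # lasted at least x -- the complementary CDF.
        x = np.sort(values)
        frac_at_least = 1.0 - np.arange(len(x)) / len(x)
        return x, frac_at_least

    x, frac = ccdf(durations_hours)
    # Reading a point off the curve, like the 20-percent-over-10-hours example:
    print("fraction of outages lasting >= 10 hours:", np.mean(durations_hours >= 10.0))

On a log-scale X axis, a point (d, f) on this curve says that a fraction f of the outages lasted at least d hours, which is how the 20-percent, 10-hour point on the slide is read.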
So over two-thirds of the networks that we monitored experienced at least one of these problems. Dealing with them requires someone first noticing that the problem exists, then figuring out what's going on, then doing something about it. All of these steps take a long time with current tools. So, for example, last month one of CNN's websites was out for over a week. We'd really like to do better. The first thing we need to do is try to locate one of these problems. We saw in the Level 3 example from the Outages mailing list that it's hard to understand failures with current tools because it's hard to understand the output of those tools. I'm going to walk through what we can start to do to get better information there. So up here we have a destination D. It's at USC's Information Sciences Institute. X, Y, and Z are Hubble vantage points around the world. Previously they'd all been able to reach D. The system detected that there might be a problem, all of the vantage points tried to reach it, and what we see is that X can still reach the destination, while Y and Z no longer can. So what can we do to start understanding the failure? Well, the first thing we can do is just group together the paths. In this case we see that paths through the Cal State University network are still reaching the destination. Paths through Cox Communications are failing, with their last hop in Cox. So it looks like maybe there's a problem with Cox Communications. But actually it could be that the traffic is reaching the destination and it's failing on the reverse path back. Traceroute alone doesn't tell us which direction the failure is in, even if we have symmetric paths. Plus the paths might be asymmetric, so it's possible that the reverse path isn't even going through Cox. So with traditional tools we couldn't differentiate these cases. With reverse traceroute we can actually tell which of these cases we're in. The key is that reverse traceroute doesn't actually require the source to send any of the packets. We can send the packets instead from X. We know that X has a working path to D, so it's going to send a packet, spoofing as Z. Because it has a working path, the packet's going to reach D, and D is going to reply back to Z. So if D has a working path to Z, then Z should receive the response, and that's what we saw in this case. So now we can use reverse traceroute to measure that complete working path, and now we know the failure is on the forward path: Cox isn't forwarding traffic on to the destination. Now, we'd like to understand what happened to cause this failure. If we only make measurements after the failure starts, like the operators on the mailing list, we don't know what change triggered this. So what we do is build up a historical view of what routing looks like, so that when there is a failure, we can look at what routes looked like before the failure. One possibility is that the previous route didn't even go through Cox, the route changed over to go through Cox, and the Cox path never worked. In this case what we observed was that the previous route actually did go through Cox and it went through a router R. Now it's failing right before R. Y observed a similar path: it previously went through router R and now it's failing right before R. So it seems like there's a problem at router R. It's still advertising the path, but then the path's not working. We only know which router to blame because we're measuring this historical information.
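As a rough sketch of the two ideas just described -- the spoofed probe that isolates which direction a failure is in, and the comparison against a pre-failure path that points the finger at a router like R -- here is some illustrative Python. The addresses are placeholders, the probe assumes vantage points whose networks permit spoofed source addresses, and none of this is the actual Hubble or reverse traceroute code.

    # Requires scapy and root privileges; addresses are hypothetical placeholders.
    from scapy.all import IP, ICMP, send, sniff

    D = "203.0.113.10"   # destination
    Z = "198.51.100.7"   # vantage point that can no longer reach D

    def probe_from_x():
        # Runs on X, which still has a working forward path to D.  The echo
        # request carries Z's address as its source, so D's reply travels
        # the reverse path D -> Z.
        send(IP(src=Z, dst=D) / ICMP(), verbose=False)

    def reply_seen_at_z(timeout=10):
        # Runs on Z.  If the echo reply shows up, the reverse path D -> Z
        # works, so Z's failure must be on its forward path toward D.
        pkts = sniff(filter="icmp and src host " + D, timeout=timeout)
        return len(pkts) > 0

    def suspect_router(historical_path, failing_traceroute):
        # Compare today's failing traceroute against the path measured before
        # the failure: if the traceroute now dies at a hop on the old path,
        # suspect the hop that used to come next (router R in the example).
        if not failing_traceroute:
            return None
        last_hop = failing_traceroute[-1]
        if last_hop in historical_path:
            i = historical_path.index(last_hop)
            if i + 1 < len(historical_path):
                return historical_path[i + 1]
        return None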
And in fact in most cases we found that you need to have access to this historical information to understand one of these problems. So this is ongoing work. We're running the system continuously, and now I'm just going to give an overview of some of the preliminary results. >>: In the last example you can't really blame R, right? It could be some other router that [inaudible] the route that R had. If you're trying to do router-level localization, that seems not correct in a lot of cases. >> Ethan Katz-Bassett: So a few things. First of all, we know that R is still advertising some path because the traffic is still going to R. Second, we're not necessarily trying to say that R is responsible for the failure. What we're really trying to do is show paths that aren't working so we can start reasoning about other paths that are working, so we can route around it. We don't control Cox Communications, so maybe we can tell them, hey, it looks like there's a problem with R. But really what we want to know is that paths through Cox Communications aren't working now, so that we can start exploring other alternatives that are working that maybe we can use even though we don't control Cox. And I'll get into how we do that in a second. >>: I guess the question was, like, yeah, so I realize [inaudible] Cox has a problem, but what I'm trying to get at -- I'm just trying to understand: does telling Cox that R is the problem buy you anything more than telling Cox the source and the destination that aren't working? >> Ethan Katz-Bassett: You can imagine that we might observe other paths through Cox that do work. In this particular problem we didn't, but there might be other problems where we did. And so it seems like there's some value in saying that the set of paths that used to go through this particular router isn't working, to help narrow down where the problem is and also to let us reason about other paths through Cox that might be working, if we have an alternative of using a path, say, from a different peering point that isn't going to go through R. We're not trying to say anything definitive about R. We're just trying to come up with some cause that seems like it maybe explains the problem, so at least we can associate Y and Z with each other. If we didn't know that those previously went through R, then we wouldn't even have a way of associating Y's problem with Z's problem. I mean, it would look like they were in different parts of the network because they aren't overlapping at this point. >>: But doesn't your traceroute information tell you that -- those yellow paths that you see there, you can follow those with a traceroute up to a point but not up to R, right? >> Ethan Katz-Bassett: Exactly. That's a piece of information that you don't get if you're just using traceroute, if you haven't measured that -- >>: [inaudible] >> Ethan Katz-Bassett: And then you also don't know that the problem isn't on the reverse path. If you're just using traceroute, it could be that -- >>: [inaudible] >> Ethan Katz-Bassett: Yes. >>: I'm really questioning whether you need the historical information. It seems like if you go to Cox and you say, look, here are two paths, I can traceroute them each up to this point and no further, so these routers, wherever they're pointing to next on this path, they're not getting to -- that's a lot of information for Cox to be able to identify the next router. >> Ethan Katz-Bassett: Sure.
Let me give a slightly different example that maybe is more compelling, which we also observed. We have statistics in the paper. Imagine instead that these paths are going through different networks and both dying right before Cox. If we don't have access to information about what the routes looked like historically or what routes are being advertised, we don't know that those two are converging on the same point, and so there we won't have this common information about Cox. And so we might even try to select a path that bypasses those networks but goes through Cox, not suspecting in that case that it wouldn't work. >>: Okay. >> Ethan Katz-Bassett: It's less about -- we're not claiming that we're pinning down the exact cause. It's more about gathering as many types of information as we can to try to reason as best we can about what's going on. Because you really have such a limited view from outside. So we're running the system continuously now. Just to highlight one point that we've observed: more than two-thirds of these outages are partial. So we have some vantage points that can reach the destination, some that can't. And as I was saying before, these partial outages aren't just hardware failures. It's not just a backhoe cutting a path. There's some configuration that's keeping the sources that can't reach the destination from being able to find the paths that do work. So in these cases, identifying where the failure is and identifying these alternative paths that exist is a big win. If we actually can give this information to operators, they can get in touch with the networks that seem to be causing the problems and they can essentially yell at these people until they fix it. But we'd really like to be able to make these repairs faster without requiring the involvement of these other networks that you have to call up and yell at. Working paths seem to exist. They're not being found. It's a big opportunity. How can we find them automatically? In this case let's say that operators at the web server realize that they were having problems with some of their clients. We're going to look at the easy case first. Suppose the failure is on the forward path. In this case what the web server can do is just choose an alternate path that doesn't go through Level 3. So maybe it will route through Qwest instead. That avoids the failure. They can also direct the client to a different data center that has working paths. The harder case is when the failure is on the reverse path back to the web server. >>: Did you say the web server chooses the path or the web server's ISP? >> Ethan Katz-Bassett: Yeah, the web server's ISP. Sorry, I'm sort of -- >>: [inaudible] >> Ethan Katz-Bassett: Yeah. It's not the web server itself. And maybe I should have named the web server's ISP differently from the web server. So this is the harder case. The operators in the web server's ISP don't directly control which path the University of Washington is choosing, but they want a way to signal to the University of Washington to choose a different path that doesn't go through Level 3. They want to send something like a "don't use Level 3" message. You can imagine this message gets to Qwest; Qwest already isn't using Level 3, so it's not going to change anything for them. The advertisement gets up to the University of Washington. The University of Washington is using Level 3, so they'll see this, they'll decide not to use it anymore, and they'll route a different way, through Sprint. Of course, in BGP there's no "don't use this" message.
So we need a way that's going to work in the protocol today. Let me show you how we can do that. All the web server's ISP's operators have control over is which paths they announce. And so they want a way to change their announcement that's going to force other people to choose different paths. So what do the baseline announcements look like? Well, in the base case they're just going to announce that the path to the web server goes just through the web server's ISP. They announce that to their neighbors. Now AT&T and Qwest have direct paths, they announce those onward, the paths propagate through, and then the University of Washington chooses this path through Level 3 to AT&T to the web server's ISP. Here's what the web server's operators can do to change the University of Washington's path. They're going to announce instead that the path goes from the web server's ISP to Level 3. They'll announce this to Qwest. Qwest doesn't really care. They're just going to continue routing to the web server's ISP, announce that onward, and so on. Similarly, AT&T doesn't really care. They're just going to keep routing to their neighbor, the web server's ISP. Here's the key. BGP has built-in loop prevention. When you get an announcement, you look: if you're already in the path, you reject that route. So when AT&T makes this announcement to Level 3, Level 3 is going to inspect the route, see that they're in it, and reject it to avoid causing a loop. So suddenly Level 3 doesn't have this route anymore. This means that the University of Washington isn't going to be getting this route from Level 3 anymore, so they're going to have to look for other routes. They're given another route from their neighbor, Sprint, so they're going to choose the route through Sprint instead. >>: [inaudible] >> Ethan Katz-Bassett: So that's one strand of research I'm doing [laughter]. >>: Okay. >> Ethan Katz-Bassett: So the -- I mean, it's similar to the argument about spoofing. We're doing this in a controlled manner. In ongoing work -- I might as well cut over to the next slide since that's what the next slide says -- I'm working to show how you can do this safely. You can imagine that as we redesign BGP to maybe give you some of these security mechanisms, you won't be able to do things like this anymore. We could also think about how to redesign it to let you make these "don't use this" announcements in a better way. This particular strand of research is about what we can do today to simulate those "don't use this" messages, and this is a technique for doing that. >>: Go back one slide. So isn't -- so isn't, for example, Qwest -- let's see. Maybe not Qwest. If Qwest has extensive links with AT&T and Sprint, they might say, oh, I'll send over traffic [inaudible]. >>: [inaudible] >> Ethan Katz-Bassett: So these routes -- I maybe should have used a slightly different notation. Each of the routes that you announce is tied to the particular prefix that it's announced for. So it's just going to announce this prefix for the address block that includes the web server, basically its own address block. So it won't be saying, to get to any Level 3 address, send it here. Each of these paths is associated with addresses. So the Level 3 addresses will still be associated with the announcements that Level 3 is making. These particular announcements are just for this block of addresses that the web server's ISP owns. >>: But L3 can still service the link through GBLX, right, in your example? I mean, you're taking money away from L3 and giving it to Sprint, effectively, in the end.
There are peer wars where people could end up trying to harm each other's businesses by sending out these loops, saying, oh, I thought your router was down so I just -- you know, it was only a few hours [inaudible] sorry about that. Because now you have to have proof that the failure actually exists. >> Ethan Katz-Bassett: Sure. So that's one of the properties that we're trying to demonstrate with this system: that you can do this in an understandable way where you can show that there's a failure. You can imagine we can use our failure location system to pin down a failure, and we also have techniques to identify when the failure no longer exists so that you can route back. In this case Level 3 wouldn't be able to route through GBLX. Given this announcement, they wouldn't be able to, because they would be getting the same announcement, with Level 3 in it -- with the loop -- from GBLX, but -- >>: [inaudible] route around AT&T through L3. Right? >> Ethan Katz-Bassett: Sure. So our assumption is that in this case Level 3 isn't doing anything to repair the problem, because the problem is persisting. If they had already selected that alternate route then we wouldn't have had to do this in the first place. And there are certainly issues around how long you wait before doing this, and we're evaluating this in the ongoing work. You can imagine that the distribution of failures says that if the failure is short because they're reconverging to a new path that works through GBLX, then we'll let it go. If the problem is persisting longer, you're not getting your traffic, and you want to do this. >>: What if the link is fixed? What do you do at that point in time in the scenario where L3 fixed the link? >> Ethan Katz-Bassett: So there are these properties that you'd like if you're using a system like this. We're using this BGP loop prevention as our basic mechanism, and we're starting to evaluate it now. But you want to get properties out of it. Like, you want it to be predictable, you want to know when you can revert back to your old path. And we do have techniques for doing that. In that case what you can do is announce a less specific prefix without Level 3 inserted into it, and then I can continue to monitor over that less specific prefix until the problem is resolved. Once it's resolved I can revert my announcement. >>: [inaudible] >> Ethan Katz-Bassett: So I'm not claiming that what I just showed on the previous slide is non-disruptive. What I'm claiming is that we have techniques that we're evaluating that go along with this loop prevention. That's just the basic mechanism. But you can do other things to make it less disruptive, and that's what we're evaluating right now. So in that particular example you don't want to disconnect people who are singly homed, but if I announce a less specific prefix that doesn't have Level 3 inserted in it, then those singly-homed customers are still going to be able to use that less specific prefix. So if Level 3 starts working or if some routes through Level 3 work, they'll be able to use those, but people who aren't stuck behind Level 3 and whose traffic is currently failing, like the University of Washington, can explore other alternatives over the more specific route. >>: So I see you're trying to use this mechanism to deal with traffic failure. I mean, today many web servers already have different locations, and the purpose of that is to deal with traffic failures.
So if I already have [inaudible] why do I still need to introduce this kind of sophisticated mechanism that could potentially cause more harm than benefit? >> Ethan Katz-Bassett: Sure. So if you have alternate data centers that have working paths, you should redirect there. This is for cases when you don't have an alternative -- redirecting is essentially the equivalent of choosing a path out through a different provider. Just go to a different data center. This is for cases when you can't do that. The cases when you can't do that are when the routing to all of your data centers is similar -- you can imagine that the University of Washington has these multiple providers, and they're likely going to route towards all of your data centers through the same provider. If that provider is not working, you need a way to get them to go to a different provider, or else their paths to all of your data centers are going to fail. Similarly, there are some people who don't necessarily have access to these multiple locations around the world, they only have a few, and we want ways to enable them to get around the failures, too. >>: Well, but people who have -- I think the space that people [inaudible] it's people who are small enough that they don't have geographically distributed data centers but large enough that they have control over their access network and they're making these advertisements. Is there such a person anywhere? >> Ethan Katz-Bassett: I can say that -- we've just recently started actually experimenting with this, and the first people we talked to were Amazon because they're local, and they thought that it seemed like a reasonable thing to do in certain cases. There were problems that they've experienced where they can imagine trying to do something like this. And they certainly have multiple data centers. >>: Yeah. There's something about this that sort of smells like router digital [inaudible] or something, and I'm wondering if -- I mean, if you look at spam, there's a group of people who I think jokingly proposed, oh, we should just write our own computer viruses that go out and infect computers and fix the vulnerabilities so the bad guys won't have control over them. That's one strand of research. But there's another strand of research of, well, why don't we generate blacklists of known bad hosts that ISPs can opt into. It's less sexy because it doesn't have this sort of other aspect to it, but it seems like something that would cause far less damage and would be more accepted by the operator community is to say we're going to set up a service where we keep a realtime list of routers that we believe to be bad, and then you can subscribe to that and say, oh, looks like L3 is down according to the service, I'll route around it, as opposed to globally advertising this L3 route without their permission. >> Ethan Katz-Bassett: So I believe that the approaches are complementary. We're not going to have a world immediately where everybody subscribes to such a service. There's no such service available yet. Even once there is, it's going to be hard to get everybody to opt into it. And the web server is still going to want ways to serve customers who maybe haven't subscribed to that service. What we're trying to do in this work is evaluate how effective the technique is. If we can demonstrate that it's not that disruptive, which is what we're trying to demonstrate, then maybe it's something that people will be more likely to use. It's definitely always going to be like a fallback.
This isn't your first line of defense for dealing with problems. But there are problems right now that aren't being addressed that are lasting for hours, days. There are networks you can call up who, even if you yell at them, are not going to fix it. There are going to be networks where you don't know who to call, or they're going to be clueless when you do call them, and you want ways to be able to deal with these problems, and that's what this is trying to address. I don't think that it's the only solution for dealing with outages. You want to redirect to other data centers, you maybe want to offer a service like this where the University of Washington can learn the information themselves. I think that all these approaches complement each other. >>: I buy that there are clueless ISPs and we have to be able to deal with the fact that they're not going to fix the problem. But in this diagram, imagine that it's AT&T who was having the failure. L3 has an incentive to serve their customers as best as possible by routing around AT&T if they're having a failure. So it seems like the right incentive here is to have each ISP be proactively sort of probing and recognizing that one of its -- one of the people that's advertising the route isn't following through and actually routing as advertised. And so then it has an alternative. And it seems like that sort of feels like the right way to fix the problem, as opposed to the web server trying to police every ISP in the world on behalf of its customers. >> Ethan Katz-Bassett: Level 3, though, would have to police every route from every one of its, you know, upstreams to everyone on the other side of it. And it's not necessarily clear that it has a great way of knowing what traffic should be going through at any point in time. If it stops getting traffic one way, it doesn't know if that's because somebody changed routes. If it is sending the traffic, it's not entirely clear that it's being delivered. I mean, then everybody would have to be continuously probing in each direction to detect those problems. Otherwise, the web server's ISP is going to need a way to signal to Level 3. That's not going to happen any time -- >>: But what you're proposing is probing anyway, just in a different place, which is to say every web server. >> Ethan Katz-Bassett: Right. But that's the sort of probing that already happens. You can already passively monitor your traffic and identify when you have dropouts in clients. I agree that there are other approaches that are maybe nicer ways of dealing with the problems, but any of those is going to require additional infrastructure, require additional people to buy in, and I just don't think that those are going to be available tomorrow, whereas I've already experimented with this, I've been able to do this and change paths on the actual internet, and so this is a solution people can use. I think that work towards proposing how we might want to redesign protocols, or layers we might want to add, to address these problems in different ways is -- I mean, it's definitely important to do that, and I've done a little bit of that as well. I think that we also need to solve the problems today, and that's where this piece fits in. So I think through that discussion I ended up touching on the three main points that we want. But essentially the point is that the loop prevention is the basic mechanism.
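To make the loop-prevention trick concrete, here is a toy Python simulation of the example from the slides. The topology is an assumption reconstructed from the talk (the web server's ISP connects to AT&T and Qwest, Level 3 sits between AT&T and the University of Washington, and Sprint is assumed to reach the web server's ISP through Qwest), and route selection is reduced to shortest AS path with a lexicographic tie-break; real BGP selection involves local preference and business policy, so this is only a sketch of the mechanism, not a model of the deployed system.

    # Assumed adjacencies; "UW" is the University of Washington.
    NEIGHBORS = {
        "WebISP": ["AT&T", "Qwest"],
        "AT&T":   ["WebISP", "Level3"],
        "Qwest":  ["WebISP", "Sprint"],
        "Level3": ["AT&T", "UW"],
        "Sprint": ["Qwest", "UW"],
        "UW":     ["Level3", "Sprint"],
    }

    def best(paths):
        # Stand-in for route selection: shortest AS path, then lexicographic.
        return min(paths, key=lambda p: (len(p), p)) if paths else None

    def converge(origin, announced_path):
        # Propagate one prefix's announcement until routing tables stop changing.
        routes = {asn: None for asn in NEIGHBORS}
        routes[origin] = announced_path
        while True:
            changed = False
            for asn in NEIGHBORS:
                if asn == origin:
                    continue
                candidates = []
                for nbr in NEIGHBORS[asn]:
                    p = routes[nbr]
                    if p is None:
                        continue
                    learned = p if nbr == origin else (nbr,) + p
                    # BGP loop prevention: drop any path we already appear in.
                    if asn not in learned:
                        candidates.append(learned)
                new = best(candidates)
                if new != routes[asn]:
                    routes[asn] = new
                    changed = True
            if not changed:
                return routes

    # Baseline: the web server's ISP announces its prefix normally, and UW
    # ends up routing through Level 3.
    print(converge("WebISP", ("WebISP",))["UW"])
    # ('Level3', 'AT&T', 'WebISP')

    # Poisoned announcement: Level 3 is inserted into the advertised path, so
    # Level 3's loop prevention discards it and UW falls back to Sprint.
    print(converge("WebISP", ("WebISP", "Level3", "WebISP"))["UW"])
    # ('Sprint', 'Qwest', 'WebISP', 'Level3', 'WebISP')

In the talk's actual proposal this poisoned announcement would go out on a more specific prefix, with a less specific, unpoisoned prefix left in place so that singly-homed customers behind Level 3 stay reachable; that refinement isn't modeled in the sketch.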
We have other techniques to try to get some of these nice properties that you want, like not cutting off people who already have working paths, successfully rerouting people who don't, and understanding when to stop doing this and when to start doing it. We had this statistic from Hubble before that 20 percent of outages last for over 10 hours. So that's 600 minutes of unavailability. If we were able to locate one of these problems and generate an automatic response to it within a few minutes, we've decreased the amount of unavailability by two orders of magnitude. And that's really the vision that I want to leave you with. Should I head to the conclusion slide? >> Ratul Mahajan: Just, yeah, wrap up. >> Ethan Katz-Bassett: Okay. The internet is hugely important, so we really need it to be reliable. I started out by talking about the fact that the internet has suboptimal performance and availability, and that we need a two-pronged approach to addressing this. I talked about two systems to illustrate that approach. First, we need better tools for operators. Traceroute has been the most used tool for troubleshooting these problems, but there's this fundamental limitation of not giving you reverse path information. I built a reverse traceroute system to address that problem. It's really essential to understanding the internet. Second, you need systems that can use these tools to automatically fix problems, and I illustrated how we can start to do that by identifying, isolating, and routing around failures. All my systems are designed to work in today's internet, so it's really about getting as much information as you can by using the protocols in novel ways. Because they work in today's internet, operators are able to directly apply the work, and I've presented both pieces of this work at RIPE and NANOG, which are the big European and North American network operator conferences. RIPE and Google are now working to deploy these systems and start using them to improve the availability and the performance of the internet. Thanks. I'm happy to take questions. [applause] >>: So you focused on the network as the cause of [inaudible] do you have any numbers about what portion is sluggishness or [inaudible] or do you just see service failures? >> Ethan Katz-Bassett: It's not something that I've looked at. The focus of my work has been on these interdomain routing issues. Certainly there are problems at both ends, and you need to be addressing both of them. I guess because this has been the focus of my work, I'm not sure how many problems fall into the other domain versus this one. There are certainly plenty of problems in this domain for people to try to solve and plenty in the other. >>: So one of the things that people argue [inaudible] is that the internet is sort of shrinking. >> Ethan Katz-Bassett: Yeah, definitely. >>: So can you comment on this trend in terms of your work? >> Ethan Katz-Bassett: So there's been some interesting work on that. On the one hand the internet is shrinking for people like Comcast and Google and probably Microsoft as they try to get more direct connections, things like that. It's also -- if you look at the median path length, I don't think that that has changed that much, and that's largely because the internet is also expanding into, say, developing regions and things like that where they're not going to have this really rich connectivity.
So any particular path may have gotten shorter, but paths distributed across the internet haven't necessarily gotten that much shorter. I actually think that this work fits pretty well into both those trends. One issue with this sort of collapsing topology of the internet that you brought up is that you have less visibility into it. It's hard to measure these peering links -- you can only measure these peering links if you have a vantage point either down here or down here. If your vantage point is out here, you're not going to be able to traverse that. Just because of policy, they're not going to export that link. One thing that we show in the papers is that you can start observing more of those links if you actually have reverse traceroute information, because you can inject packets into the middle of those paths using spoofing and observe more of those links. So I think that this helps give visibility into parts of the internet that maybe are hard to see with traditional techniques. I think the other interesting implication of both these trends, the collapsing topology and then also the expansion into developing regions, is that both of those tend in certain ways to make the internet a little bit more brittle. If people are getting rid of their tier 1 providers or having fewer tier 1 providers because they're connecting directly, suddenly they don't have this essential backup path where they can fail over to the tier 1, and it's much easier to start partitioning pieces of the internet off. So I think that it becomes important to have these debugging tools and tools that give visibility into those parts of the internet, because even as these trends improve performance, they don't necessarily improve the resilience of the network. >>: I'm curious just to hear, what are your next steps [inaudible] or something else? What's your longer term plan? >> Ethan Katz-Bassett: I missed one part of the question. What domain did you mention? >>: [inaudible] >> Ethan Katz-Bassett: So I'm interested in both continuing in this space and also expanding into new spaces. I talked today about how you can use reverse traceroute to get visibility into sort of the round-trip path. I only had visibility into half, now I can observe the round-trip path and start -- >>: [inaudible] >> Ethan Katz-Bassett: So all I was going to say is that we're starting to build up these reverse path maps where you can suddenly observe the routing of the entire internet, and that lets you -- it's not so much that it's reverse traceroute, it's that it gives you a view into the complete routing of the internet. And so I want to start using this new visibility where now I can see how everybody routes to me. I can start seeing the routing decisions that they're making, reason about what paths they weren't choosing that they might have selected, and sort of use this to answer questions about what policies are available on the internet, what policies are in use, and what are the causes of particular routing changes. I'm also interested -- I think I'm going to continue on in the internet measurement space as well. I have a number of ideas along those lines. But I also want to use my experience in interdomain routing to start addressing other questions. So as we move towards having more services and more data in the cloud, we're going to have this need for high availability and high performance, and one part of that is these troubleshooting tools that I've talked about.
But I'm also interested in looking at other questions that arise in that space. What type of network knowledge do cloud services need in order to optimize their use of the internet? How can we provide that information to them? They're not necessarily going to have the internet expertise themselves, because they're sort of outsourcing that to the cloud infrastructure. How can we give them access to that information in ways that they can reason about? There are also a number of emerging challenges in the internet. The collapsing topology is one. Mobile is another -- I'm interested in starting out from this interdomain space and expanding into those other spaces using what I know here. Mobility challenges, for instance, are going to require addressing interdomain problems, but they're also going to require addressing problems at other layers. I'm interested in moving into those spaces. >> Ratul Mahajan: Thank you. [applause]