>> Jaeyeon Jung: It's my great pleasure to introduce Professor Nick Feamster. Nick is a tenured associate professor at Georgia Tech, and he's been working on a lot of interesting research problems related to network systems, including network architecture, protocol design, security, management, measurement, and anti-censorship. So to many of you, Nick is not a stranger, right? But to properly introduce him, I downloaded his most recent CV, which is 43 pages long. So I spent some time studying his CV. Nick has published nine journal papers, 39 refereed conference papers, and 27 workshop papers, and has won five best paper awards. And he has received many prestigious awards, including the NSF Presidential Early Career Award for Scientists and Engineers and the usual things like, you know, the CAREER award, a Sloan fellowship, an IBM faculty award, and whatnot. You might think that, oh, he must have been out there for many years, but to give you a reference point, he was chosen as one of MIT Technology Review's top innovators under 35 in 2010. And to add a little bit of a personal touch, Nick and I shared an office at MIT for many years, and seeing him working day and night, weekdays and weekends, I thought I was the biggest slacker in the world. After reading his CV today, I feel the same way again. Oh, well. So okay, without further ado, Nick.
>> Nick Feamster: Thanks for the introduction. I didn't know any of those statistics. Great, so hopefully I can teach you some things today about what I call the battle for control of online communication. This work, you'll see, has several facets. And it's joint work with several of my students, as well as some other faculty members at Georgia Tech who work in the areas of both security and machine learning. So we'll see where I bring those elements of computer science into the work that I do in various portions of the talk, and I'll try to point those out as we go.
Briefly, before I get into the technical part of the talk, I just want to talk to you sort of broadly about what all those papers are kind of about, and the general approach I take to my work. So generally speaking, I pursue problems in an area that I call network operations and security, and the idea there is basically like how do you design tools, algorithms, techniques to help people run networks better and, in particular, secure them better, make them perform better, make them easier to troubleshoot when things go wrong, et cetera. And the flavor of work that I tend to do, basically, draws inspiration for the problems from domain knowledge. So I like to basically talk and sort of interact extensively with people in the network operations community and other folks in the trenches. For example, those of you who have worked in networking have, of course, heard of the North American Network Operators' Group. I talk with operators there to sort of identify practical problems with hard underlying questions. There are other groups I tend to work with as well, like the message abuse working group, as well as campus network operators and so forth. So I try to basically spend some time with people in the real world of networks to try to understand what sort of practical problems exist that potentially have some kind of hard underlying question. Those problems tend to be pretty messy in practice, of course. So then basically what I aim to do with my research is to try to abstract those problems and model them and reduce them to problems that are easier for us to understand and model. And I'll present several examples of how we do that in the problems that I'll discuss today. Then, sort of having come up with some kind of -- typically, in trying to devise solutions to these problems, the work that I do, I basically try to draw on a number of techniques from other areas of computer science, not just networking.
So in particular, as I mentioned, security is another area. But also machine learning as well, and I'll describe that in a little bit more detail. Having sort of come up with a solution, however, I find this not quite the end of the story. What I then try to do with that solution is to try to engage with industry and transfer what we've come up with on paper into practice in a number of ways. In particular, again, I'll point out in various aspects of the work where we've basically taken some of the algorithms and features that I've identified in various contexts and worked with a bunch of folks in the industry to transfer some of our findings into practice. So that's kind of the general approach I take to problems and, sort of broadly speaking, if you sort of go to the trenches and ask what kind of hard problems exist, there are a number of hard problems -- or practical problems with hard underlying questions -- that I've pursued. In particular, this is the one that I'll focus on today, which is like there's this balance between how the internet has been designed to be open and that openness actually on the flip side makes it, you know, more prone to attacks of various kinds, and figuring out ways to sort of appropriately balance that is sort of the general high level theme of the talk. There are a bunch of other problems that I've worked on with my students and if there's time towards the end of the talk, which maybe there won't be, I will elaborate on some of those as well. So that's the hard question that I want to focus on today, and I just want to briefly point out some of the areas that you'll see in today's talk. So I will spend my time focusing a lot on spam filtering and message abuse. In that work, we've basically drawn on some techniques from the theory community and machine learning community to develop techniques to sort of combat problems in that area.
Also, I'll spend some time talking about problems in anti-censorship and maintaining availability in the face of adversaries who want to disrupt communication and, again, I've sort of worked with folks across areas in that domain. And I'll talk a little bit about some more recent work that we've been doing with propaganda and filtering. So I think the sort of powerful facet to this line of work is that there are a lot of adversaries out there who would wish to sort of disrupt communication, abuse the network for their own purposes, et cetera. And the good news is that we can use tools from various aspects of computer science to in some sense level the playing field. Okay. So let me come back a little bit to this bigger question, like the sort of tension between the openness of the internet design and how that openness sort of makes it more vulnerable to attack. First, what do I mean by openness? I think this quote from the director of the Media Lab, which was in the New York Times not too long ago, really defines sort of the ethos of the internet design. Which is to say that everyone should be able to connect, to innovate, to program without asking anyone's permission. There's no central control, and the assets are widely distributed. There isn't one particular owner. So that's good. There are many good things about that, and the positive aspect of that is that this openness has catalyzed just a huge amount of innovation. So the number of users on the internet is growing. The internet is expanding to many, many different geographies. And, of course, we're seeing the ability to connect from all kinds of different devices. The flip side, though, so what I want to talk about in this talk is the flip side of that coin, which is how openness facilitates abuse and manipulation.
>>: Are you assuming a causative link where it actually may not exist?
Are you thinking of just the rise of [indiscernible] -- essentially, the pace of that is fascinating and those platforms are, by traditional measures, closed. But they're still being adopted. The number of users they're getting is humongous and [indiscernible] cell phones or feature phones. You couldn't change them in any way. And the adoption rates were way better than the internet. So what is the role of openness here if those platforms are not open?
>> Nick Feamster: I think specifically here, we can think of openness being the IP stack. So in particular, even though, you know, many aspects of those platforms remain closed, if you're able to implement an IP stack, you can get a device online. And I think that's, I guess, to focus the attention, that's really what I'm talking about here.
>>: But feature phones took off without an IP stack.
>> Nick Feamster: Feature phones?
>>: Like regular dumb phones. They took off without an IP stack.
>> Nick Feamster: Sure. I wouldn't say like this is a necessary condition for growth, right. I mean, there are plenty of technologies and platforms that have grown without being open. I would say that in the case of the internet, though, it's certainly acted as a catalyst. There may be other things that have caused the growth as well. But certainly, I don't think you can argue that it has hurt.
>>: I could, but --
>> Nick Feamster: Good, okay. Well, but so that's not really what I want to talk about today. I want to focus on the flip side of the coin, which is that by virtue of the fact that the internet is open, and specifically to what I was just mentioning, the fact that pretty much anyone with an IP stack can connect and start to send traffic, that facilitates abuse and manipulation of different kinds. And that's sort of the central question that I want to focus on today. I want to talk about this tension in the context of a couple of problems. And the first is in securing communications. In particular, I'll speak about message abuse.
So depending on the statistics that you choose to believe, anywhere from something like 80 percent to 95 percent of email traffic is spam. While you may not see it in your inbox, the network operators who are running those services definitely do see it and they have to do something about it so that it doesn't end up right in front of your face. So this remains sort of a continually vexing problem and I'll spend probably the bigger balance of the talk talking about things that we've done in that area to sort of combat that. The second topic that I want to speak about is, you know, on the flip side, maintaining openness. How do we ensure or how do we help parties communicate in the face of organizations, countries, governments, et cetera who would wish to block or disrupt that kind of communication. So you may or may not know that something like 60 countries around the world control or censor internet communications in some form. So this is a problem that's fairly pervasive for citizens of many countries. I won't spend too much time on this last topic, but this is something that's become a recent interest of mine: one of the more subtle aspects or facets of information control is not just the decision to block or permit a certain type of communication, but rather the potential to, say, manipulate what a particular user sees when they go searching for a particular thing or when they read a particular piece of content or news story or blog post, et cetera. And I will talk a little bit about some more recent work that we've been doing to try to help maintain transparency so that users can, hopefully, become more aware of those types of manipulations. So let me jump into the first topic of spam filtering. So as I mentioned already, spam is certainly a nuisance. It's becoming less of a nuisance for us, because we hardly see it. But just because you don't see it doesn't mean it isn't there.
It still remains about 95 percent of all email traffic and a significant fraction of that spam is coming from forms that -- creative forms, you might say. I'll explain in just a little bit why that's relevant to this particular talk. So the other thing, I guess, that's relevant is that a lot of this spam is coming from compromised machines or networks of compromised machines, which are commonly called bot-nets. And on one hand that's a bit of a scourge, but on the other hand we're going to be able to use that to our advantage when we talk about how to separate the good from the bad here. So the general approach to the problem of spam is let's filter it, right. This is sort of the obvious thing, right. Obviously, you want to basically take the unwanted traffic and tease it apart from the bad stuff. I'm sorry, you want to take the good stuff and tease it apart from the bad stuff. The question, of course, is --
>>: You got it right the first time.
>> Nick Feamster: Exactly. And the question, then, is what features best differentiate the spam from the legitimate mail. This is not a new question. This has been studied since, essentially, the advent of email. And there's a large body of work in various types of approaches to this problem. I'll talk about a couple of existing approaches to this problem and why there's still some room for improvement even given these techniques. The first, and I'll go into this just briefly in the upcoming slides, is content-based filtering. So you can, for example, design a filter that looks at the content of a message, i.e., what's being said, and try to figure out based on what's in the message whether or not this is something that the user is going to want to see. The other thing you could do is, you know, if there's a mail server connecting to your receiving mail server, you could look at the IP address of the sender and try to put that on a black list.
So you could develop a reputation for that IP address and say based on the behavior of this IP address in the past, I think this is good or bad. The approach that we take and the approach I'm going to focus on in this talk is complementary, and that's basically to say that we can also look at features of behavior, right. So we can basically look at not just what's being said and who's sending it, but how the message is being sent: what time of day is it being sent, what ISP is it coming from, what other kinds of behavioral patterns can we see just in the network traffic that stand out. The intuition here is that spammers fundamentally act in ways that differ from the way you or I act, and if we can identify what those features are and how they stand out, we can key off of those to design filters as well. So let me first quickly talk about the other two approaches and kind of where they leave some room for improvement. So content-based filters, as I mentioned, look at what's being said, and one of the things to realize about that, if you sort of talk to the operator of a large mail service provider, for example folks we've talked to include Yahoo, Secure Computing, et cetera, what they will tell you is there's something like 100,000 different ways of spelling Viagra. That's sort of just to illustrate the point of how difficult this type of thing can be, right, and how asymmetric the attack is, right. So here are some examples of other ways that spammers use content to sort of turn the battle in their favor. So they can take a message and embed it in a PDF or an Excel spreadsheet or an image or even an MP3, and on one side, it's fairly easy for a spammer to embed a message in a new type of carrier, if you will. On the flip side, the filter maintainers have to design ways to understand, parse, extract, et cetera from different types of content.
So this is certainly something that email service providers spend a lot of time doing, but it's definitely an aspect where the battle is a little bit tilted in the spammer's favor due to the fact that, you know, it's relatively easy to evade these content filters in comparison to sort of updating the capabilities of that filter. The second approach, as I mentioned, is you could take the IP address and assign a reputation to that IP address. If you look at your mail headers, you would see something like what's called a received mail header. Of course, if this message is coming from a spammer, you'd see a string of these things in the mail header and a lot of them would be forged. But at least one would think, and I'll explain to you actually in a few slides why this is not always the case, but one would think that in most cases, this IP address that's completing a TCP three-way handshake with you is the IP address of who you think it is. And if the recipient could keep track of the behavior of any of those particular IP addresses that are connecting to the server, then they can decide what they think of that particular IP address. Is this a likely spammer or a legitimate sender? Now, that actually works pretty well. There are large organizations that have done pretty well at maintaining these kinds of black lists. But again, this is a bit of a cat and mouse game. And the challenge here is that the IP addresses of email senders are never the same on any two given days, shall we say. One of the experiments that we did actually to study the behavior of these senders is we set up what's called a spam trap or a spam honey pot, if you will. It was a mail server with several domains that had no legitimate email addresses. So what a typical mail server would do is basically just reject any attempts to send to nonexistent email addresses. What we did in this case was basically accept any connection attempt and said okay, thank you very much.
We will deliver that. In fact, we're delivering it to our spool, but to no one in particular, and then gathering statistics on who's talking to us. When we did that, we basically saw that on any given day, about 10 percent of these senders are coming from IP addresses that we haven't seen in the past. So there's that sort of churn on the black hat side, if you will, the bad side of things, and there are possible causes for that type of thing happening. We can't necessarily attribute the cause of the churn to any one thing in particular. But there are, you know, malicious reasons for IP addresses changing on any given day. I'll take the question in just a minute. But there are also good reasons why you might see email from an IP address you've never seen before. So, for example, the renumbering of a mail server, or someone just decides to set up a mail server, you know, on any particular day that they hadn't been operating in the past, so there are good reasons for IP addresses to suddenly start sending mail as well. So you can't just say let's just black list everything that's new. So coming back to the sort of goal of openness and the desire to keep your false positives low, the ephemerality also becomes a problem.
>>: You can't paint 95 percent of the internet black and still call it open.
>> Nick Feamster: Exactly. So that's essentially where this leaves some room for improvement.
>>: So you started off saying that, and it's true that most of us don't see a lot of spam because of [indiscernible] getting pretty good at these approaches. Who actually does see the spam? And why are their mail servers not doing even these basic things, which would cut down most of the spam [indiscernible].
>> Nick Feamster: So these basic things actually turn out to cut about 80 percent of connection attempts.
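The spam-trap churn measurement described above (roughly 10 percent of senders per day arriving from previously unseen IP addresses) can be sketched in a few lines. This is only an illustration under stated assumptions: the `(day, ip)` event format and the function name `daily_new_sender_fraction` are hypothetical, not taken from the talk or from the actual measurement code.

```python
from collections import defaultdict


def daily_new_sender_fraction(events):
    """Return {day: fraction of that day's sender IPs never seen on an earlier day}.

    `events` is an iterable of (day, ip) pairs; this record format is an
    assumption made for the sketch. Days must be sortable.
    """
    ips_by_day = defaultdict(set)
    for day, ip in events:
        ips_by_day[day].add(ip)

    seen_before = set()
    fractions = {}
    for day in sorted(ips_by_day):
        todays_ips = ips_by_day[day]
        new_ips = todays_ips - seen_before  # senders with no prior history
        fractions[day] = len(new_ips) / len(todays_ips)
        seen_before |= todays_ips
    return fractions
```

By construction the first day always reports a fraction of 1.0, which is why a measurement like this only becomes meaningful after the trap has been running long enough to build up a history.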
So you can basically take an IP black list and -- I guess a not so well known fact is that operational mail servers do drop about 80 percent of the incoming connections, just based on things like IP reputation. So the dirty secret behind what I'm presenting to you here is actually that the gains are somewhat incremental, because you're taking that 80 percent that is already -- you're taking the 20 percent on which the early decision hasn't been made and you're basically trying to crank that up a bit. So I think the answer to your question is that, yeah, already this is being done to quite some degree, and we're basically looking for other features, et cetera, that can help us gain additional advantage.
>>: But then if no one's seeing the spam, then shouldn't the spam rates go down to zero?
>> Nick Feamster: Well, the issue is that actually, there's still about 20 percent of the connections that do, you know, that do get accepted. Now, of that, then content filters, et cetera, get applied. So there's some fraction of that that you do see; it's actually quite small. But then the other stuff that you never see also presents a problem, because you've got to store it. So there are operational challenges as well. Like once you've basically decided to accept a message for delivery, you've got to do something with it. And the more you can basically shave that down as well, the better off the operators of the service are.
>>: A follow-up on the [indiscernible] question. Maybe some people do click those [indiscernible] ads. So maybe my spam is not his spam. So as long as those people exist, it may be hard to eliminate spam because some people do want to receive --
>> Nick Feamster: I think if no one were clicking on them, then obviously that would be end of story, right. But there has to be some small fraction of people who are actually buying the stuff, yeah.
>>: What was the scale of the gathering here?
10 percent new every day, it's hard to imagine that Hotmail and Gmail are seeing that level of new IP addresses. How long was your time window and kind of what did you do to advertise these?
>> Nick Feamster: So advertising is a tricky thing, actually. As it turns out, a lot of the baiting seems to come from WHOIS scraping, right, looking at newly registered domains, because we did actually put the domains out there. And to not much effect for a while, actually. Basically, so what was the question, the scale?
>>: Did it actually plateau?
>> Nick Feamster: So we did this over the course of about four months. So yeah, you're right that eventually it's bound to plateau because there are a limited number of IP addresses out there.
>>: That was your plateau rate, right?
>> Nick Feamster: Yeah, that was our plateau rate.
>>: So Gmail would see more?
>> Nick Feamster: Eventually, they'd have to because they're going to see a lot more on any given day. In fact, maybe they plateau after a day. Yeah. Okay. So that's basically where the state of the art is and where there's some room for improvement. So as I mentioned, this is essentially the approach that we're taking. And if you're going to talk about looking at network level behavior to try to do detection, the obvious question then is, well, what's different about the spammers, right. So sounds nice, right. Intuitively, spammers aren't like us. Presumably, they should behave differently. But what exactly is it about them that looks different? And what I'm going to talk about is three different ways we've observed spammers behaving differently from legitimate senders. Intuitively they make sense, and I'm going to try to drill down into each of these and show how we've taken these kind of axioms, if you will, and derived more low-level features and detection methods based on this intuition. The first is what I call agility.
The idea here is that spammers actually have to move to escape detection. So if spammers always sent email from the same IP address, if they always hosted their pill sites and phishing sites at the same URLs or the same domains, et cetera, then eventually all those places would end up on black lists or shut down, and things wouldn't work so well. So spammers actually have to move around to escape detection. So on the one hand that's kind of inconvenient, right, because you create this cat and mouse game where, with the techniques I've already described, you continually have to update black lists and so forth to keep up with that. But on the flip side, what we can do is actually recognize that the way that spammers change where they're performing their activities differs from the way that anything else on the internet changes. And I'm going to basically use that intuition to show you some particular features that really stand out in terms of the spammer behavior. The second is that spammers just send mail in ways that you and I don't. So just the way that they send messages to people looks a lot different. And the last one sort of keys off the observation that a lot of spam is coming from these bot-nets, these networks of compromised machines. As a result of that, what we can see is some coordination that wouldn't otherwise pop out from groups of legitimate senders. The obvious thing here, right, and one of the things that we study, is that sending behavior actually exhibits some coordination. But I'll show you actually another pretty cool and interesting example of coordination that also popped out when we get to that part of the talk. So I'm going to first talk about agility. In particular, I'll talk about how spammers have used various internet protocols to move around.
Then I'll talk about different aspects of spammer behavior that look different from legitimate senders and how we built supervised learning classifiers on top of that to help differentiate spammer behavior from that of legitimate senders, and then I'll talk about some of the behaviors that tend to cluster well. Just to sort of whet your appetite here, one of the coordination behaviors that we'll key off of is that spammers actually send mail to themselves, so think about that for a while and then I'll come back to it. So let's first talk about agility. One of the things that we observed, and this comes back to the data collection method that I spoke about before. So basically, what we could do is set up a spam honey pot, if you will, or spam trap, and see who is sending us messages. The other thing we can do is sort of join that with our view of what the internet routing table looks like at any particular point. And then ask, is there any kind of correlation there between the two things that we observe. So here's something that we see, and I'll point these things out as I walk through this example. So when we look at the internet routing table, one of the things that we actually see is an advertisement for an IP prefix that lasted only about ten minutes. For those of you who don't know what BGP is, by the way, I should have just mentioned. It stands for Border Gateway Protocol. And this is the language that ISPs use to talk to one another to advertise reachability to a range of IP addresses. So they can say, hey, I know how to reach this range of IP addresses. Please send your traffic through me to get there. Okay, so what we see here between this red dot and the blue dot: the red dot is an example of an ISP saying come through me to reach this set of IP addresses, and the blue dot is, like ten minutes later, a retraction of that statement. It's called a withdrawal message. And this is already looking kind of weird, right.
Because you see a range of IP addresses that's advertised for a very short period of time. When we run networks, typically we like to have our network up for more than ten minutes, right? So already this is looking kind of a little bit strange. Then the next thing we saw is if we sort of look at what's happening in that range of IP addresses, in terms of who's trying to talk to us, we saw something kind of interesting. We saw, in this particular episode, five different IP addresses contained within that part of the network that are talking to us, like sending us spam. So this is pretty strange, right? You have a network where the reachability is extremely short-lived. And then inside that ten-minute window, you see some activity. Now, that's weird enough in and of itself. But then if I were to ask you, if you were to steal a region of IP address space, would you steal a big region or a small region? And we thought really, you know, probably you'd steal a small region of IP addresses because with a smaller one, people are less likely to notice. In fact, with this particular behavior we observed the opposite. So we actually saw these short-lived announcements popping up for, like, huge regions of IP addresses: slash eights, or about 1/256 of the entire internet address space, were being advertised in this sort of short-lived way. And you're thinking like, what the heck, isn't someone going to notice if someone like steals 1/256 of the internet? Well, this is kind of brilliant attack-wise, right, because of the way internet routing works. For those of you who know how it works, we know that it works on what's called longest prefix match. The idea is that if there's some network that's advertising a more specific range of addresses then that's always going to win. The routers will continue to forward to the guy who's advertised the more specific space.
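The longest-prefix-match rule just described can be illustrated with a small sketch using Python's standard `ipaddress` module. The routing table contents and the next-hop labels (`"short-lived-announcer"`, `"legitimate-network"`) are made up for illustration; real routers use trie-based lookups rather than a linear scan, so this shows the rule, not an implementation.

```python
import ipaddress


def longest_prefix_match(routes, address):
    """Return the next hop of the most specific prefix covering `address`.

    `routes` maps prefix strings (e.g. "10.0.0.0/8") to next-hop labels.
    Returns None when no prefix covers the address.
    """
    addr = ipaddress.ip_address(address)
    best_net, best_hop = None, None
    for prefix, next_hop in routes.items():
        net = ipaddress.ip_network(prefix)
        # A longer prefix length means a more specific route, which wins.
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_hop = net, next_hop
    return best_hop


# Hypothetical table: a broad /8 announcement next to an existing /16.
routes = {
    "10.0.0.0/8": "short-lived-announcer",
    "10.1.0.0/16": "legitimate-network",
}
```

With this table, an address inside `10.1.0.0/16` still resolves to the legitimate network, while any other address in `10.0.0.0/8` resolves to whoever announced the /8. That is the property the talk points to: the broad announcement only captures address space that nobody else is advertising more specifically.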
Meanwhile, the attacker who has grabbed this sort of less specific space has suddenly got a huge chunk of addresses that isn't likely to be filtered because it's a big chunk. Short prefixes tend not to be filtered. These things are actually allocated as well so it doesn't look too fishy. At least no fishier than it might otherwise look, and they get a huge chunk of addresses. Anything that's not being advertised by some other existing network, suddenly basically they've owned. So that tends to be convenient.
>>: So aren't these attackers sophisticated enough to look at the BGP table from route or somewhere else and pick out more specific blocks that are not being advertised and just advertise those?
>> Nick Feamster: I think they probably could. It seems like that might work. One thing, though, like one reason that might not work is sometimes operators do set up their filters to filter more specific regions if they turn out to be shorter. So that might be one reason why it might not work. But actually, I wouldn't say that it's -- I can't say that it's not happening as well. It may be happening as well. Good question.
>>: So it seems the study was done six years ago. Has the situation changed?
>> Nick Feamster: So we most recently looked at this last year as well. So there's still a significant number of short-lived IP addresses that are also sourcing attack traffic of different kinds, both spam and other types of scans. So I haven't looked at it in the last year, but this is like behavior that persisted up until about a year ago.
>>: So how do you propose to defend against this?
>> Nick Feamster: That's a tricky question, actually. It touches a little bit on the later part of the talk. I could spend quite a bit of time talking about the philosophy here. I mean, the obvious thing to do is to sort of design better filters or, I'm sorry, be more vigilant about updating the filters. Of course, we know that that's tricky. But that would be kind of the ideal situation.
Another thing you could do is talk about something like secure BGP, right, where you can't actually advertise a prefix without it actually being assigned and attributed to a particular network. I actually think that there are some stones unturned there, as far as like that isn't necessarily a panacea. Then basically who's deciding who is allowed to announce what? I think that doesn't necessarily solve things. You potentially create a situation where things are less open. So I think the right answer, which is probably unattainable, is have operators be more vigilant about updating their filters. But maybe there's a better answer to that that we just haven't thought of yet. Okay, so that's one example. Another thing that I mentioned that I was going to talk about was when spammers send messages. They obviously want someone to click on something and buy something. In order to do that, they need to host a site somewhere. They need to host the website that's the Canadian pharmacy or the phishing site or what have you. And the problem with just hosting a site in any particular place is that if you leave it in that place for too long, the infrastructure will be black listed and shut down. So what attackers do in response to that is actually use the naming infrastructure to move their infrastructure around. So this is a picture taken from the Honeynet Project. And basically there are two ways that this can be done. One is that you can just use the DNS in the normal load balancing kind of way: when a client looks up a particular domain, you can return different IP addresses. So you can change the IP addresses in the A-record. And that's often called single-flux. It's kind of like black hat load balancing. Now, the problem with that approach is actually that this thing right here, what's called the authoritative name server for that particular domain, isn't moving around as well.
So if I were to try to identify what's going on here and shut things down, I could blacklist this authoritative name server. What the attackers do in response to that is just take this thing and move it on to a botnet and start moving this thing around so you can no longer blacklist the IP address of an authoritative name server. So on the one hand, that's kind of inconvenient. But on the other hand, you can imagine that there are not very many legitimately operated networks that perform this type of behavior. So in particular, what we can do then is look for cases where the infrastructure, in particular the IP address of this stuff up here, the IP address of the authoritative name server, is moving around. And actually, this is work that Jaeyeon and I did with a student of mine several years ago. So what we did is we looked at the domains coming into our spam trap, and we repeatedly queried those domains and asked how often it is the case that the authoritative name server for that domain is moving around. So this is just one result from that study. What we can see here, and this is basically a CDF of the interarrival time between the changes at the authoritative name servers in the hierarchy, and the red line is basically the domains that are coming into our spam trap. You can see that in about half of those cases, the IP address of the authoritative name server is changing about once every six hours. So not something you expect to see on a legitimate network. And as you can see, it differs quite a lot from the legitimate domains. So that's one type of DNS agility. Another, of course, is that the attackers can't continue to use the same domains, because the domain name itself is also going to end up on a blacklist eventually too. So they've got to continually register new domains. So that's inconvenient. But what we can do on the flip side there is look for what's different about these new domains. 
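[Editor's sketch] The name-server measurement described above can be illustrated in a few lines of Python. This is a minimal sketch, not the study's actual tooling: the function names, the observation format (a time-sorted list of timestamps paired with the set of authoritative name server IPs seen at that time), and the six-hour threshold used as a default are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from statistics import median


def change_intervals(observations):
    """Return the seconds elapsed between successive changes of the NS IP set.

    observations: time-sorted list of (datetime, frozenset of IP strings),
    one entry per repeated query of the same domain (hypothetical format).
    """
    intervals = []
    last_change = None
    prev_ips = None
    for ts, ips in observations:
        if prev_ips is not None and ips != prev_ips:
            # A change was observed; record the gap since the previous change.
            if last_change is not None:
                intervals.append((ts - last_change).total_seconds())
            last_change = ts
        prev_ips = ips
    return intervals


def looks_agile(observations, max_median_seconds=6 * 3600):
    """Flag a domain whose name servers move at least as often as the threshold."""
    intervals = change_intervals(observations)
    return bool(intervals) and median(intervals) <= max_median_seconds


t0 = datetime(2011, 1, 1)
hours = lambda h: t0 + timedelta(hours=h)
example = [
    (hours(0), frozenset({"192.0.2.1"})),
    (hours(6), frozenset({"198.51.100.7"})),
    (hours(12), frozenset({"203.0.113.9"})),
    (hours(18), frozenset({"203.0.113.9"})),
    (hours(24), frozenset({"192.0.2.44"})),
]
print(change_intervals(example))
```

Collecting these intervals over many domains and plotting their distribution would give the kind of CDF the talk refers to; a single threshold is only a stand-in for that comparison.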
Well, to get you thinking about that, I can ask what happens if you register a domain. Who looks it up? Typically, nobody. When you first register a domain and no one's heard of it, no one's looking it up except for you. What happens when an attacker registers a new domain? Well, it might get enlisted as part of a scam campaign. It might be used for botnet command and control. So we can use the initial look-up behavior to provide an early reputation for some of these newly registered domains. And that's what we did. So for this, of course, you need special data, if you will. You need a special vantage point. So we did this in collaboration with some folks from Verisign, who have a nice view of the recursive resolvers looking up second-level domains in dot-com and dot-net. We asked, for those newly registered domains, who is looking them up within the first week of registration? And by who, I mean how many distinct slash 24 networks. And you can see in this case that in about 40 percent of the cases, 40 percent of those newly registered domains, there's something like several hundred unique slash 24 networks or more looking it up almost right away. And that essentially almost never happens with these legitimately registered domains. Okay. So now I want to talk a little bit about this second axiom, which is that the way that spammers actually send mail differs from the way that you and I tend to send mail. What we essentially did in this part of the work was come up with a classifier based on supervised learning to distinguish spammers from legitimate senders. And the challenge here becomes how do you identify the features, the behavioral features that differ between legitimate senders and spammers? What I'm going to do is to show you a couple of highlights, because there are a bunch of features that tend to work well. A lot of them are kind of obvious and boring. 
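[Editor's sketch] The early look-up measurement, counting how many distinct slash 24 networks query a domain in its first week, can be sketched as follows. The log format (timestamp and IPv4 resolver address pairs) and the function name are assumptions for illustration; the actual data came from a registry vantage point that is not reproduced here.

```python
import ipaddress
from datetime import datetime, timedelta


def distinct_slash24s(registered_at, lookups, window=timedelta(days=7)):
    """Count distinct /24 networks that looked a domain up within the window.

    lookups: iterable of (datetime, IPv4 address string) pairs, one per query
    seen for the domain (hypothetical log format).
    """
    networks = set()
    for ts, resolver_ip in lookups:
        if registered_at <= ts < registered_at + window:
            # strict=False masks off the host bits, giving the enclosing /24.
            networks.add(ipaddress.ip_network(f"{resolver_ip}/24", strict=False))
    return len(networks)


registered = datetime(2011, 3, 1)
log = [
    (registered + timedelta(hours=1), "192.0.2.1"),
    (registered + timedelta(hours=2), "192.0.2.200"),   # same /24 as above
    (registered + timedelta(days=2), "198.51.100.5"),
    (registered + timedelta(days=8), "203.0.113.9"),    # outside the first week
]
print(distinct_slash24s(registered, log))
```

A count in the hundreds within days of registration is the pattern the talk associates with abusive domains; the count alone is one signal, not a verdict.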
So, for example, one of the things you can do is look at the ISP or the AS of the sender. And that tends to work pretty well. But I'll focus on a couple of the ones that are more interesting, because they tell us a little bit more about how, you know, how -- they provide a little insight in terms of how spammers tend to behave. So one of them, what we did actually was take the source and -- I should mention the data that we used for this part of the study. This was work that we did in collaboration with McAfee, who has mail filtering appliances deployed in something like 8,000 different enterprise networks. These are globally distributed. So this is biased, of course, by where they've got their mail filtering appliances distributed, but this is just to sort of also kind of paint a picture of an example of where the behavior may differ. So one of the things that we saw, for example, is that about 90 percent of the legitimate messages travel, you know, in a relatively close proximity. If you look at the spammer behavior, actually, it's significantly more, you know, more evenly distributed across distance. Another thing that we looked at, and this sort of comes back again to the fact that spam is being sent from compromised machines. We looked at how email is being sent from different regions of IP address space. So again, to sort of paint some intuition here, it's fairly unlikely that you would have a slash 24 network with 200 legitimate mail servers on it. Typically, you'd expect a handful, at most. On the other hand, what we were seeing in the cases of spam activity were these slash 24 networks or networks of such size where there would be 200 email senders in fairly close proximity. You know, in that particular slash 24. It makes sense, when you think about how spammers use the infrastructure to operate, right. They compromise a bunch of machines and then enlist them to start launching these types of campaigns. 
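[Editor's sketch] One way to turn this density intuition into a number is to measure how far, in numeric IPv4 address space, one has to go from a given sender before k other senders have been covered. The sketch below is an assumption-laden illustration: the function name, the use of plain numeric distance between addresses, and the handling of the too-few-senders case are editorial choices, not details taken from the study.

```python
import ipaddress


def knn_span(sender_ip, other_senders, k):
    """Distance in addresses from sender_ip to its k-th nearest other sender.

    Returns None when fewer than k other senders are known. A small value
    means the sender sits in a densely populated region of sending activity.
    """
    origin = int(ipaddress.IPv4Address(sender_ip))
    distances = sorted(
        abs(int(ipaddress.IPv4Address(other)) - origin)
        for other in other_senders
        if other != sender_ip
    )
    if len(distances) < k:
        return None
    return distances[k - 1]


senders = ["10.0.0.11", "10.0.0.8", "10.0.1.10"]
print(knn_span("10.0.0.10", senders, k=2))
print(knn_span("10.0.0.10", senders, k=3))
```

In a classifier this value would be one input among many, since a few legitimate networks, such as large webmail providers, also send from dense address ranges.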
So what we can do is actually key off of that behavior to sort of design a feature, a behavioral feature that allows us to distinguish spammers from legitimate senders. In particular what we did in this case was to say when you see a piece of email sent, how far away in IP address space do you need to go before you see the K next nearest senders. For a particular value of K, you know, the smaller that IP address range or that space, IP space is, the more dense that sending activity is. And that's essentially what we see here. >>: So I'm having trouble understanding the graph with the intuition, because I've got to assume that Gmail, Hotmail and large companies have lots of -- well, large companies have lots of Outlook servers and Gmail and Hotmail have lots of IPs. What is being sampled here? >> Nick Feamster: So basically, the way to read this is like -- >>: What's the data point? >> Nick Feamster: So yes. So the way to read this is how far -- this is how far out in IP address space do you need to go to observe the K closest email senders. So, for example if I take a particular IP address of a sender, I can say -- >>: IP addresses, not emails? >> Nick Feamster: The data is IP addresses here. IP addresses of senders. So you could say, for example, in the case of spammers, to see the K closest -- to see the K closest email senders, I need to go out, I need to capture sort of 20,000 or so IP addresses surrounding that IP address. I've got to go basically to an order of magnitude more to see ten email senders if I start from a legitimate sender. Your question about web mail providers is an interesting one. I'd say that's the exception rather than the rule. There are only a few of those and there are a lot more email senders who are not those people. Because we're not talking about volumes here. We're talking about activity. >>: So this is -- >> Nick Feamster: This is a mean. >>: Why do I care about the mean? 
>> Nick Feamster: Because it's, I mean, there's a difference here that's clearly, that's represented. >>: Sure, but you can have a very high mean, but you can still have a bottom 10 percent that's a real problem. So if you have a lot of large companies, or a reasonable number that have ten Outlook servers, then the ability to use this as a heuristic goes down dramatically. >> Nick Feamster: I would posit that the number of legitimately operated networks that have more than a handful of mail servers for several hundred IP addresses is not that many. I mean, your typical enterprise network, which is basically what we're looking at here, because remember the data set we're looking at. We're looking at mail from enterprise networks, you know, where these spam filtering boxes have been deployed. We're not talking about Gmail or Hotmail or Yahoo in this case. But if you look at sort of the typical enterprise network or campus network or what have you, the type of place that's likely to deploy a filter of this type, you're not going to have dense email sending activity. Take the Georgia Tech campus, for example. You will not find a slash 24 network on that campus with 200 legitimate mail servers. That's essentially the intuition we're operating on here. Now, to your question about the mean, you're very right, there are going to be outliers to this, and you've actually identified one of them. So that is basically why, in the context of designing a supervised learning classifier, you can't rely on just one feature. So we don't look at means when we design the classifier. We look at this particular feature as an input among many of the other features as well. So obviously, you're not going to get it right every time. Just like in the case of distance, you're not going to get that right every time either. So the point here is basically to point out a general trend that is true a lot of the time. >>: Why is certain spam higher than, like, other spam? 
>> Nick Feamster: I expect that's kind of a labeling problem. Or it may be the case that there's a lot more, or a lot less, of one of those categories. The way that the data was labeled was actually sort of post hoc manual, semi-manual. And we didn't do that. Those were actually labels that were given to us. Okay. So that's just to paint a picture of a couple of the features that we used. We used a whole bunch more that I don't have the time to talk about. Once we put those into a supervised learning classifier, the false positive rate we get if we basically look at the detection rate that something like Spamhaus gets, we get about four-tenths of one percent false positive rate, which is still a bit too high to be practical. Most mail servers like to see about one-tenth of one percent. We can play other games, like whitelisting ASes for which we get the most wrong answers, et cetera, et cetera, tune a bunch of knobs to get that down to about 0.14% or so. But this is basically, I'm giving you the number that's basically just taking the features that we've identified and turning the crank. We can actually do pretty well. Some of the features that I showed you also are used by McAfee, who we worked with on this particular project, in the mail filter that they use in practice. Okay. So finally, just to talk about coordination a bit. So I'm glad you actually raised the point of web mail actually, because as you point out, this is actually totally changing the game. In particular, as you mentioned, right, that particular feature that we talked about may not apply. But in general, the types of features that we studied in that work may not apply, because a lot of them key off of IP addresses. But, in fact, IP blacklists aren't going to work either, right. So there's an interesting thing that's going on, which is that now we can no longer use IP addresses or the types of behavior that I mentioned. Well, what can we use? We can use user input. 
We can use things like the "mark as spam" button. Well, so what's the next step in that game? Well, so actually what we're seeing now is that spammers are sending mail to themselves. You might think why are they sending mail to themselves? It's so they can vote on their own messages. So they send mail to themselves and they basically vote "not spam" on the messages that they see. So this is some work that we did with Yahoo. In particular, over the course of about three months, we saw about a million and a half not spam votes coming from accounts that basically did nothing but vote not spam on anything. So there's a fair amount of this activity going on. But the other kind of interesting thing about this is that it doesn't take that much. Because the cost of a false positive is so high, anything that basically gets a not spam vote really tweaks the weights. So what we want to do, of course, is try to detect those fraudulent votes. And what we can do is actually make some observations about how those votes are being cast to try to distinguish like fraudulent not spam votes from the legitimate ones. This actually draws some inspiration from some work that's gone on here in terms of detecting compromised accounts through coordinated activity, in particular the BotGraph work is quite similar to the observation that I'm about to point out here. But what we can do, actually, is take this voting problem and model it as a bipartite graph where, in gray here, we have the IP addresses of spammers and legitimate senders and these are IP addresses that are being voted on. And then we've got some user accounts that are actually casting the votes. So what we can see when we sort of create this graph of activity is a couple things pop out. One, of course, is that these compromised accounts tend to cast a lot more not spam votes than, like, a legitimate user typically would. 
But the other thing that really pops out is that spammer IP addresses, they actually tend to receive not spam votes from many different compromised accounts over here on the right side of the graph. There's a circularity here because how do I know it's compromised before I figure out that one of these guys is being voted on by a bunch of compromised accounts. But you can break that circularity by clustering. We can basically look at IP addresses over here that are being voted on and we can say are there groups of IP addresses here that are being voted on by a similar group of user identities or user accounts on this side and basically, we can create a cluster based on sort of observing the similarity of voting behavior there. So that's effectively what we did. We applied sort of a graph-based clustering approach to tease apart the user identities that vote in a similar fashion across many different IP addresses. Now, the approach that we started out with, and the same approach that BotGraph uses in their work to detect compromised accounts, is you can build a K neighborhood graph. The idea there is basically you figure out instances of IP addresses for which a particular user identity votes in the same way at least K different times. And then you basically group user identities based on that value of K. The problem with that, actually, is that it can produce false positives. So if you've got a group of good guys all voting in the same way and you've identified some fraudulent voters here and you've got maybe one account in the good guys that has some strange behavior, either because it's coming through a proxy or maybe it happens to be compromised itself, something, then all of a sudden, you basically take this whole cluster of good guys and you sort of lump it in with the bad guys. So the problem with just sort of applying just a straight K neighborhood clustering is that false positive -- it's hard to keep a handle on the false positives. 
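[Editor's sketch] The K neighborhood idea can be illustrated on a toy bipartite vote graph. This is one reading of the description above, not the implementation used in the work or in BotGraph: accounts are linked when they cast "not spam" votes on at least k of the same sender IP addresses, and linked accounts are merged into groups with a union-find structure. The account names, data layout and function name are hypothetical.

```python
from itertools import combinations


def k_neighborhood_groups(votes, k):
    """Group accounts that share at least k voted-on IP addresses.

    votes: dict mapping account name -> set of sender IPs it voted "not spam" on.
    Returns a list of sets, one per group containing two or more accounts.
    """
    accounts = sorted(votes)
    parent = {account: account for account in accounts}

    def find(account):
        # Follow parent links to the group representative, compressing the path.
        while parent[account] != account:
            parent[account] = parent[parent[account]]
            account = parent[account]
        return account

    for a, b in combinations(accounts, 2):
        if len(votes[a] & votes[b]) >= k:
            parent[find(a)] = find(b)

    groups = {}
    for account in accounts:
        groups.setdefault(find(account), set()).add(account)
    return [members for members in groups.values() if len(members) > 1]


votes = {
    "u1": {"192.0.2.1", "192.0.2.2", "192.0.2.3"},
    "u2": {"192.0.2.1", "192.0.2.2", "192.0.2.3"},
    "u3": {"192.0.2.2", "192.0.2.3", "192.0.2.4"},
    "g1": {"198.51.100.9"},
}
print(k_neighborhood_groups(votes, k=2))
```

Because groups are formed transitively, a single account that overlaps with both a benign group and a fraudulent one merges the two, which is the false positive problem described above. The pairwise comparison is also quadratic in the number of accounts, so it does not scale as written.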
What we can do to improve that approach is actually apply something called canopy clustering. I'm not an expert in this area so I won't speak too much to the details. But at a high level, what canopy clustering allows you to do is it allows you to apply clustering in two stages. You basically create these large, sort of larger groups of things to cluster on and then you reapply k-neighborhood clustering inside those things that are called canopies. That actually allows things to scale a lot better so that's important here in this case where there's a lot of mail and a lot of senders, but also you can keep a better handle on false positives. So this is actually something that we worked on with the folks at Yahoo and this is something they actually used to try to now detect these kind of fraudulent votes. >>: Couldn't you also break the symmetry by actually looking at messages and seeing whether they're spam? >> Nick Feamster: You could do that actually, I guess. So then effectively what you're doing is bringing a human in the loop to sort of see -- >>: Or other techniques to analyze what's spam. >> Nick Feamster: Yeah, you could potentially do that as well. I think that's a totally reasonable approach. You do have to look at content and there's some cost to doing that, but that's probably doable. Or you could certainly do other things too, look at a spam score based on other features ahead or what have you. So you could look at content. That could probably work okay. Once you sort of get away from content, you get into this mess of now everyone's sending email from Gmail to Gmail and Hotmail to Gmail and Hotmail to Yahoo and so forth. So a lot of the features that we typically key on that relate to the IP address and headers and stuff, they no longer work so you kind of have to dig into content to really take the approach that you're taking, but I think it's reasonable. So just a time check, actually. Whatever you -- I know we have like a 90-minute slot so how do you want me to sort of proceed? I mean, I can finish in anywhere from three minutes to 20. 
>> Jaeyeon Jung: You can continue with what you have. >> Nick Feamster: So what I'll do is I wanted to spend a few minutes talking about this other problem, which is maintaining openness, in particular enabling or sort of facilitating communication in the face of a censor who would wish to sort of disrupt this communication. So this is a problem, of course, that's sort of enjoying increased prominence and sort of at a high level, I'll describe to you what the problem is. Of course, Alice wants to talk to Bob. There's a censor in the middle who would either wish to block that traffic or, worse, potentially punish Alice for attempting to talk to Bob. So I think that's where things actually get a little bit different than kind of the conventional types of things that we've seen in this area, because not only do we want to allow Alice to talk to Bob, but we also want to potentially conceal the fact that she's trying to do so in the first place. The general approach to allowing this, to facilitating this communication is to use some type of helper where while Alice and Bob may not be allowed to talk directly to one another, Bob may be able to communicate with some kind of helper, and maybe Alice can talk to that helper as well. So the communication between Alice and that helper is somehow permitted. The idea, then, is to basically use this point of indirection to allow Alice and Bob to send each other messages. So the challenge here, and this is sort of a studied problem. The most famous helper, I think, is a mix net called Tor. But the challenge there is when using something like Tor it's fairly easy to hide what you're getting, and sometimes it can be easy to sort of break through censors using those techniques. 
If someone happens to be looking, if the censor happens to be looking at that traffic, it can be very hard to hide that you're doing, that you're actually performing that kind of activity. So one of the things that we've been doing over the years is actually trying to design communication techniques that defeat censorship that are also deniable. In other words, that disguise the fact that Alice is using this kind of technique in the first place. So what I'm going to do is actually talk about a particular system that we've designed to achieve that goal. There are a number of things we want to achieve in the design of the system. One, of course, is we want to thwart disruption. We want to make it difficult for the censor to disrupt the communication. In order to do that, we use a combination of sort of redundancy techniques and hiding. The other thing we want to do is make that activity, the act of Alice fetching that content or communicating with Bob, we want to make that look innocuous and there we'll steal some techniques from distributed systems. Finally, what we want to do is if a censor's watching the communication between Alice and Bob, we want to make it less obvious that they're talking to one another. What we want to do is decouple the sending and the receiving of messages. In the real world, what this might look like is I want to send you a message, but I know someone's watching and I want to make it not so obvious that I'm sending you a message. So what I do is I put the message in a paper bag under the bridge and I tell you to go look there at some later point and pick it up. So to someone who's observing, they may not notice any correlation between who's dropping off the message and who's picking it up. So what this system I'm going to just briefly describe to you is, is essentially paper bags under a bridge for web 2.0. Effectively, what we do is use user-generated content sites to allow Alice and Bob to communicate with one another. 
So what Bob is going to do in this case, Bob, we'll just say he's a Flickr user. Flickr itself may be blocked. That's fine. I'm using Flickr for the sake of example and also because it's what we built our prototype on. But this could be any site that hosts user-generated content. So what Bob is going to do is take his message. He's going to actually sort of embed it in some kind of content, whether it might be an image, a video, something like that. And he's going to post it on a user-generated content site. Alice is basically going to retrieve that content, and to the censor, this is going to look like Alice is looking at videos of cats or, you know, vacation photos or something. When, in fact, what she's really interested in is that thing that's hiding inside the cover. So let me just describe to you in just a little bit of detail how that works and then I'll dive into a couple of challenges and wrap up in just probably five or ten more minutes. So Bob is going to take his message that he wants to send to Alice and let's assume, actually, that there's some message identifier that either Alice and Bob have agreed on or Alice already knows. So a message identifier might be, for example, a URL, and the message that Bob is trying to put in might be the web page corresponding to that URL. Or if this is a particular message that Bob wants to communicate to Alice, the message ID is something they would have had to agree on magically somehow out of band. So there's a bootstrapping step that I'm hand-waving over here. But let's assume that there's an identifier associated with that message. Bob is going to take that message. He's going to take his cover traffic, in this case maybe a picture. He's going to embed the message in that picture, in that cover traffic. He's going to upload it to some user-generated content site, like the drop site, if you will, and Alice is basically just going to reverse this process to retrieve her message. 
So that's the high level picture, right. And there are a few challenges to making this work. One is figuring out how to embed the message, which actually I'm going to sort of skip over, because the techniques that we use here are fairly straightforward. We want to basically do things in such a way that it's hard for the censor to discover and also hard to disrupt. So we can use sort of standard image hiding techniques to make discovery difficult and we can use sort of more redundancy style techniques and redundancy and erasure coding techniques to make disruption tough. What I'm going to spend a little bit of time on is these latter two challenges. In particular, how does Bob figure out where he should put this thing. Like what content should he put it in? Where should he drop it so that Alice can find it? What we want to do is make the process of Alice fetching the cover traffic deniable, right. Something that she would do anyway so if the censor's watching, this would look just sort of normal. Okay. So where do we embed this, right. Of course, Alice could go looking everywhere. She could just download all of Flickr and look at every picture and see is my message here, is it there? No, it's not there. This is not an option, right. For a variety of reasons. So Alice and Bob somehow have to agree on some subset of content without immediately communicating with one another. They've got to do this, as I mentioned, in a way such that when Alice does this, it's deniable. So here's basically how we create that deniable embedding. What we do is we take these message identifiers, let's say a URL, right, and we put this basically into some ID space. What we want to do then is identify some kind of tasks that Alice would perform anyway. You pick some things that she would do, right, look at Bob's vacation photos or watch videos of cats or, you know, look at pictures of blue flowers or something. We put this in the ID space as well. 
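[Editor's sketch] The talk does not spell out how identifiers and tasks are placed in the ID space or how one is matched to the other. The sketch below assumes one plausible scheme borrowed from consistent hashing: both message identifiers and task descriptions are hashed onto the same 32-bit ring, and a message is assigned to the tasks that follow its position on the ring. The hash function, ring size, successor rule, function names and example tasks are all editorial assumptions.

```python
import hashlib
from bisect import bisect_left

SPACE_BITS = 32  # size of the shared ID space (assumed)


def to_id(text):
    """Hash a message identifier or task description into the ID space."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[: SPACE_BITS // 8], "big")


def tasks_for_message(message_id, tasks, count=2):
    """Pick the `count` tasks whose IDs follow the message's ID on the ring."""
    ring = sorted((to_id(task), task) for task in tasks)
    start = bisect_left(ring, (to_id(message_id), ""))
    picks = min(count, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(picks)]


tasks = [
    "search: blue flowers",
    "browse: bob's vacation photos",
    "watch: cat videos",
    "search: sunsets",
]
print(tasks_for_message("http://example.com/article", tasks))
```

Because the mapping depends only on the shared message identifier and the task list, Bob and Alice can each compute it independently: Bob learns where to drop the cover content and Alice learns which ordinary-looking tasks will retrieve it.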
What we do then is we sort of map the message identifier that corresponds to that content to the tasks that Alice would need to perform to retrieve the cover traffic which contains that content. For example, these tasks might be something like, as I mentioned, right, search for blue flowers or look for a particular set of images or videos. By doing those particular things, which she's likely to do anyway, then she's able to get the stuff that she really cares about. Okay. So as you might imagine -- so that's basically the general idea. As you might imagine, this does not perform super quickly. This is basically good for things like publishing an article or sending a message. Depending on how deniable you want this to be and how aggressive Alice is at fetching all kinds of stuff or how quickly she does this, this can take on the order of minutes to grab a fairly small message. But presumably, that's good enough for certain types of communication. Now, figuring out how to make this type of communication more realtime and yet deniable remains an open challenge. I want to spend a couple minutes closing up, talking about this last challenge. >>: Does Bob have to know something about Alice's activities? >> Nick Feamster: Alice needs to know something about where Bob is going to put these things. The way they do that is you can kind of view this mapping of tasks to identifiers as like a dictionary of sorts. So the thing that they both agree on, which is where the bootstrapping has to occur, is that common message identifier. So Bob is going to put stuff in a certain place based on that and Alice is going to fetch it, based on that. So that's sort of the common language that they have to speak in order for that to work. So I just want to -- yes? >>: If they can bootstrap, why don't they use the same mechanism to just exchange the message? >> Nick Feamster: Presumably, like the bootstrap mechanism is going to be smaller than the message itself. 
But you're right, you potentially need another mechanism to pass that bootstrapped information. And if you had a perfect bootstrapping mechanism, you wouldn't need such a system in the first place. But the idea here would be that you might not need to pass that over the network at all. So, for example, I could say hey, the message ID that I'm going to first, let's say that we meet in person, right, in a dark alley or something. And I say that the message ID that I'm going to first send you a message on is, like, 129. So now we're good, right. So based on that, now you can maybe fetch bigger messages. I could even send you a new set of message IDs or even a new mapping once we've got the initial bootstrap. I think that's the trick there is size. Okay. So just a couple of minutes talking about this last challenge, which I'll just pose, like, pose a position for you. I think we've seen in sort of recent times a lot of governments restricting communications in various ways. So, for example, we saw for example with the elections, the blocking of Twitter, et cetera, we saw the Egyptians completely shut down the network. I would posit that as governments get more savvy about how to use the network, they're not going to shut it down but rather use it to manipulate the information that we see. Or that citizens see. Because why would you shut it down if you could use it to sort of influence public opinion or tilt the outcome in your favor. So I think basically what I see as an ongoing challenge is manipulation of content and I'm going to talk briefly for a couple minutes about a more sort of benign version of manipulation. But something that I think is still relevant, and that's personalization. So in the best case, right, we're seeing many, many organizations take our activities, our preferences, et cetera, and use that information to sort of whittle down what we see. 
If we search for shoes, if we search for books or network, whatever, our past behavior, activities, et cetera, dictate the types of things that we're likely to see, based on those interests. Now, in the best case, right, we're seeing things where -- we're seeing results that potentially are already tweaked towards things that we already agree with or are already aligned with our own tastes and interests. And one could argue that's good. There are certainly positives to that. But there is a flip side to that as well, which is that we don't have control over how that's being done, and we actually don't even know sometimes what we're missing. So one of the things that I'm looking at now is how to provide users better visibility and control into how that sort of restriction or filtering is being done, if you will. I'm going to describe, this is very much a work in progress. I'm going to take like a minute or two to describe kind of the low-hanging fruit that we're doing here. So we've started off looking at search. In particular, when you search, your question is, what are other people seeing when they search for the same term? Or you might actually want to run the same query in different ways, maybe as different personas or something like that. So we're starting off with something very simple, which is to say take a query, and then run it from different geographies as different users and basically see what turns up, what shows up on the first page. Where does it show up? What doesn't show up? When it doesn't appear, can we explain those things? Okay. So in particular, I had Ratul in here. I used the wrong slide. Well, let's use Arvind. So you can -- I was going to pick on Ratul, since he wasn't here. So I'll give you a different example. So if we search for Arvind, a professor at the University of Washington, you'll basically get a bunch of search results. But what this Chrome program we've built does is also tell you other things that you didn't see. 
Other search results that came up in a particular user's top ten depending on where that search was run. So right now, we've basically built this tool, which is called Bobble, that allows you to basically see search results from other types of perspectives. In particular, we are basically just focused on geography so far with the tool. But as we work on this, we're expanding it to work on sort of how these queries differ based on your past search history and other types of contexts that you might imagine. So ultimately, this is basically just the first step, because we're looking at how to improve visibility. Ultimately, though, you might imagine a tool where a user actually provides some feedback or control into the types of results they see. So an example of that might be, I might want to run a query as a particular persona or as a particular group. So, for example, the results that I see when I search, say, for example, Seattle, right, I might want the results to differ based on whether I'm querying as a food connoisseur or marathon runner or networking researcher or what have you. I might want to see different types of things. So that's the type of direction I see this work going. So just in conclusion, I've talked about how different parties are vying for control of information on the internet in a variety of different ways and I've spent probably the majority of the time talking about work that we've done to combat message abuse, but I've also talked about other problems related to both censorship and just more recently looking at how different tools and algorithms may ultimately be used to manipulate the types of information that we do or don't see, and I'm working on ways to sort of tilt the balance back in the hands of the user as well. So thank you very much. >> Jaeyeon Jung: Any quick questions? >>: So message filtering has been a long history of just cat and mouse and cat and mouse. 
Particularly the content-based filtering seems to have merely trained the population to get, you know, [indiscernible]. Are there grounds for optimism that there are intrinsic things in the network level approaches that you described that would give us tools that are beyond the reach of their ability? >> Nick Feamster: Yeah, this is something that I think requires a little bit more formal study, but I would say informally, we'd like to look for features that are more costly for an adversary to adapt to. If we look at content, for example, it's fairly low cost to change the way a message gets encoded or embedded. If we look at network level features, adversaries can still adapt. In particular if we look at the behavioral features that I showed, you could certainly send spam, you know, in a less dense way. But presumably, I mean, lurking behind that, we assume that there is some cost to adaptation. >>: And -- On both sides? >> Nick Feamster: On both sides, right. To take that example, in particular, if you were to say that you can only send spam from, you know, a certain number of IP addresses within a range before you sort of trip some detectors, then presumably you've imposed some kind of cost, maybe in reduced volume, certainly for a region of space. I don't know how to model that cost. I think it would be a very interesting problem to sort of try to figure out, okay, now can we actually model an adversary. On the flip side, actually, I think on the censor side, as well, you could ask the exact same question. It's like, yes, but how do we know that the censor isn't going to adapt to try to detect the techniques that I showed. And I think, again, there's like a really interesting question in trying to model the capabilities of the adversary. In theory, the detector in that case is unbounded. It could do just about anything. It could look at the fact that you never send mail or you never browse the web at 3:00 a.m. 
on Sundays. >>: It's not an economic motivation so it's harder to say what level [indiscernible] go to greater lengths than the spammer [indiscernible]. >> Nick Feamster: Exactly, um-hmm. It is potentially a tougher thing to model, particularly if you think -- if you feel the government has unbounded resources to throw at the problem, then it certainly is trickier. You could potentially do the economic model if you talk about, instead of a government with potentially unbounded resources, if you talk about, say, folks who are interested in DRM, for example, if they have a certain amount -- I mean, there are economics at play in those kind of situations. >>: [inaudible]. >> Nick Feamster: Seemingly, yeah. But I think on either problem that you look at, there's sort of like more work to be done in terms of modeling the adversary. >> Jaeyeon Jung: Okay. Let's thank Nick one more time.