>> Jaeyeon Jung: It's my great pleasure to introduce professor Nick Feamster.
Nick is a tenured associate professor at Georgia Tech, and he's been working on
a lot of interesting research problems related to network systems, including
network architecture, protocol design, security, management, measurement,
anti-censorship.
So to many of you, Nick is not a stranger, right? But to properly introduce
him, I downloaded his most recent CV, which is 43 pages long. So I spent some
time studying his CV. And so Nick has published nine journal papers, 39
refereed conference papers, 27 workshop papers, and has won five best paper
awards. And he has received many prestigious awards, including the NSF
Presidential Early Career Award for Scientists and Engineers, and the usual
things like, you know, the CAREER award, Sloan fellowship and IBM faculty
award, whatnot.
You might think that oh, he must be out there for many years, but to give you
a reference point, he was chosen as an MIT Technology Review top innovator
under 35 in 2010. And to add a little bit of a personal touch, Nick and I
shared an office at MIT for many years, and seeing him working day and night,
weekends, weekdays, I thought I was the biggest slacker in the world. After
reading his CV today, I feel the same way again. Oh, well.
So okay, without further ado, Nick.
>> Nick Feamster: Thanks for the introduction. I didn't know any of those
statistics. Great, so hopefully I can teach you some things today about what I
call the battle for control of online communication. This work, you'll see,
has several facets. And it's joint work with several of my students, as well
as some other faculty members at Georgia Tech who work in both areas of
security and machine learning.
So we'll see where I bring those elements of computer science into the work
that I do in various portions of the talk, and I'll try to point those out as
we go. Briefly, before I get into the technical part of the talk, I just want
to talk to you sort of broadly about what all those papers are kind of about,
and the general approach I take to my work.
So generally speaking, I perceive problems in an area that I call network
operations and security, and the idea there is basically like how do you design
tools, algorithms, techniques to help people run networks better and, in
particular, secure them better, make them perform better, make them easier to
troubleshoot when things go wrong, et cetera.
And the flavor of work that I tend to do, basically, draws inspiration for the
problems from domain knowledge. So I like to basically talk and sort of
interact extensively with people in the network operations community and other
folks in the trenches. For example, for those of you who are familiar with,
who have worked in networking, you've, of course, heard of the North American
network operators group. I talk with operators there to sort of identify
practical problems with hard underlying questions. There are other groups I
tend to work with as well, like the message abuse working group, as well as
campus network operators and so forth.
So I try to basically spend some time with people in the real world of networks
to try to understand what sort of practical problems exist that potentially
have some kind of hard underlying question. Those problems tend to be pretty
messy in practice, of course. So then basically what I aim to do with my
research is to try to abstract those problems and model them and reduce them to
problems that are easier for us to understand and model.
And I'll present several examples of how we do that in the problems that
I'll discuss today. Then, typically, in trying to derive solutions to these problems,
I basically try to draw on a number of techniques from other areas of computer
science, not just networking. So in particular, as I mentioned, security is
another area. But also machine learning as well, and I'll describe that in a
little bit more detail.
Having sort of come up with a solution, however, I find this not quite the end
of the story. What I then try to do with that solution is to try to engage
with industry and transfer what we've come up with on paper into practice in a
number of ways. In particular, again, I'll point out in various aspects of the
work where we've basically taken some of the algorithms and features that I've
identified in various contexts and worked with a bunch of folks in the industry
to transfer some of our findings into practice.
So that's kind of the general approach I take to problems and sort of broadly
speaking, there are a number of, if you sort of go to the trenches and sort of
figure out what, ask what kind of hard problems exist, there are a number of
hard problems that -- or practical problems with hard underlying questions that
I've pursued. In particular, this is the one that I'll focus on today, which
is like there's this balance between how the internet has been designed to be
open and that openness actually on the flip side makes it, you know, more prone
to attacks of various kinds and figuring out ways to sort of appropriately
balance that is sort of the general sort of high level theme to the talk.
There are a bunch of other problems that I've worked on with my students and if
there's time towards the end of the talk, which maybe there won't be, I will
elaborate on some of those as well.
So that's the hard question that I want to focus on today, and I just want to
briefly point out some of the areas that you'll see in today's talk. So I will
spend my time focusing a lot on spam filtering and message abuse. In that
work, we've basically drawn on some techniques from the theory community and
machine learning community to develop techniques to sort of combat problems in
that area.
Also, I'll spend some time talking about problems in anti-censorship and
maintaining availability in the face of adversaries who want to disrupt
communication and, again, I've sort of worked with folks across areas in that
domain. And I'll talk a little bit about some more recent work that we've been
doing with propaganda and filtering.
So I think the sort of powerful facet to this line of work is that there are a
lot of adversaries out there who would wish to sort of disrupt communication,
abuse the network for their own purposes, et cetera. And the good news is that
we can use tools from various aspects of computer science to in some sense
level the playing field.
Okay. So let me come back a little bit to this bigger question, like the sort
of tension between the openness of the internet design and how that openness
sort of makes it more vulnerable to attack. First, what do I mean by openness?
I think this quote from the director of the media lab, which is in the New York
Times not too long ago really defines sort of the ethos of the internet design.
Which is to say that everyone should be able to connect, to innovate, to
program without asking anyone's permission. There's no central control, and
the assets are widely distributed. There isn't one particular owner.
So that's good. There are many good things about that, and the positive aspect
of that is that this openness has catalyzed just a huge amount of innovation.
So the number of users on the internet is growing. The internet is expanding
to many, many different geographies. And, of course, we're seeing the ability
to connect from all kinds of different devices.
The flip side, though, so what I want to talk about in this talk is the flip
side of that coin, which is how openness facilitates abuse and manipulation.
>>: Are you assuming a causative link where it actually may not exist? Are
you thinking of just the rise of [indiscernible] -- essentially, the pace of
that is fascinating and those platforms, by traditional measures, closed. But
they're still being adopted. The number of uses they're getting is humongous
and [indiscernible] cell phones or feature phones. You couldn't change them in
any way. And the adoption rates were way better than the internet.
So what is the role of openness here if those platforms are not open?
>> Nick Feamster: I think specifically here, we can think of openness being
the IP stack. So in particular, even though, you know, many aspects of those
platforms remain closed, if you're able to implement an IP stack, you can get a
device online. And I think that's, I guess, to focus the attention, that's
really what I'm talking about here.
>>: But feature phones took off without an IP stack.

>> Nick Feamster: Feature phones?

>>: Like regular dumb phones. They took off without an IP stack.
>> Nick Feamster: Sure. I wouldn't say like this is a necessary condition for
growth, right. I mean, there are plenty of technologies and platforms that
have grown without being open. I would say that in the case of the internet,
though, it's certainly acted as a catalyst. There may be other things that
have caused the growth as well. But certainly, I don't think you can argue
that it has hurt.
>>: I could, but --

>> Nick Feamster: Good, okay. Well, but so that's not really what I want to
talk about today. I want to focus on the flip side of the coin, which is by
virtue of the fact that the internet is open, and specifically to what I was
just mentioning, the fact that pretty much anyone with an IP stack can connect
and start to send traffic, that facilitates abuse, manipulation of different
kinds.
And that's sort of the central question that I want to focus on today. I want
to talk about this tension in the context of a couple of problems. And the
first is in securing communications. In particular, I'll speak about message
abuse. So depending on the statistics that you choose to believe, anywhere
from something like 80 percent to 95 percent of email traffic is spam. While
you may not see it in your inbox, the network operators who are running those
services definitely do see it and they have to do something about it so that it
doesn't end up in your, right in front of your face.
So this remains sort of a continually vexing problem, and I'll spend
probably the bigger balance of the talk talking about things that we've done in
that area to sort of combat that.
The second topic that I want to speak about is, you know, on the flip side,
maintaining openness. How do we ensure or how do we help parties communicate
in the face of organizations, countries, governments, et cetera who would wish
to block or disrupt that kind of communication. So you may or may not know
that something like 60 countries around the world control or censor internet
communications in some form. So this is a problem that's fairly pervasive for
citizens of many countries.
I won't spend too much time on this last topic, but it's something that's
become a recent interest of mine. One of the more subtle aspects or facets of
information control is not just the decision to block or permit a certain type
of communication, but rather the
potential to say manipulate what a particular user sees when they go searching
for a particular thing or when they read a particular piece of content or news
story or blog post, et cetera. And I will talk a little bit about some more
recent work that we've been doing to try to help maintain transparency so that
users can, hopefully, become more aware of those types of manipulations.
So let me jump into the first topic of spam filtering. So as I mentioned
already, spam is certainly a nuisance. It's becoming less of a nuisance for
us, because we hardly see it. But just because you don't see it doesn't mean
it isn't there. It still remains about 95 percent of all email traffic, and a
significant fraction of that spam is coming from creative forms, you might say.
I'll explain in just a little bit why that's relevant to this particular talk.
So the other thing, I guess, that's relevant is that a lot of this spam is
coming from compromised machines or networks of compromised machines are
commonly called botnets. And on one hand that's a bit of a scourge, but on
the other hand we're going to be able to use that to our advantage when we talk
about how to separate the good from the bad here.
So the general approach to the problem of spam is let's filter it, right. This
is sort of the obvious thing, right. Obviously, you want to basically take the
unwanted traffic and tease it apart from the bad stuff. I'm sorry, you want to
take the good stuff and tease it apart from the bad stuff.

The question, of course, is --

>>: You got it right the first time.
>> Nick Feamster: Exactly. And the question, then, is what features best
differentiate the spam from the legitimate mail. There have been -- this is
not a new question. This has been studied since, essentially, the advent of
email. And there's a large body of work in various types of approaches to this
problem. I'll talk about a couple of existing approaches to this problem and
sort of where that leaves, why there's still some room for improvement even
given these techniques.
The first, and I'll go into this just briefly in the upcoming slides, is
content-based filtering. So you can, for example, design a filter that looks
at the content of a message; i.e., what's being said, and try to figure out
based on what's in the message whether or not this is something that the user
is going to want to see or not. The other thing you could do is, you
you know, if there's a mail server connecting to your receiving mail server,
you could look at the IP address of the sender and try to put that on a black
list. So you could develop a reputation for that IP address and say based on
the behavior of this IP address in the past, I think this is good or bad.
The approach that we take and the approach I'm going to focus on in this talk
is complementary, and that's basically to say that we can also look at features
of behavior, right. So we can basically say not just what's being said and
who's sending it, but how is the message being sent, in terms of what time of
day it's being sent, what ISP it's coming from, what other kinds of behavioral
patterns we can see just in the network traffic that stand out. The intuition
here is that spammers fundamentally act in ways that differ from the way you
or I act, and we should be able to, if
we can identify what those features are and how they stand out, we can key off
of those to design filters as well.
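To make that idea concrete, here is a minimal sketch of extracting such a behavioral feature vector from a single SMTP connection record. The record format and field names are hypothetical, not from the talk; the point is just that features like send time and originating AS come from the network traffic rather than the message body.

```python
from datetime import datetime, timezone

def behavioral_features(record):
    # Extract network-level behavioral features from one (hypothetical)
    # SMTP connection record: when it was sent and where it came from,
    # rather than what the message says.
    ts = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    return {
        "hour_of_day": ts.hour,       # time-of-day sending pattern
        "weekday": ts.weekday(),      # day-of-week sending pattern
        "sender_asn": record["asn"],  # which ISP/AS originated it
        "msg_size_kb": record["bytes"] / 1024.0,
    }

# One toy connection record (Unix timestamp, AS number, message size)
rec = {"timestamp": 1300000000, "asn": 65001, "bytes": 2048}
feats = behavioral_features(rec)
```

A vector like this is what a supervised classifier of the kind discussed later would consume.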
So let me first talk a little bit about the other two, quickly talk about the
other two approaches and kind of where they leave some room for improvement.
So content-based filters, as I mentioned, look at what's being said, and one of
the things to realize about that, if you sort of talk to the operator of a
large mail service provider, for example folks we've talked to include Yahoo,
Secure Computing, et cetera, what they will tell you is there's something like
100,000 different ways of spelling Viagra. That's sort of just to illustrate
the point of how difficult this type of thing can be, right, and how asymmetric
the attack is, right.
So here are some examples of other ways that spammers use content to sort of
turn the battle in their favor. So they can take a message and embed it in a
PDF or an excel spreadsheet or an image or even an MP3, and on one side, it's
fairly easy for a spammer to embed a message in a new type of carrier, if you
will. On the flip side, the filter maintainers have to design ways to
understand, parse, extract, et cetera from different types of content. So this
is certainly something that email service providers spend a lot of time doing,
but it's definitely an aspect where the battle is a little bit tilted in the
spammer's favor due to the fact that, you know, it's relatively easy to evade
these content filters in comparison to sort of updating the capabilities of that
filter.
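As an illustration of that asymmetry, the sketch below is a deliberately naive token-based content filter; the token list is made up. A single character substitution is enough to slip past the exact-match check, while the filter side has to keep adding rules.

```python
import re

SPAM_TOKENS = {"viagra", "pharmacy"}  # toy blocklist, for illustration only

def token_filter(message):
    # Flag the message if any lowercased alphabetic token is on the list.
    tokens = re.findall(r"[a-z]+", message.lower())
    return any(t in SPAM_TOKENS for t in tokens)

caught = token_filter("cheap viagra here")   # exact token: caught
missed = token_filter("cheap v1agr@ here")   # obfuscated spelling: slips through
```

Real content filters are far more sophisticated, but the underlying cost imbalance, cheap evasion versus expensive rule maintenance, is the same.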
The second approach, as I mentioned, is you could take the IP addresses and put
it on -- assign a reputation to that IP address. If you look at your mail
headers, you would see something like what's called a received mail header. Of
course, if this message is coming from a spammer, you'd see a string of these
things in the mail header and a lot of them would be forged. But at least one
would think, I'll explain to you actually in a few slides why this is not
always the case, but one would think that in most cases, this IP address that's
completing a TCP three-way handshake with you is the IP address of someone you think it
is. And if I could -- if the recipient could keep track of the behavior of any
of those particular IP addresses that are connecting to the server, then they
can decide what they think of that particular IP address. Is this a likely
spammer or legitimate sender.
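A minimal sketch of that bookkeeping might look like the following: count spam and legitimate deliveries per sender IP and score each address by its observed spam fraction. This illustrates the idea only; it is not any production blacklist's algorithm.

```python
from collections import defaultdict

class IPReputation:
    # Toy per-IP reputation: tally spam vs. ham per sender address and
    # score each IP by the fraction of its mail that was spam.
    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # ip -> [spam, ham]

    def record(self, ip, is_spam):
        self.counts[ip][0 if is_spam else 1] += 1

    def spam_fraction(self, ip, prior=0.5):
        spam, ham = self.counts[ip]
        if spam + ham == 0:
            return prior  # never-seen IP: no evidence either way
        return spam / (spam + ham)

rep = IPReputation()
rep.record("192.0.2.1", True)
rep.record("192.0.2.1", True)
rep.record("192.0.2.1", False)
```

The `prior` for unseen IPs is the crux: as the next paragraphs note, a brand-new sender address carries no history at all, which is exactly where this scheme struggles.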
Now, that actually works pretty well. There are large organizations that have
done pretty well at maintaining these kind of black lists. But again, this is
a bit of a cat and mouse game. And the challenge here is that the IP addresses
of email senders are never the same on any two given days, shall we say.
One of the experiments that we did actually to study the behavior of these
senders is actually we set up what's called a spam trap or a spam honey pot, if
you will. It was a mail server with several domains that had no legitimate
email addresses. So what a typical mail server would do is basically just
reject any attempts to send to nonexistent email addresses. What we did in
this case was basically accept any connection attempt and said okay, thank you
very much. We will deliver that. In fact, we're delivering it to our spool,
but to no one in particular and then gathering statistics on who's talking to
us.
When we did that, we basically see that on any given day, there are about 10
percent of these senders coming from IP addresses that we haven't seen in the
past. So there's that sort of churn on the black hat side, if you will, the bad
side of things, and there are possible causes for that type of thing happening.
We can't necessarily attribute the cause of the churn to any one thing in
particular. But there are, you know, malicious reasons for IP addresses
changing on any given day. But I'll take the question in just a minute. But
there are also good reasons why you might see email from an IP address you've
never seen before.
So, for example, the renumbering of a mail server or someone just decides to
set up a mail server, you know, on any particular day that they hadn't been
operating in the past so there are good reasons for IP addresses to suddenly
start sending mail as well. So you can't just say let's just black list
everything that's new.
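The churn measurement itself reduces to simple set bookkeeping over the spam-trap logs. A sketch, with toy data rather than the actual trap's numbers:

```python
def daily_new_sender_fraction(daily_ips):
    # Given a list of per-day sets of sender IPs (as observed at a spam
    # trap), compute the fraction of each day's senders never seen on
    # any previous day.
    seen = set()
    fractions = []
    for day in daily_ips:
        new = day - seen
        fractions.append(len(new) / len(day) if day else 0.0)
        seen |= day
    return fractions

days = [
    {"a", "b", "c", "d"},   # day 1: everything is new by definition
    {"a", "b", "c", "e"},   # day 2: one of four senders is new
]
daily_new_sender_fraction(days)   # -> [1.0, 0.25]
```

The roughly 10 percent figure quoted above is this fraction, steady-state, over the trap's observation window.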
So coming back to the sort of goal of openness and the desire to keep your
false positives low, the ephemerality also becomes a problem.
>>: You can't paint 95 percent of the internet black and still call it open.

>> Nick Feamster: Exactly. So that's essentially where this leaves some room
for improvement.
>>: So you started off saying that, and it's true that most of us don't see a
lot of spam because of [indiscernible] getting pretty good at these approaches.
Who actually does see the spam? And why are there mail servers not doing even
these basic thing, which would cut down most of the spam [indiscernible].
>> Nick Feamster: So these basic things actually turn out to cut about 80
percent of connection attempts. So you can basically take an IP black list
and -- it's a not so well known fact, I guess, that operational mail servers do
drop about 80 percent of the incoming connections, just based on things like IP
reputation.
So the dirty secret behind what I'm presenting to you here is actually that the
gains are somewhat incremental, because you're taking that 80 percent that is
already -- you're taking the 20 percent on which the early decision hasn't been
made and you're basically trying to crank that up a bit. So I think the answer
to your question is that, yeah, already this is being done to quite some
degree, and we're basically looking for other features, et cetera, that can
help us gain additional advantage.
>>: But then if no one's seeing the spam, then shouldn't the spam rates go
down to zero?
>> Nick Feamster: Well, the issue is that actually, there's still about 20
percent of the connections that do, you know that do get accepted. Now, of
that, then content filters, et cetera, get applied. So you may not be seeing
-- there's some fraction of that that you do see, it's actually quite small.
But then the other stuff that you never see also presents a problem as well,
because you've got to store it.
So there are operational challenges as well. Like once you've basically
decided to accept a message for delivery, you've got to do something with it.
And the more you can basically shave that down as well, the better off the
operators of the service are as well.
>>: A follow-up on the [indiscernible] question. Maybe some people do click
those [indiscernible] ads. So maybe my spam is not his spam. So as long as
those people exist, it may be hard to eliminate spam because some people do
want to receive --

>> Nick Feamster: I think if no one were clicking on them, then obviously that
would be end of story, right. But there has to be some small fraction of
people who are actually buying the stuff, yeah.
>>: What was the scale of the gathering here? 10 percent new every day, it's
hard to imagine that Hotmail and Gmail are seeing that level of new IP
addresses. How long was your time window and kind of what did you do to
advertise these?
>> Nick Feamster: So this is like, so advertising is a tricky thing, actually.
We basically, as it turns out, a lot of the baiting seems to come from whois
scraping, right, looking at newly registered domains, because we did actually
put the domains out there. And to not much effect for a while, actually.
Basically, so what was the question, the scale?
>>: Did it actually plateau?

>> Nick Feamster: So we did this over the course of about four months. So
yeah, you're right that eventually it's bound to plateau because there are a
limited number of IP addresses out there.

>>: That was your plateau rate, right?

>> Nick Feamster: Yeah, that was our plateau rate.

>>: So Gmail would see more?
>> Nick Feamster: Eventually, they'd have to because they're going to see a
lot more on any given day. In fact, maybe they plateau after a day. Yeah.
Okay. So that's basically where the state of the art is and where there's some
room for improvement.
So as I mentioned, this is essentially the approach that we're taking. And if
you're going to talk about looking at network level behavior to try to do
detection, the obvious question then is, well, what's different about the
spammers, right. So sounds nice, right. Intuitively, spammers aren't like us.
Presumably, they should behave differently. But what exactly is it about them
that looks different?
And what I'm going to talk about is three different ways we've observed
spammers behaving differently from legitimate senders. And intuitively they
make sense and I'm going to try to drill down into each of these and show how
we've taken these kind of axioms, if you will, and derived more features, more
low-level features and detection methods based on this intuition.
The first is what I call agility. The idea here is that spammers actually have
to move to escape detection. So if spammers always sent from the same IP
addresses, sent email from the same IP address, if they always hosted their
pill sites and phishing sites at the same URLs or the same domains, et cetera, then
eventually all those places would end up on black lists or shut down, and
things wouldn't work so well.
So spammers actually have to move around to escape detection. So on the one
hand that's kind of inconvenient, right, because you create this cat and mouse
game where you continually have to, with the techniques I've already described,
you continually have to update black lists and so forth to keep up with that.
But on the flip side, what we can do is actually recognize that the way that
spammers change where they're doing things, where they're performing their
activities, differs from the way that anything else on the internet changes.
And I'm going to basically use that intuition to show you some particular
features that really stand out in terms of the spammers' behavior.
And the second is that spammers just send mail in ways that you and I don't.
So just in terms of the way that they send messages to people just look a lot
different. And the other sort of keys, the last one sort of keys off the idea
or the observation that a lot of spam is coming from these botnets. These
networks of compromised machines. As a result of that, what we can see is some
coordination that wouldn't otherwise pop out from groups of legitimate senders.
The obvious thing here, right, and one of the things that we study is that
sending behavior actually exhibits some coordination. But I'll show you
actually another pretty cool and interesting example of coordination that also
popped out as well when we get to that part of the talk.
So I'm going to first talk about agility. In particular, I'll talk about how
spammers have used various internet protocols to move around. Then I'll talk
about how we've -- different aspects of spammer behavior that look different
from legitimate senders and how we built supervised learning classifiers on top
of that to help differentiate spammer behavior from that of legitimate senders
and then I'll talk about some of the behaviors that tend to cluster well.
Just to sort of whet your appetite here, one of the coordination behaviors
actually that we'll key off of is that spammers actually send mail to
themselves so think about that for a while and then I'll come back to it.
So let's first talk about agility. One of the things that we observed, and
this comes back to the data collection method that I spoke about before. So
basically, what we could do is set up spam honey pot if you will, or spam trap
and see who is sending us messages. The other thing we can do is sort of join
that with our view of what the internet routing table looks like at any
particular point. And then ask is there any kind of correlation there between
the two things that we observe.
So here's something that we see, and I'll point these things out as I walk
through this example. So when we look at the internet routing table, one of
the things that we actually see is an advertisement for an IP prefix that lasted
only about ten minutes. For those of you who don't know what BGP is, by the
way, I should have just mentioned. It stands for Border Gateway Protocol. And
this is the language that ISPs use to talk to one another to advertise
reachability to a range of IP addresses. So they can say, hey, I know how to
reach this range of IP addresses. Please send your traffic through me to get
there.
Okay, so what we see here between this red dot and the blue dot: the red dot is
an example of an ISP saying come through me to reach this set of IP addresses,
and the blue dot is, like, ten minutes later, a retraction of that
statement. It's called a withdrawal message. And this is already looking kind
of weird, right. Because you see a range of IP addresses that's advertised for
a very short period of time. When we run networks, typically we like to have
our network up for more than ten minutes, right? So already this is kind of
looking kind of a little bit strange.
Then the next thing we saw is if we sort of look at what's happening in that
range of IP addresses, in terms of who's trying to talk to us, we saw something
kind of interesting. We saw in this case five different -- in this particular
episode, five different IP addresses contained within that part of the network
that are talking to us, like sending us spam.
So this is pretty strange, right? You have a network where the reachability is
extremely short-lived. And then inside that ten-minute window,
you see some activity.
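Detecting this pattern amounts to joining two data feeds: BGP announce/withdraw pairs and spam arrival times. The following sketch shows the join; the route and spam records are invented for illustration, and real BGP feeds are of course messier than (prefix, announce, withdraw) triples.

```python
import ipaddress

def spam_during_short_announcements(routes, spam_events, max_secs=600):
    # Flag spam whose source IP falls inside a prefix that was announced
    # and withdrawn within max_secs (e.g. ten minutes).
    # routes: list of (prefix_str, announce_ts, withdraw_ts)
    # spam_events: list of (ip_str, ts)
    short = [(ipaddress.ip_network(p), a, w)
             for p, a, w in routes if w - a <= max_secs]
    hits = []
    for ip, ts in spam_events:
        addr = ipaddress.ip_address(ip)
        for net, a, w in short:
            if addr in net and a <= ts <= w:
                hits.append((ip, str(net)))
    return hits

routes = [("198.51.100.0/24", 1000, 1500),   # 500-second announcement
          ("203.0.113.0/24", 1000, 90000)]   # long-lived, ignored
spam = [("198.51.100.7", 1200), ("203.0.113.9", 1200)]
```

Here only the spam that arrived inside the short-lived /24's ten-minute window is flagged.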
Now, that's weird enough in and of itself. But then if I were to ask you if
you were to steal a region of IP address space, would you steal a big region or
a small region? And we thought really, you know, probably you'd steal a small
region of IP addresses because, smaller, people are less likely to notice. In
fact, actually, with this particular behavior we observed the opposite.
So we actually saw these short-lived announcements popping up for, like, huge
regions of IP addresses, slash-eights, or about 1/256 of the entire internet
address space was being advertised in this sort of short-lived way. And you're
thinking like, what the heck, isn't someone going to notice if someone like
steals 1/256 of the internet?
Well, this is kind of brilliant attack-wise, right, because of the way internet
routing works, we know for those of you who know how it works, we know that it
works on what's called longest prefix match. The idea is that if there's some
network that's advertising a more specific range of addresses then that's
always going to win. The routers will continue to forward to the guy who's
advertised the more specific space.
Meanwhile, the attacker who has grabbed this sort of less specific space
suddenly got a huge chunk of addresses that isn't likely to be filtered because
it's a big chunk. Short prefixes tend not to be filtered. These things are
actually allocated as well so it doesn't look too fishy. At least no fishier
than it might otherwise look and they get a huge chunk of addresses. Anything
that's not being advertised by some other existing network suddenly basically
they've owned.
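Longest prefix match is easy to see in miniature. In the sketch below, a legitimate /16 coexists with a hijacked covering /8: traffic to the /16 still reaches its rightful owner, but every otherwise-unrouted address inside the /8 now goes to the attacker. The table contents are made up; the selection rule is the standard one.

```python
import ipaddress

def longest_prefix_match(dest, table):
    # Return the most-specific routing-table entry covering dest,
    # mimicking how routers select routes (longest prefix wins).
    addr = ipaddress.ip_address(dest)
    covering = [n for n in map(ipaddress.ip_network, table) if addr in n]
    return str(max(covering, key=lambda n: n.prefixlen)) if covering else None

table = ["10.0.0.0/8", "10.20.0.0/16"]   # hijacked /8 plus a legitimate /16
longest_prefix_match("10.20.1.1", table)  # -> "10.20.0.0/16" (legitimate owner)
longest_prefix_match("10.99.1.1", table)  # -> "10.0.0.0/8" (hijacker wins)
```

This is exactly why the short prefix is such a convenient thing to grab: it never disturbs any existing, more specific route.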
So that tends to be convenient.
>>: So aren't these attackers sophisticated enough to look at the BGP table
from route or somewhere else and pick out more specific blocks that are not
being advertised and just advertise those?
>> Nick Feamster: I think they probably could. It seems like that might work.
One thing, though, like one reason that might not work is that sometimes
operators do set up their filters to filter more specific regions if they turn
out to be shorter. So that might be one reason why it might not work. But
actually, I wouldn't say that it's -- I can't say that it's not happening as
well. It may be happening as well. Good question.

>>: It seems the study was done six years ago. Has the situation changed?
>> Nick Feamster: So we most recently looked at this last year as well. So
there's still a significant number of short-lived IP addresses that are also
sourcing attack traffic of different kind, both spam and other types of scans.
So I haven't looked at it in the last year, but this is like behavior persisted
up until about a year ago.
>>:
So how do you propose to defend against this?
>> Nick Feamster: That's a tricky question, actually. It touches a little bit
on the later part of the talk. I could spend quite a bit of time talking about
the philosophy here. I mean, obvious thing to do is to sort of design better
filters or, I'm sorry, be more vigilant about updating the filters. Of course,
we know that that's tricky. But that would be kind of the ideal situation.
Another thing you could do is talk about something like secure BGP, right,
where you can't actually advertise a prefix without it actually being assigned
and attributed to a particular network.
I actually think that there are some stones unturned there as far as like that
isn't necessarily a panacea. Then basically who's deciding who is allowed to
announce what? I think that doesn't necessarily solve things. You potentially
create a situation where things are less open.
So I think the right answer, which is probably unattainable, is have operators
be more vigilant about updating their filters. But maybe there's a better
answer to that that we just haven't thought of yet.
Okay so that's one example. Another thing that I mentioned that I was going to
talk about was when spammers send messages. They obviously want someone to
click on something and buy something. In order to do that, they need to host a
site somewhere. They need to host the website that's the Canadian pharmacy or
the phishing site or what have you. The problem with just
hosting a site in any particular place is that if you leave it in that place
for too long, the infrastructure will be blacklisted and shut down.
So what attackers say about that, or what they do in response to that is
actually use the naming infrastructure to move their infrastructure around. So
this is a picture taken from the honey net project. And basically there are
two ways that this can be done. One is that you can just use the DNS in the
normal load balancing kind of way, when a client looks up a particular domain,
you can return different IP addresses. So you can change the IP addresses in
the A-record. And that's often called single-flux. It's kind of like a black
hat load balancing.
Now, the problem with that approach is actually that this thing right here,
what's called the authoritative name server for that particular domain isn't
moving around as well. So if I were to try to identify what's going on here
and shut things down, I could black list this authoritative name server.
What the attackers do in response to that is just take this thing and move it
on to a bot-net and start moving this thing around so you can no longer black
list the IP address of an authoritative name server.
So on the one hand, that's kind of inconvenient. But on the other hand, you
can imagine that there are not very many legitimately operated networks that
perform this type of behavior. So in particular, what we can do then is look
for cases where the infrastructure, in particular the IP address of this stuff
up here, the IP address of the authoritative name server, is moving around. And
actually, this is work that Jaeyeon and I did with a student of mine several
years ago.
So what we did is we looked at the domains coming into our spam trap, and we
repeatedly queried those domains and asked how often it is the case that the
authoritative name server for that domain is moving around. So this is just
one result from that study.
What we can see here, and this is basically a CDF of the inter-arrival time
between the changes at the authoritative name servers in the hierarchy, and
you can see the red line is basically the domains that are coming into our
spam trap. You can see that in about half of those cases, the IP address of
the authoritative name server is changing about once every six hours. So that's not
something you expect to see on a legitimate network.
And as you can see, it differs quite a lot from the legitimate domains.
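To make that measurement concrete, here is a minimal Python sketch of the idea: repeatedly poll a domain's authoritative name server IPs and compute the inter-arrival times between changes. The observation data here is entirely made up for illustration; the actual study used domains arriving at a spam trap.

```python
from datetime import datetime, timedelta

def change_interarrivals(observations):
    """Given a time-ordered list of (timestamp, ns_ip_set) observations
    for one domain, return the gaps between successive NS IP changes."""
    gaps, last_change, prev_ips = [], None, None
    for ts, ips in observations:
        if prev_ips is not None and ips != prev_ips:
            if last_change is not None:
                gaps.append(ts - last_change)
            last_change = ts
        prev_ips = ips
    return gaps

# Hypothetical observations: NS IPs for a spam-advertised domain,
# polled every few hours and changing roughly every six.
t0 = datetime(2011, 1, 1)
obs = [
    (t0,                      {"192.0.2.1"}),
    (t0 + timedelta(hours=3), {"192.0.2.1"}),     # no change
    (t0 + timedelta(hours=6), {"198.51.100.7"}),  # change
    (t0 + timedelta(hours=12), {"203.0.113.9"}),  # change
    (t0 + timedelta(hours=18), {"192.0.2.44"}),   # change
]
gaps = change_interarrivals(obs)
print([g.total_seconds() / 3600 for g in gaps])  # hours between changes
```

A CDF of these gaps across many domains is what the plot shows: for spam-trap domains, roughly half have a gap of about six hours.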
So that's one type of DNS agility. Another, of course, is that the attackers
can't continue to use the same domains, because the domain name itself
is also going to end up on a blacklist eventually too. So they've got to
continually register new domains. So that's inconvenient. But what we can do
on the flip side there is look for what's different about these new domains.
on the flip side there is look for what's different about these new domains.
Well, to get you thinking about that, I can ask what happens if you register a
domain. Who looks it up? Typically, nobody. When you first register a domain
and no one's heard of it, no one's looking it up except for you. What happens
when an attacker registers a new domain? Well, it might get enlisted as part
of a scam campaign. It might be used for bot-net command and control. So we
can use the initial look-up behavior to provide an early reputation for some of
these newly registered domains. And that's what we did.
So for this, of course, you need special data, if you will. You need a special
vantage point. So we did this in collaboration with some folks from VeriSign,
who have a nice view of the recursive resolvers looking up second-level domains
in dot-com and dot-net. We asked for those newly registered domains, who is
looking them up within the first week of registration? And by who, I mean how
many distinct slash 24 networks. And you can see in this case
that for about 40 percent of those newly
registered domains, there's something like several hundred unique slash 24
networks or more looking it up almost right away. And that essentially almost
never happens with these legitimately registered domains.
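The early-lookup feature described above reduces to counting distinct /24 networks querying a domain in its first week. Here is a small Python sketch of that aggregation; the domains and resolver IPs are invented stand-ins, not data from the VeriSign vantage point.

```python
import ipaddress
from collections import defaultdict

def distinct_slash24s(lookups):
    """Map each newly registered domain to the number of distinct /24
    networks whose resolvers queried it (e.g. within week one)."""
    nets = defaultdict(set)
    for domain, resolver_ip in lookups:
        net = ipaddress.ip_network(f"{resolver_ip}/24", strict=False)
        nets[domain].add(net)
    return {d: len(s) for d, s in nets.items()}

# Hypothetical lookup log: (domain, resolver IP) pairs in week one.
log = [
    ("botnet-c2.example.com", "192.0.2.10"),
    ("botnet-c2.example.com", "192.0.2.200"),   # same /24 as above
    ("botnet-c2.example.com", "198.51.100.3"),
    ("botnet-c2.example.com", "203.0.113.77"),
    ("new-bakery.example.net", "192.0.2.10"),
]
print(distinct_slash24s(log))
# the C&C-style domain is queried from 3 distinct /24s, the benign one from 1
```

A domain crossing some threshold of distinct /24s right after registration is the signal used for early reputation.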
Okay. So now I want to talk a little bit about this second axiom, which is
that the way that spammers actually send mail differs from the way that you and
I tend to send mail. What we essentially did in this part of the work was come
up with a supervised classifier based on supervised learning to distinguish
spammers from legitimate senders. And the challenge here becomes how do you
identify the features, the behavioral features that differ between legitimate
senders and spammers?
What I'm going to do is to show you a couple of highlights, because there are a
bunch of features that tend to work well. A lot of them are kind of obvious
and boring. So, for example, one of the things you can do is look at the ISP
or the AS of the sender. And that tends to work pretty well. But I'll focus
on a couple of the ones that are more interesting, because they tell us a
little bit more about how, you know, how -- they provide a little insight in
terms of how spammers tend to behave.
So one of them, what we did actually was take the source and -- I should mention
the data that we used for this part of the study. This was work that we did in
collaboration with McAfee, who has mail filtering appliances deployed in
something like 8,000 different enterprise networks. These are globally
distributed. So this is biased, of course, by where they've got their mail
filtering appliances distributed, but this is just to sort of also kind of
paint a picture of an example of where the behavior may differ.
So one of the things that we saw, for example, is that about 90 percent of the
legitimate messages travel, you know, in a relatively close proximity. If you
look at the spammer behavior, actually, it's significantly more, you know, more
evenly distributed across distance.
Another thing that we looked at, and this sort of comes back again to the fact
that spam is being sent from compromised machines. We looked at how email is
being sent from different regions of IP address space. So again, to sort of
paint the intuition here, it's fairly unlikely that you would have a slash 24
network with 200 legitimate mail servers on it. Typically, you'd expect a
handful, at most. On the other hand, what we were seeing in the cases of spam
activity were these slash 24 networks or networks of such size where there
would be 200 email senders in fairly close proximity. You know, in that
particular slash 24.
It makes sense, when you think about how spammers use the infrastructure to
operate, right. They compromise a bunch of machines and then enlist them to
start launching these types of campaigns.
So what we can do is actually key off of that behavior to sort of design a
feature, a behavioral feature that allows us to distinguish spammers from
legitimate senders. In particular what we did in this case was to say, when you
see a piece of email sent, how far away in IP address space do you need to go
before you see the K next nearest senders. For a particular value of K, you
know, the smaller that IP address range or that space, IP space is, the more
dense that sending activity is. And that's essentially what we see here.
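The feature just described can be sketched in a few lines of Python: treat sender IPs as integers and measure how wide a range around a given sender you must scan to capture its K nearest other senders. The sender populations below are invented to illustrate the dense-botnet versus sparse-legitimate contrast.

```python
import bisect

def range_for_k_neighbors(sender_ips_sorted, target, k):
    """Radius of the smallest IP range around `target` containing its
    k nearest other senders (IPs as integers, list pre-sorted)."""
    i = bisect.bisect_left(sender_ips_sorted, target)
    lo, hi = i - 1, i
    if hi < len(sender_ips_sorted) and sender_ips_sorted[hi] == target:
        hi += 1  # skip the target itself
    dists = []
    for _ in range(k):
        left = target - sender_ips_sorted[lo] if lo >= 0 else float("inf")
        right = (sender_ips_sorted[hi] - target
                 if hi < len(sender_ips_sorted) else float("inf"))
        if left <= right:
            dists.append(left); lo -= 1
        else:
            dists.append(right); hi += 1
    return max(dists)

# Hypothetical senders: a dense block of bots vs. sparse legitimate servers.
bots = sorted(range(100, 300, 2))                # 100 senders packed together
corporate = sorted([10_000, 500_000, 2_000_000])  # a few far-apart senders
print(range_for_k_neighbors(bots, 200, 10))         # tiny radius: dense
print(range_for_k_neighbors(corporate, 500_000, 2)) # huge radius: sparse
```

The orders-of-magnitude gap between the two radii is the behavioral signal the classifier feeds on.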
>>: So I'm having trouble understanding the graph with the intuition, because
I've got to assume that Gmail, Hotmail and large companies have lots of --
well, large companies have lots of Outlook servers and Gmail and Hotmail have
lots of IPs. What is being sampled here?
>> Nick Feamster: So basically, the way to read this is like --
>>: What's the data point?
>> Nick Feamster: So yes. So the way to read this is how far -- this is how
far out an IP address space do you need to go to observe the K closest email
senders. So, for example if I take a particular IP address of a sender, I can
say --
>>: IP addresses, not emails?
>> Nick Feamster: The data is IP addresses here. IP addresses of senders. So
you could say, for example, in the case of spammers, to see the K closest -- to
see the K closest email senders, I need to go out, I need to capture sort of
20,000 or so IP addresses surrounding that IP address. I've got to go
basically to an order of magnitude more to see ten email senders if I start
from a legitimate sender.
Your question about web mail providers is an interesting one. I'd say that's
the exception rather than the rule. There are only a few of those and there
are a lot more email senders who are not those people. Because we're not
talking about volumes here. We're talking about activity.
>>:
So this is --
>> Nick Feamster: This is a mean.
>>: Why do I care about the mean?
>> Nick Feamster: Because, I mean, there's a difference here that's
clearly represented.
>>: Sure, but you can have a very high mean, but you can still have a bottom
10 percent that's a real problem. So if you have a lot of large companies, or a
reasonable number that have ten Outlook servers, then the ability to
use this as a heuristic goes down dramatically.
>> Nick Feamster: I would posit that the number of legitimately operated
networks that have more than a handful of mail servers for several hundred IP
addresses is not that many. I mean, your typical enterprise network, which is
basically what we're looking at here, because remember the data set we're
looking at. We're looking at mail from enterprise networks, you know, where
these spam filtering boxes have been deployed. We're not talking about Gmail
or Hotmail or Yahoo in this case. But if you look at sort of the typical
enterprise network or campus network or what have you, the type of place that's
likely to deploy a filter of this type, you're not going to have dense email
sending activity.
Take the Georgia Tech campus, for example. You will not find a slash 24
network on that campus with 200 legitimate mail servers. That's essentially
the intuition we're operating on here. Now, to your question about the mean,
you're very right, there are going to be outliers to this, and you've actually
identified one of them. So that is basically why, in the context of designing
a supervised learning classifier, you can't rely on just one feature.
So we don't look at means when we design the classifier. We look at this
particular feature as an input among many of the other features as well. So
obviously, you're not going to get it right every time. Just like in the case
of distance, you're not going to get that right every time either.
So the point here is basically to point out a general trend that is true a lot
of the time.
>>:
Why is certain spam higher than, like, other spam?
>> Nick Feamster: I expect that's kind of a labeling problem. Or it may be the
case that there's a lot more or a lot less of one of those categories. The way
that the data was labeled was actually sort of post hoc, semi-manual.
And we didn't do that. Those were actually labels that were given to us.
Okay. So that's just to paint a picture of a couple of the features that we
used. We used a whole bunch more that I don't have the time to talk about.
Once we put those into a supervised learning classifier, if we basically look
at the detection rate that something like Spamhaus gets, we get about a
four-tenths of one percent false positive rate, which
is still a bit too high to be practical. Most mail servers like to see about
one-tenth of one percent. We can play other games, like whitelisting the AS's
for which we get the most wrong answers, et cetera, et cetera, tune a bunch of
knobs to get that down to about 0.14 percent or so.
But this is basically, I'm giving you the number that's basically just taking
the features that we've identified and turning the crank. We can actually do
pretty well.
The features that I showed you, some of the features that I showed you also are
used by McAfee, who we worked with on this particular project in the mail
filter that they use in practice.
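The talk doesn't name the specific learning algorithm, so as a rough illustration of "features in, supervised classifier out," here is a toy Python perceptron standing in for the real classifier. The feature values (sender density in the surrounding /24, normalized sender-to-recipient distance) and the training points are entirely invented.

```python
def perceptron(train, epochs=20, lr=0.1):
    """Tiny perceptron: train is a list of (feature_vector, label) with
    label +1 for spammer, -1 for legitimate sender."""
    w = [0.0] * len(train[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in train:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:  # update only on mistakes
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Hypothetical behavioral features per sender, roughly normalized:
# (density of senders in surrounding /24, mean send distance)
train = [
    ([0.9, 0.8], +1), ([0.8, 0.9], +1), ([0.95, 0.7], +1),  # spammers
    ([0.1, 0.2], -1), ([0.05, 0.3], -1), ([0.2, 0.1], -1),  # legitimate
]
w, b = perceptron(train)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
print(score([0.85, 0.75]) > 0)  # dense, far-flung sender: spam-like
print(score([0.1, 0.15]) > 0)   # sparse, local sender: legitimate-looking
```

The point the speaker makes about outliers applies here too: no single feature decides the label; the classifier weighs many together.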
Okay. So finally, just to talk about coordination a bit. So I'm glad you
actually raised the point of web mail actually, because as you point out, this
is actually totally changing the game. In particular, as you mentioned, right,
that particular feature that we talked about may not apply. But in general,
the types of features that we studied in that work may not apply, because a lot
of them key off of IP addresses. But, in fact, IP black lists aren't going to
work either, right.
So there's an interesting thing that's going on, which is that now we can no
longer use IP addresses or the types of behavior that I mentioned. Well, what
can we use? We can use user input. We can use those like mark as spam button.
Well, so what's the next step in that game?
Well, so actually what we're seeing now is that spammers are sending mail to
themselves. You might think why are they sending mail to themselves? It's so
they can vote on their own messages. So they send mail to themselves and they
basically vote not spam on the messages that they see. So this is some work
that we did with Yahoo. In particular, over the course of about three months,
we saw about a million and a half not spam votes coming from accounts that
basically did nothing but vote not spam on anything.
So there's a fair amount of this activity going on. But the other kind of
interesting thing about this is that it doesn't take that much. Because the
cost of a false positive is so high, anything that basically gets a not spam
vote really tweaks the weights. So what we want to do, of course, is try to
detect those fraudulent votes. And what we can do is actually make some
observations about how those votes are being cast to try to distinguish like
fraudulent not spam votes from the legitimate ones.
This actually draws some inspiration from some work that's gone on here in
terms of detecting compromised accounts through coordinated activity, in
particular the bot graph work is quite similar to the observation that I'm
about to point out here.
But what we can do, actually, is take this voting problem and model it as a
bipartite graph where the, in gray here, we have the IP addresses of spammers
and legitimate senders and these are IP addresses that are being voted on. And
then we've got some user accounts that are actually casting the votes.
So what we can see when we sort of create this graph of activity is a couple
things pop out. One, of course, is that these compromised accounts tend to
cast a lot more not spam votes than, like, a legitimate user typically would.
But the other thing that really pops out is that spammer IP addresses, they
actually tend to receive not spam votes from many different compromised
accounts over here on the right side of the graph.
There's a circularity here because how do I know it's compromised before I
figure out that one of these guys is being voted on by a bunch of compromised
accounts. But you can break that circularity by clustering. We can basically
look at IP addresses over here that are being voted on and we can say are there
groups of IP addresses here that are being voted on by a similar group of user
identities or user accounts on this side and basically, we can create a cluster
based on the sort of observing the similarity of voting behavior there. So
that's effectively what we did. We applied sort of a graph-based clustering
approach to tease apart the user identities that vote in a similar fashion
across many different IP addresses.
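As a toy version of that clustering step, the sketch below groups voter accounts by the overlap (Jaccard similarity) of the sender IPs they vote "not spam" on. This is a simplification of the graph-based approach described in the talk, with an invented vote log.

```python
def jaccard(a, b):
    """Overlap between two sets of voted-on IPs."""
    return len(a & b) / len(a | b)

def cluster_voters(votes, threshold=0.5):
    """Greedy single-link grouping of accounts whose 'not spam' vote
    targets overlap heavily -- similar voting behavior suggests
    coordinated (compromised) accounts."""
    clusters = []
    for acct in votes:
        placed = False
        for cl in clusters:
            if any(jaccard(votes[acct], votes[other]) >= threshold
                   for other in cl):
                cl.append(acct)
                placed = True
                break
        if not placed:
            clusters.append([acct])
    return clusters

# Hypothetical vote log: account -> set of sender IPs voted "not spam".
votes = {
    "bot1": {"ip1", "ip2", "ip3"},
    "bot2": {"ip1", "ip2", "ip4"},
    "bot3": {"ip2", "ip3", "ip4"},
    "alice": {"ip9"},
    "bob":   {"ip8"},
}
print(cluster_voters(votes))
# the three coordinated accounts cluster together; alice and bob stand alone
```

This is how the circularity gets broken: the cluster of similarly-voting accounts is suspicious even before any single account is known to be compromised.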
Now, the approach that we started out with, and the same approach that bot
graph uses in their work to detect compromised accounts, is you can build a K
neighborhood graph. The idea there is basically you figure out instances of IP
addresses for which a particular user identity votes in the same way at least K
different times. And then you basically group user identities based on that
value of K.
The problem with that, actually, is that it can produce false positives. So if
you've got a group of good guys all voting in the same way and you've
identified some fraudulent voters here and you've got maybe one account in the
good guys that has some strange behavior, either because it's coming through a
proxy or maybe it happens to be compromised itself, something, then all of a
sudden, you basically take this whole cluster of good guys and you sort of lump
it in with the bad guys.
So the problem with just sort of applying just a straight K neighborhood
clustering is that false positive -- it's hard to keep a handle on the false
positives. What we can do to improve that approach is actually apply something
called canopy clustering. I'm not an expert in this area so I won't speak too
much to the details. But at a high level, what canopy clustering allows you to
do is it allows you to apply clustering in two stages.
You basically create these large, sort of larger groups of things to cluster on
and then you reapply k-neighborhood clustering inside those things that are
called canopies. That actually allows things to scale a lot better so that's
important here in this case where there's a lot of mail and a lot of senders,
but also you can keep a better handle on false positives. So this is actually
something that we worked on with the folks at Yahoo and this is something they
actually used to try to now detect these kinds of fraudulent votes.
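The first stage of canopy clustering can be sketched as follows; this is a toy Python illustration of the general technique, not the Yahoo deployment. A cheap distance metric forms loose, possibly overlapping canopies, and the expensive K-neighborhood clustering then only runs within each canopy.

```python
def canopies(points, dist, loose):
    """Stage 1 of canopy clustering: group points into (possibly
    overlapping) canopies using a cheap distance and a loose threshold.
    Stage 2 would run the expensive clustering inside each canopy."""
    remaining = list(points)
    result = []
    while remaining:
        center = remaining[0]
        result.append([p for p in points if dist(center, p) <= loose])
        remaining = [p for p in remaining if dist(center, p) > loose]
    return result

# Hypothetical cheap metric: difference in total "not spam" vote counts.
counts = {"bot1": 900, "bot2": 950, "bot3": 880, "alice": 3, "bob": 5}
names = list(counts)
cheap = lambda a, b: abs(counts[a] - counts[b])
groups = canopies(names, cheap, loose=100)
print(groups)  # heavy voters land in one canopy, light voters in another
```

Because the expensive similarity comparison now only runs within a canopy, a single oddly-behaving good account in a different canopy can't drag a whole cluster of good guys in with the bad ones, which is the false-positive control described above.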
>>: Couldn't you also break the symmetry by actually looking at messages and
seeing whether they're spam?
>> Nick Feamster: You could do that actually, I guess. So then effectively
what you're doing is bringing a human in the loop to sort of see --
>>: Or other techniques to analyze what's spam.
>> Nick Feamster: Yeah, you could potentially do that as well. I think that's
a totally reasonable approach. You do have to look at content and there's some
cost to doing that, but that's probably doable. Or you could certainly do
other things too, look at a spam score based on other features ahead or what
have you.
So you could look at content. That could probably work okay. Once you sort of
get away from content, you get into this mess of now everyone's sending email
from Gmail to Gmail and Hotmail to Gmail and Hotmail to Yahoo and so forth. So
a lot of the features that we typically key on that relate to the IP address
and headers and stuff, they no longer work so you kind of have to dig into
content to really take the approach that you're taking, but I think it's
reasonable.
So just a time check, actually. I know we have like a 90-minute slot so how do
you want me to sort of proceed? I mean, I can finish anywhere from three
minutes to 20. Whatever you --
>> Jaeyeon Jung: You can continue with what you have.
>> Nick Feamster: So what I'll do is I wanted to spend a few minutes talking
about this other problem, which is maintaining openness, in particular enabling
or sort of facilitating communication in the face of a censor who would wish to
sort of disrupt this communication. So this is a problem, of course, that's
sort of enjoying increased prominence and sort of at a high level, I'll
describe to you what the problem is.
Of course, Alice wants to talk to Bob. There's a censor in the middle who
would either wish to block that traffic or, worse, potentially punish Alice for
attempting to talk to Bob. So I think that's where things actually get a
little bit different than kind of the conventional types of things that we've
seen in this area, because not only do we want to allow Alice to talk to Bob,
but we also want to potentially conceal the fact that she's trying to do so in
the first place.
The general approach to allowing this, to facilitating this communication is to
use some type of helper where while Alice and Bob may not be allowed to talk
directly to one another, Bob may be able to communicate with some kind of
helper, and maybe Alice can talk to that helper as well. So the communication
between Alice and that helper is somehow permitted.
The idea, then, is to basically use this point of indirection to allow Alice
and Bob to send each other messages.
So the challenge here -- and this is sort of a studied problem. The most famous
helper, I think, is a mix net called Tor. But the challenge there is,
when using something like Tor, it's fairly easy to hide what you're getting, and
sometimes it can be easy to sort of break through censors using those
techniques.
If someone happens to be looking, if the censor happens to be looking at that
traffic, it can be very hard to hide that you're doing, that you're actually
performing that kind of activity. So one of the things that we've been doing
over the years is actually trying to design communication techniques
that defeat censorship that are also deniable. In other words, that disguise
the fact that Alice is using this kind of technique in the first place.
So what I'm going to do is actually talk about a particular system that we've
designed to achieve that goal. There are a number of things we want to achieve
in the design of the system. One, of course, is we want to thwart disruption.
We want to make it difficult for the censor to disrupt the communication. In
order to do that, we use a combination of sort of redundancy techniques and
hiding.
The other thing we want to do is make that activity, the act of Alice fetching
that content or communicating with Bob, we want to make that look innocuous and
there we'll steal some techniques from distributed systems. Finally, what we
want to do is if a censor's watching the communication between Alice and Bob,
we want to make it less obvious that they're talking to one another. What we
want to do is decouple the sending and the receiving in messages.
In the real world, what this might look like is I want to send you a message,
but I know someone's watching and I want to make it not so obvious that I'm
sending you a message. So what I do is I put the message in a paper bag under
the bridge and I tell you to go look there at some later point and pick it up.
So to someone who's observing, they may not notice any correlation between
who's dropping off the message and who's picking it up.
So with this system I'm going to just briefly describe to you is essentially
paper bags under a bridge for web 2.0. Effectively, what we do is use
user-generated content sites to allow Alice and Bob to communicate with one
another.
So what Bob is going to do in this case, Bob, we'll just say he's a Flickr
user. Flickr itself may be blocked. That's fine. I'm using Flickr for the
sake of example and also because it's what we built our prototype on. But this
could be any site that hosts user-generated content.
So what Bob is going to do is take his message. He's going to actually sort of
embed it in some kind of content, whether it be an image, a video,
something like that. And he's going to post it on a user-generated content
site. Alice is basically going to retrieve that content, and to the censor,
this is going to look like Alice is looking at videos of cats or, you know,
vacation photos or something. When, in fact, what she's really interested in
is that thing that's hiding inside the cover.
So let me just describe to you in just a little bit of detail how that works
and then I'll dive into a couple of challenges and wrap up just probably five
or ten more minutes.
So Bob is going to take his message that he wants to send to Alice and let's
assume, actually, that there's some message identifier that either Alice and
Bob have agreed on or Alice already knows. So a message identifier might be,
for example, a URL, and the message that Bob is trying to put in might be the
web page corresponding to that URL. Or if this is a particular message that
Bob wants to communicate to Alice, the message ID is something they would have
had to agree on magically somehow out of band. So there's a bootstrapping
step that I'm hand-waving over here.
But let's assume that there's an identifier associated with that message. Bob
is going to take that message. He's going to take his cover traffic, in this
case maybe a picture. He's going to embed the message in that picture in that
cover traffic. He's going to upload it to some user-generated content site,
like the drop site, if you will, and Alice is basically just going to reverse
this process to retrieve her message. So that's the high level picture, right.
And there are a few challenges to making this work. One is figuring out how to
embed the message, which actually I'm going to sort of skip over, because the
techniques that we use here are fairly straightforward. We want to basically
do things in such a way that it's hard for the censor to discover and also hard
to disrupt.
So we can use sort of standard image hiding techniques to make discovery
difficult and we can use sort of more redundancy style techniques and
redundancy and erasure coding techniques to make disruption tough.
What I'm going to spend a little bit of time on is these latter two challenges.
In particular, how does Bob figure out where he should put this thing. Like
what content should he put it in? Where should he drop it so that Alice can
find it? What we want to do is make the process of Alice fetching the cover
traffic deniable, right. Something that she would do anyway so if the censor's
watching, this would look just sort of normal. Okay. So where do we embed this,
right? Of course, Alice could go looking everywhere. She could just download
all of Flickr and look at every picture and see is my message here, is it
there? No, it's not there. This is not an option, right, for a variety of
reasons. So Alice and Bob somehow have to agree on some subset of content
without immediately communicating with one another. They've got to do this, as
I mentioned, in a way such that when Alice does this, it's deniable.
So here's basically how we create that deniable embedding. What we do is we
take these message identifiers, let's say a URL, right, and we put this
basically into some ID space. What we want to do then is identify some kind of
tasks that Alice would perform anyway. You pick some things that she would do,
right, look at Bob's vacation photos or watch videos of cats or, you know, look
at pictures of blue flowers or something. We put this in the ID space as well.
What we do then is we sort of map the message identifier that corresponds to
that content to the tasks that Alice would need to perform to retrieve the
cover traffic which contains that content. For example, these tasks might be
something like, as I mentioned, right, search for blue flowers or look for a
particular set of images or videos.
By doing those particular things, which she's likely to do anyway, then she's
able to get the stuff that she really cares about. Okay.
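The mapping step above can be sketched as a deterministic hash from message identifiers into a shared task dictionary. Everything here is hypothetical, including the task list and the agreed identifier "129" used later in the talk; the point is only that Alice and Bob derive the same innocuous tasks independently.

```python
import hashlib

# Hypothetical shared dictionary of innocuous tasks, agreed out of band.
TASKS = [
    "search Flickr for 'blue flowers'",
    "browse Bob's vacation album",
    "watch the day's top cat videos",
    "view photos tagged 'sunset'",
]

def tasks_for_message(message_id, n=2):
    """Deterministically map a message identifier into the shared ID
    space: both sides derive the same fetch tasks from it, with no
    direct communication needed."""
    digest = hashlib.sha256(message_id.encode()).digest()
    return [TASKS[digest[i] % len(TASKS)] for i in range(n)]

# Bob embeds his message in content reachable via these tasks;
# Alice derives the identical tasks from the agreed identifier.
print(tasks_for_message("129"))
```

Because the tasks are things Alice plausibly does anyway, her fetches look like ordinary browsing to a censor watching the wire.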
So as you might imagine -- so that's basically the general idea.
As you might imagine, this does not perform super quickly. This is basically
good for things like publishing an article or sending a message. Depending on
how deniable you want this to be and how aggressive Alice is at fetching all
kinds of stuff or how quickly she does this, this can take on the order of
minutes to grab a fairly small message. But presumably, that's good enough for
certain types of communication.
Now, figuring out how to make this type of communication more realtime and yet
deniable remains an open challenge. I want to spend a couple minutes closing
up, talking about this last challenge.
>>:
Does Bob have to know something about Alice's activities?
>> Nick Feamster: Alice needs to know something about where Bob is going to
put these things. The way they do that is you can kind of view this mapping of
tasks to identifiers as like a dictionary of sorts. So the thing that they
both agree on, which is where the bootstrapping has to occur is that common
message identifier. So Bob is going to put stuff in a certain place based on
that and Alice is going to fetch it, based on that.
So that's sort of the common language that they have to speak in order for that
to work.
So I just want to -- yes?
>>: If they can bootstrap, why don't they use the same message mechanism to
just exchange the message?
>> Nick Feamster: Presumably, like the bootstrap mechanism is going to be
smaller than the message itself. But you're right, you potentially need
another mechanism to pass that bootstrapped information. And if you had a
perfect bootstrapping mechanism, you wouldn't need such a system in the first
place. But the idea here would be that you might not need to pass that over
the network at all. So, for example, let's say that we meet in person, right,
in a dark alley or something. And I say, hey, the message ID that I'm going to
first send you a message on is, like, 129. So now we're good, right.
So based on that, now you can maybe fetch bigger messages. I could even send
you a new set of message IDs or even a new mapping once we've got the initial
bootstrap. I think that's the trick there is size.
Okay. So just a couple of minutes talking about this last challenge, which
I'll just pose, like, pose a position for you. I think we've seen in sort of
recent times a lot of governments restricting communications in various ways.
So, for example, we saw with the elections the blocking of
Twitter, et cetera; we saw the Egyptians completely shut down the network. I
would posit that as governments get more savvy about how to use the network,
they're not going to shut it down but rather use it to manipulate the
information that we see. Or that citizens see. Because why would you shut it
down if you could use it to sort of influence public opinion or tilt the
outcome in your favor.
So I think basically what I see as an ongoing challenge is manipulation of
content and I'm going to talk briefly for a couple minutes about a more sort of
benign version of manipulation. But something that I think is still relevant,
and that's personalization. So in the best case, right, we're seeing many,
many organizations take our activities, our preferences, et cetera, and use
that information to sort of whittle down what we see. If we search for shoes,
if we search for books or network, whatever, our past behavior activities, et
cetera, dictate the types of things that we're likely to see, based on those
interests.
Now, in the best case, right, we're seeing things where -- we're seeing results
that potentially are already tweaked towards things that we already agree with
or are already aligned with our own tastes and interests. And one could argue
that's good. There are certainly positives to that.
But there is a flip side to that as well, which is that we don't have control
over how that's being done, and we actually don't even know sometimes what
we're missing. So one of the things that I'm looking at now is how to provide
users better visibility and control into how that sort of restriction or
filtering is being done, if you will.
I'm going to describe, this is very much a work in progress. I'm going to take
like a minute or two to describe kind of the low-hanging fruit that we're doing
here. So we've started off looking at search. In particular, when you search,
the question is, what are other people seeing when they search for the same
term? Or you might actually want to run the same query in different ways,
maybe as different personas or something like that.
So we're starting off with something very simple, which is to say take a query,
and then run it from different geographies as different users and basically see
what turns up, what shows up on the first page. Where does it show up? What
doesn't show up? When it doesn't appear, can we explain those things?
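The comparison that tool performs boils down to diffing result sets across vantage points. Here is a minimal Python sketch of that core step; the vantage-point names and result URLs are invented, and the real tool of course also fetches the live results and ranks the differences.

```python
def compare_results(results_by_vantage):
    """Given top-N result lists from several vantage points for the same
    query, report which URLs each vantage point never saw, relative to
    the union of everything observed."""
    union = set().union(*results_by_vantage.values())
    return {v: union - set(r) for v, r in results_by_vantage.items()}

# Hypothetical top results for one query from two locations.
results = {
    "seattle": ["u1", "u2", "u3"],
    "atlanta": ["u1", "u4", "u3"],
}
print(compare_results(results))
# seattle never saw u4; atlanta never saw u2
```

Surfacing that "missing" set is exactly the visibility the tool aims to give users into personalization.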
Okay. So in particular -- I had Ratul in here; I used the wrong slide. Well,
let's use Arvind. I was going to pick on Ratul, since he wasn't
here. So I'll give you a different example. So if we search for Arvind, a
professor at the University of Washington, you'll basically get a bunch of
search results. But what this chrome program we've built does is also tell you
other things that you didn't see. Other search results that came up in a
particular user's top ten depending on where that search was run.
So right now, we've basically built this tool, which is called Bobble, that
allows you to basically see search results from other types of perspectives.
In particular, we're basically just focused on geography so far with the
tool. But as we work on this, we're expanding it to work on sort of how these
queries differ based on your past search history and other types of contexts
that you might imagine.
So ultimately, this is basically just the first step, because we're looking at
how to improve visibility.
Ultimately, though, you might imagine a tool where a user actually provides
some feedback or control into the types of results they see. So an example of
that might be, I might want to run a query as a particular persona or as a
particular group. So, for example, the results that I see when I search, say,
for example, Seattle, right, I might want the results to differ based on
whether I'm querying as a food connoisseur or marathon runner or networking
researcher or what have you. I might want to see different types of things.
So that's the type of direction I see this work going.
So just in conclusion, I've talked about how different parties are vying for
control of information on the internet in a variety of different ways. I've
spent probably the majority of the time talking about work that we've done to
combat message abuse, but I've also talked about other problems related to
censorship and, just more recently, about how different tools and algorithms
may ultimately be used to manipulate the types of information that we do or
don't see, and I'm working on ways to sort of tilt the balance back into the
hands of the user as well.
So thank you very much.
>> Jaeyeon Jung: Any quick questions?
>>: So message filtering has had a long history of just cat and mouse and cat
and mouse. Particularly the content-based filtering seems to have merely
trained the population to get, you know, [indiscernible]. Are there grounds for
optimism that there are intrinsic things in the network-level approaches that
you described that would give us tools that are beyond the reach of their
ability?
>> Nick Feamster: Yeah, this is something that I think requires -- it bears a
little bit more formal study, but I would say informally, we'd like to look for
features that are more costly for an adversary to adapt to. If we look at
content, for example, it's fairly low cost to change the way a message gets
encoded or embedded.
If we look at network-level features, adversaries can still adapt. In
particular, if we look at the behavioral features that I showed, you could
certainly send spam, you know, in a less dense way. But presumably, I mean,
lurking behind that, we assume that there is some cost to adaptation. And --

>>: On both sides?
>> Nick Feamster: On both sides, right. To take that example, in particular,
if you were to say that you can only send spam from, you know, a certain number
of IP addresses within a range before you sort of trip some detectors, then
presumably you've imposed some kind of cost, maybe in reduced volume, certainly
for a region of space.
I don't know how to model that cost. I think it would be a very interesting
problem to sort of try to figure out, okay, now can we actually model an
adversary. On the flip side, actually, I think on the censorship side, as
well, you could ask the exact same question. It's like, yes, but how do we
know that the censor isn't going to adapt to try to detect the techniques that
I showed?
And I think, again, there's like a really interesting question in trying to
model the capabilities of the adversary. In theory, the detector in that case
is unbounded. It could do just about anything. It could look at the fact that
you never send mail or you never browse the web at 3:00 a.m. on Sundays.
>>: It's not an economic motivation, so it's harder to say what level
[indiscernible] go to greater lengths than the spammer [indiscernible].
>> Nick Feamster: Exactly, um-hmm. It is potentially a tougher thing to
model, particularly if you think -- if you view the government as having
unbounded resources to throw at the problem, then it certainly is trickier.
You could potentially do the economic model if you talk about, instead of a
government with potentially unbounded resource, if you talk about, say, folks
who are interested in DRM, for example, if they have a certain amount -- I mean,
there are economics at play in those kind of situations.
>>: [inaudible].
>> Nick Feamster: Seemingly, yeah. But I think on either problem that you
look at, there's sort of like more work to be done in terms of modeling the
adversary.
>> Jaeyeon Jung: Okay. Let's thank Nick one more time.