>> Srikanth Kandula: It's a pleasure to introduce John John. John John is from the
University of Washington where he was advised by Tom Anderson and Arvind Krishnamurthy.
His interests lie at the intersection of security and networks and distributed systems. He's
done quite a bit with bots and understanding botnets and how they work, which is what
he's going to talk about. I will let him go ahead.
>> John John: Hi. Hello. I guess the mic is working so, I was told that this is a 90
minute slot, so I tried to make sure that I use up all of the time.
[laughter]
>> John John: I am just kidding. All right. Now that I've got you all sweating, let's start.
Good morning, I am John from the University of Washington, advised by Tom Anderson
and Arvind Krishnamurthy. Today I am going to talk to you about understanding malware on
the internet, botnets and all the other bad stuff that goes on. Let me start by saying that,
you know, the internet is not a very safe place today. You don't want your kids roaming
around alone there. There are a lot of prevailing problems. First there is spam. There
are over 100 billion e-mails a day. Everyone gets it; everyone hates it, estimated
productivity loss of billions of dollars a year.
And then you have denial of service attacks. And these are becoming more and more
powerful each day. A couple of years ago you had one strong enough to take out a small
country. And more recently we've seen attacks against big financial institutions, like
PayPal and MasterCard, which cost them real dollar losses. And another problem, in
today's ad-driven internet economy, is click fraud. Recent studies have shown that nearly
20% of all ad clicks are fraudulent and that is something that people want to really fix.
And there is also other affiliated bad stuff that happens like phishing, so this is where
they send you a page that looks like your Bank of America logon page, you enter your
username and password and they steal all your money. And this again cost hundreds of
millions of dollars in losses a year.
So what is the common thread here? It turns out that the common thread for all of these
problems is usually botnets. They provide the underlying infrastructure which allow
hackers and bad guys to carry on their activities with impunity. And what exactly do I
mean by botnet here? So a botnet is a network of compromised computers that are
controlled by an attacker. And the attacker has complete control over these machines,
can use them to send whatever messages he wants and participate in any malicious activity of
his choice. And since they are such a big part of the internet malware infrastructure,
understanding botnets is kind of a necessary first step to understanding these threats and
combating them.
Unfortunately botnets are not really well mapped out today, so Vint Cerf would say that
approximately a quarter of all computers connected to the internet are infected and
malicious. Now that's probably not true, but who knows? And at the height of the Storm
fervor in 2007, there were a bunch of news articles in September which said the Storm
botnet has 50 million nodes; it's the end of the world. A month later you had researchers
actually measure and understand and they revised it to a more reasonable estimate of
20,000 machines. So it wasn't too bad, and life still goes on. So what this shows us
is that there is a lot of confusion and fear and uncertainty and doubt regarding these
botnets, you know? The sizes, who is participating in them, what do they do, what are
their operations? And our idea was to kind of throw some light on this to understand
botnets better. So it turns out understanding botnets is really hard. And the reason for
for this is that these malware authors take special steps to make sure that you don't
easily understand how they operate. They make use of sophisticated techniques for
evasion, and reverse engineering is difficult because these bot binaries are obfuscated,
primarily to prevent this sort of [inaudible] engineering. And manual analysis is very
difficult; it takes a lot of time.
And as you can see the poor grad student has to work hard to manually analyze these
binaries and this approach does not really scale well, at least, unless you have a lot of
grad students. So essentially the problem is that there is a lack of a
comprehensive botnet monitoring platform, something which would make life a lot easier,
right? And so that brings us to our goal. Our goal here is to build a system which can in
a timely fashion with minimum human interaction monitor botnets and their propagation.
So I have highlighted a bunch of terms here and let me just tell you about each of them.
So we want this to be done in a timely fashion. Botnets are constantly evolving and
changing, and information about them becomes less valuable the older it gets.
The information degrades quickly, so we need our results pretty quickly. We want
to scale so clearly we cannot have the human completely in the loop so we want to
minimize the amount of human interaction required. And finally, you know, we want to
have a comprehensive system. We want to monitor not just botnets and their activities
but also the complete lifecycle, how they propagate and all the details regarding them.
And this would give us information which can help combat attacks in real-time. And I
will give you examples of this as we move along with the talk.
So as I mentioned, there is a need for a comprehensive monitoring platform. We want to
monitor the complete lifecycle of the botnet. And this figure here gives you a rather
simplistic view of the botnet lifecycle. So you have bots which perform botnet activity,
which are all the malicious things I mentioned. And they also try to infect new hosts
which again participate in more botnet activity, a rather simple lifecycle. So in order to
study what bots do and how they operate, we first started with Botlab. The focus of
Botlab was to study the activity, the communication patterns, how they are organized and
how do attackers control these millions of machines. And as a part of our study one of
the things that we found was that the second step of infecting the host, is in fact a rather
complicated and more involved step.
So typically, traditionally malware used to spread through taking advantage of
vulnerabilities in applications or operating systems and compromising the machine. But
as these operating systems and applications get more and more secure there is definitely a
push towards better security, it becomes harder to exploit these loopholes. And so
attackers have been moving towards a simpler approach. So the thing is even if your
operating system is really secure, you still have the human who is probably the weakest
link in the chain. And a naïve user is very likely to click on any link and install any
application you ask him to. And so there has been a trend towards having more social
engineering attacks in order to infect hosts and spread malware. And you just need to get
someone to click on the link, and these links are spread through e-mails, instant
messengers and even search results. Again more details shortly.
So one of the things we found in our research was that these links which are used to
spread malware typically point to regular, legitimate web servers that have been compromised.
So these are regular websites which have been compromised and are being used to host
malware. So this brings us to the question of, you know, how are these web servers
actually getting compromised? So we wanted to study this as well in order to make it
more difficult for attackers to go about doing this. And a step before this is in order to
compromise these web servers, you must first find vulnerable web servers. So the
question here is how do attackers go about finding such vulnerable web servers and then
compromise them?
So our goal was to kind of study this entire botnet ecosystem, study the lifecycle, come
up with defenses. And this brings us to the contributions. So for each of these various
steps we came up with systems which would measure, understand and come up with
defenses. And these have been published over the last few years at NSDI, USENIX Security and more
recently [inaudible] and WWW. So let me give you a brief outline of what each of these
things does, so this SearchAudit which studies how attackers go about finding vulnerable
websites, web servers on the internet. And then we have heat-seeking honeypots, which
take it one step further. Once attackers find these vulnerable web servers, how do they
go about actually compromising them? Our honeypots were built to study that. And
once the servers are compromised and used to host malware, attackers often spread
these links through search results, so the goal of deSEO was to study how these malware links
are being spread through search results. And finally we started off with Botlab which
gave us an initial picture of how botnets operate and what activities they participate in.
Let me start with Botlab, which essentially kick started our research into bots. Yes?
>>: [inaudible] you said there were four objectives, three objectives to real-time monitor
[inaudible] propagation. Based on what I see from the slides they are not finding long
term activities [inaudible] bot has skill [inaudible] study bot behavior [inaudible] and the
mainstream attacks are mostly off-line, study off-line?
>> John John: Not necessarily off-line.
>>: Is it like real-time monitoring, all this?
>> John John: Reasonably real-time, as in on a daily basis. So these are things which
run on say web server logs on a daily basis and come up with more information regarding
how these botnets operate. So it gives us some idea as to how these things operate in real
time, but as a research prototype it is not necessarily real, real time; it is more on the order of real time. Okay, I am going to start off with Botlab which was a project
which kick started our involvement with bots. And the stuff that we learned here
essentially led to the other remaining steps of understanding how bots operate. So what
exactly is Botlab? So let's consider botnets in the wild. You have millions of infected
machines, and each of these botnets talks to a command-and-control server. That is,
different botnets have different command-and-control servers, which give them
instructions on what to do. So these botnets talk to the servers and the server tells them
who to send spam to, which website to attack and things like that.
So the goal of Botlab was to kind of take a small-scale version of this and study this
botnet ecosystem in a locally contained environment. So we have captive bots which we
are running in our virtual machines in our contained environment. We study who are the
command-and-control servers they talk to. What are the instructions they get; what kind
of activities do they participate in. What spam do they send? And essentially get a feel
for what bots look like in the wild. And our approach to scaling this sort of botnet
analysis was to automate it. And here what we do is eliminate the manually
intensive task of reverse engineering these bot binaries; we use a black box approach.
And what is a black box approach? We essentially execute the binary and study the
external behavior. And in order to do this in a scalable fashion we needed to automate
this process of finding and executing these bot binaries.
So what does Botlab need to do? Well first and foremost we need to continuously find
and incorporate new bot binaries. As I mentioned they keep evolving; they keep
changing, so we need to keep track of them as they change. Then we perform some sort
of initial analysis to pick interesting binaries. There are lots of malicious binaries out
there, but we wanted the ones out there that we are currently interested in and in this case
we are looking at botnets that send spam. So we want some initial analysis to select these
interesting binaries. And finally we need to execute these binaries safely and collect the
data that we want.
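To make those three steps concrete, here is a minimal sketch, not Botlab's actual code, of what such a pipeline could look like; the helper functions (run_sandboxed, is_duplicate, is_spambot) are placeholders for the components described in the talk.

```python
# Illustrative sketch of the three-stage pipeline described above.
# All helper functions are hypothetical placeholders, not Botlab internals.

def botlab_pipeline(binary_sources, run_sandboxed, is_duplicate, is_spambot):
    """Continuously gather binaries, keep the interesting ones, and monitor them."""
    monitored = []
    for binary in binary_sources:                 # 1. find/incorporate new binaries
        profile = run_sandboxed(binary)           # 2. short run to get a behavior profile
        if is_duplicate(profile, monitored):      #    drop repacked copies of known bots
            continue
        if is_spambot(profile):                   #    keep only the spam-sending bots
            monitored.append(profile)             # 3. execute long-term and collect data
    return monitored
```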
And the Botlab architecture is essentially the same three steps but in a diagrammatic
form. And the first step of the pipeline is obtaining bot binaries. How would you go
about obtaining malware? The traditional way of collecting malware is through
honeypots. So we had set up--so what is a honeypot here? The idea is very simple. You
take a new machine, install an unpatched operating system, preferably Windows,
connect it to the internet and wait for 5 minutes, or maybe go get a coffee. 5 minutes later
you are infected and you have a bot. You rinse and repeat this process and you collect
a whole range of these malware binaries. And in a mere two months period, we collected
nearly 2000 such binaries from honeypots. And unfortunately we did not find any of
these spamming bots, which are a more recent variety. And most of the botnets that
infected our honeypots were traditional IRC botnets, which are on the decline.
And the reason for this was that this new generation of malware spreads through social
engineering more than by exploiting vulnerabilities. To give you an example, here is a
social engineering attack. You might receive, say, an e-card for Valentine's Day that
says you have received an e-card, please click on this link to view the card. You click
on the link, it installs a binary, and you've essentially got an infection. Another example
you would find is if you visit a site and you want to view a video, you occasionally find
these little pop-ups that say that your current version of flash is outdated please click on
this link and install it to view the video. You click on the link and you've just become the
latest member of a new botnet.
So the point here is that our passive honeypots cannot capture these sorts of attacks so we
need to augment our honeypots with active crawling. Right, and for this we essentially
need to emulate a user. And fortunately we just need to emulate a naïve user who clicks
okay on everything and we get our set of new binaries that we want to deal with. So what
is our source for getting these binaries? Botlab gets a constant feed of spam from the
University of Washington and this is on the order of 2 1/2 million e-mails a day, 90% of
all the e-mail that comes to the University is in fact spam. And nearly 1% of these URLs
point to malicious binaries, malicious executables and drive-by downloads. We crawl
these links and fetch the binaries. In addition to the spam we also get binaries from
public repositories of malware and public honeypots. So this kind of fills in the first step
of our picture which is how do you obtain these bot binaries?
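As an illustration of the active-crawling step just described, here is a rough sketch of pulling URLs out of an incoming spam feed and fetching anything that looks like a Windows executable. The regular expression, the MZ-header check, and the output directory are illustrative assumptions; a real crawler would also emulate a naive user clicking through installers and drive-by pages.

```python
# Sketch of harvesting candidate malware URLs from a spam feed (assumed format:
# an iterable of raw message strings) and fetching the binaries they point to.
import os
import re
import urllib.request

URL_RE = re.compile(r'https?://[^\s">]+')

def extract_urls(spam_messages):
    """Pull every link out of the day's spam feed."""
    for msg in spam_messages:
        yield from URL_RE.findall(msg)

def fetch_candidate_binaries(urls, out_dir="binaries"):
    """Download anything that looks like a Windows executable (PE 'MZ' header)."""
    os.makedirs(out_dir, exist_ok=True)
    for i, url in enumerate(sorted(set(urls))):
        try:
            data = urllib.request.urlopen(url, timeout=10).read()
        except Exception:
            continue                    # dead link, timeout, etc.
        if data[:2] == b"MZ":
            with open(os.path.join(out_dir, f"sample_{i}.bin"), "wb") as f:
                f.write(data)
```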
And the next step is to analyze the binary. So at this stage we have thousands of these
binaries which we obtained and we want to do two things. First we want to select
spamming bots to identify which bots send spam. And secondly we want to eliminate
duplicates. Unfortunately a simple hash is insufficient to detect duplicate binaries.
This is because malware authors frequently repack their binaries to
escape this sort of hash-based detection. So what we do instead is to kind of generate a
behavioral fingerprint of each binary. So we execute the binary for a while and we log all
the network connections, so we see which IP address and port, and how many packets they
sent. And now we do this for all of the binaries, and we compare any two binaries, and if
they have a similar network fingerprint then we say that they are duplicates. And also,
if we see them trying to make connections to port 25, we know that they are trying
to send spam and they are interesting spam bots. So at this step we have interesting
binaries that we want to monitor.
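Here is a minimal sketch of the behavioral-fingerprint idea just described: log each binary's network connections during a short run, treat two binaries as duplicates if their connection sets are similar, and flag port-25 activity as spamming. The feature set and the similarity threshold are illustrative assumptions, not Botlab's exact values.

```python
# connections: iterable of (dst_ip, dst_port, dns_name, pkt_bytes) tuples
# recorded while the binary ran in the sandbox.

def fingerprint(connections):
    """Behavioral fingerprint: the set of endpoints the binary talked to."""
    return {(ip, port, name) for ip, port, name, _ in connections}

def is_duplicate(fp_a, fp_b, threshold=0.8):
    """Call two binaries duplicates if their fingerprints mostly overlap (Jaccard)."""
    if not fp_a or not fp_b:
        return False
    return len(fp_a & fp_b) / len(fp_a | fp_b) >= threshold

def is_spambot(connections):
    """Any attempt to talk SMTP (port 25) marks the binary as a spam bot."""
    return any(port == 25 for _, port, _, _ in connections)
```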
So the final step, the third step, is to actually execute these binaries and collect data.
So here we have an interesting trade-off between safety and effectiveness. On one
extreme we can say you know we are going to let Botlab send out any traffic. This would
be really effective because these bots get to communicate with their control servers, they
get to do whatever else they need to do but the flipside is that you are now adding to the
bot population and if you end up infecting some of the machines you've got all these legal
issues to deal with. The other extreme is to say we are going to make this a completely
contained environment and these bots are not allowed to make any external connections.
Now this is not going to be effective because as I mentioned earlier these bots need to
talk to their control servers, or C&C servers and unless they talk to these C&C servers
they don't really know what to do.
So in our case we decided to pick a middle ground where we say things like: traffic to
known vulnerable ports is dropped; traffic to privileged ports is dropped; we place
limits on connection rates and data rates so that the bots don't participate in a needless attack.
And since we are dealing with spambots, we don't want them to send a lot of spam, so all
the spam that they attempt to send is directed to a fake mail server. And at this step we
have a system which finds bot binaries, picks interesting ones, and executes them. The
bots are happy because they get to send tons of fake spam, and we are happy because we
get to see what they are trying to do. Now most of the bots run fine… Yes?
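Here is a sketch of what that middle-ground containment policy might look like in code. The port lists, rate cap, and sinkhole address are made-up values for illustration; the talk only states the policy in general terms.

```python
# Illustrative outbound-traffic policy for the containment environment.
KNOWN_VULNERABLE = {135, 139, 445, 3389}     # example ports with known exploits
ALLOWED_LOW_PORTS = {25, 80, 443}            # SMTP (redirected) and HTTP(S) for C&C
FAKE_MAIL_SERVER = ("10.0.0.5", 25)          # hypothetical internal spam sinkhole
MAX_CONNS_PER_MIN = 100                      # arbitrary rate limit

def route_connection(dst_ip, dst_port, conns_this_minute):
    """Return (action, destination) for an outgoing connection attempt."""
    if dst_port in KNOWN_VULNERABLE:
        return "drop", None
    if dst_port < 1024 and dst_port not in ALLOWED_LOW_PORTS:
        return "drop", None                  # privileged ports are dropped
    if conns_this_minute > MAX_CONNS_PER_MIN:
        return "drop", None                  # don't participate in an attack
    if dst_port == 25:
        return "redirect", FAKE_MAIL_SERVER  # spam never leaves the lab
    return "allow", (dst_ip, dst_port)
```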
>>: [inaudible]
>> John John: Yes. We only route traffic to known privileged ports, and most of these
bots in fact are [inaudible] to HTTP. So we do allow traffic to go through, go to the
command-and-control service. Yes?
>>: What [inaudible] connection may compromise the Web server?
>> John John: That is something which is difficult to do and that is one of the reasons we
decided that this might not really be feasible in the long run. For the purposes of the
study we did look back and see that it only contacted actual C&C servers, but it is
possible that it could masquerade as a C&C connection and try to compromise a Web
server.
>>: How do you draw basically [inaudible] because typically the connection runs on
[inaudible] so how are you going to recommend what to connect…
>> John John: So we only drop all connections to privileged ports. So any port that it
tries to connect to, say port 3389 for remote desktop servers, we drop that, and vulnerable ports,
ports with known vulnerabilities, those are dropped. So that would mean that we end up
missing a few bots which might require this in order to get active.
>>: So it seems that when you are doing this [inaudible] you are making some
assumption about what kind of activity they are doing. What kind of behavior, what kind
of rules. Wondering if there are bots that are not [inaudible] and your description
actually will make the bots unusable or [inaudible]
>> John John: So in that case we end up not seeing these bots. We only were able to
capture bots which kind of fit into some particular set of rules which we thought to be
reasonable.
>>: Among all the bots that you catch, what's the fraction of them that remains that
conform to the rules you set?
>> John John: In terms of spamming bots, we are finally able to run around 11 or 12 of
them.
>>: And this is out of how many?
>> John John: That we are not really sure because the others don't get to send spam. So
we find 11 bots that actually send, 11 different botnets that actually send spam. And we
found that nearly 80% of all the spam that comes into the University is
from these 11 botnets, actually only seven different botnets. So we do see a
reasonable coverage in terms of the botnets that actually send spam.
>>: I wonder if there is a better way for you to, beyond restricting the properties so
heavily at the same time [inaudible] much better coverage [inaudible] just redirect all the
mails to some fake IP address [inaudible]
>> John John: But you would still need to figure out, for each botnet, what the... the
information would not be really high fidelity, because now the bot is not necessarily
sending spam to the people the controller wants it to send spam to. Or the kind of spam
that it is sending would be pretty different. So there is definitely a trade-off between
fidelity and safety, and that is a somewhat difficult line to figure out.
>>: I might've missed this, but are you associating spam to [inaudible] to a known
[inaudible].
>> John John: I haven't yet mentioned that so that is kind of the next part of the talk
based on what we study from these bots, how can you gain additional information about
them? So I will talk about that in a couple of slides. So one of the things we found here
was that most of the bots ran fine but in some cases we do need to do some manual
tweaking in order to get them to run. So here is an example. One of our bots, Mega-D,
actually verified that it was able to send e-mail before it got activated. So when the bot is
running it would send a test e-mail to a specific mail server which is controlled by the
attacker. That would return a special code which is the activation code and the bot
needed to send out this activation code to the C&C server before it got activated. And if
the code was incorrect the bot would essentially refuse to run. So in these cases we had
to allow a few connections out in order to get it to run.
>>: When you say you're bot detection is automated, it seems that…
>> John John: It is mostly automated, with some manual tweaking.
>>: It seems that in the beginning you do have to go through a lot of manual process just
to write up all these rules. And then you can use some detection by matching their
behavior with some signature that you already have but it's not like that's, you [inaudible]
automatic that generates the signatures or you can only detect on [inaudible] based on
existing signatures.
>> John John: It's usually based on existing signatures because you could potentially
come up with a scheme to automatically detect these signatures, but that would involve
letting the bot run unhindered for a while to observe its actual behavior.
>>: But are you able to express this initial [inaudible]
>> John John: In terms of which?
>>: For example the natural behavior of this bot versus that bot [inaudible] use this
[inaudible] are you able to extract all this…
>> John John: Yes. Each bot runs for a few minutes inside a VM; we look at all the
network connections it makes and that becomes the signature of the binary.
>>: You use the raw network traffic as a signature or do you extract features [inaudible]
>> John John: We extract features, such as which IP address it connects to, which
DNS names it looks up, which port it connects to, what the packet sizes are, and
use that as a signature. So this is one of the cases in which it required some sort of
manual tweaking to get these bots to run. But there are a couple of other challenges
which we occasionally face. So one of the main problems was that there were some bots
which would detect that they were being run inside a virtual machine and they would
self-destruct. So in these cases we need to actually have physical bare metal boxes to run
certain bots. And one of the bots we had in fact did try to use webmail; so instead of using
SMTP, it would connect to Hotmail, log in with a stolen username and password and then
send mail. So in this case we did set up a man in the middle to look at the credentials,
but this was again a very small botnet which did not do a lot of spam so it was not a big
deal here.
So what we have here are two really interesting streams of data. On one hand we have
our bots in Botlab, a few dozen of them which are constantly churning out 6 million emails a day. And what's great here is that with these captive bots you get to see all the email sent irrespective of destination. You see spam sent to Hotmail, Gmail, Yahoo. You
see spam in various languages, and so on. So we essentially have a tiny slice of each
botnet and this gives us a very local view of the spam producers but a global view of all
the spam that is being produced because you see kind of a wide variety of spam sent to all
places. On the other hand we have spam coming in to the University of Washington. So
this is another two and half million e-mails a day which provides a completely different
perspective. So here we receive spam from pretty much every bot node in the world. So
if you have an infected machine, it is likely sending you spam. In a couple of days you
would definitely see spam from pretty much all of the infected nodes out there.
So this kind of gives you a global view of the spam producers. You see all the spam
producers but a very local view of the actual spam, because you only see spam that is
coming in to the University of Washington. You only see spam coming to
Washington.edu; it's mostly English spam mostly targeted at students perhaps. And so
this gives you a local view of the…
>>: [inaudible] that you received from almost every bot in the world. I guess I am going
to challenge that in terms of my assumption is if you have a relatively small number of email addresses [inaudible]
>> John John: Right around 1%. Or it's around 300,000 e-mail addresses.
>>: Yeah that's right but that's come on…
>> John John: .1%.
>>: Wait a minute. 300,000?
>> John John: 300,000 e-mail addresses.
>>: Out of a billion, a couple billion? I guess I'm wondering…
>> John John: So we essentially see spam from…
>>: I would, I would accept that you see from a large number of bots but I wouldn't
necessarily say it was every bot.
>> John John: Not necessarily every bot, but we do see on the order of 1 million IP's a
day. So that is a fairly large number. Yeah, let me rephrase by saying a fairly large
number of these bots.
>>: Well if you say a million I'm, I would bet out of a billion PCs, more than a million of
them are on botnets.
>> John John: This is 1 million a day.
>>: Okay.
>> John John: And over a reasonably long period of time you would see a larger
fraction.
>>: Have you compared this to any external data feed or you could certainly go to
SpamHouse or one of the open public lists of spam websites to see what fraction that you
are getting visible to you versus what is being observed in the world.
>> John John: The problem with the Spamhaus blacklist is that you don't really know the
number of false positives and false negatives in there. But for comparison, what we
found was around 30% of the IPs that we actually observed are present in Spamhaus.
>>: It's the opposite number that you really wouldn't know.
>> John John: Yeah. So you don't really know what coverage Spamhaus would have
either, right, especially since we have no idea of their false positives and false negatives.
>>: This should fall out statistically. I mean, right, if you know that maybe 5% of
Spamhaus is false positives and you know they are listing 100 million IP addresses and
you know you are seeing 2 million, then you know you are seeing 2,000,000/95,000,000.
>> John John: The Spamhaus blacklist is not in the hundred million range
though. I do have access to the Spamhaus blacklist, and it has
roughly 3 million IPs on a daily basis. And this kind of changes
over time as things drop out and things get re-added. So the Spamhaus blacklist
which we looked at had on average 3 million IPs a day.
>>: Are you comparing your numbers with someone running ICSI were [inaudible]
honeypots?
>> John John: Yes. So we do share our information with them. This was done before
their honeypots came up.
>>: No. This was published in 2009.
>> John John: 2008.
>>: 2008. So their first paper, spam architecture…
>> John John: Spamalytics was before that, yes. But their actual spam-- but that was
an incoming spam feed right?
>>: Yes. But this one [inaudible]
>> John John: Yes, yes, yes. Right here was incoming.
>>: So I am just saying, the numbers for incoming spam, how many IPs they see
operating everyday, how much spam they see operating everyday compared…
>> John John: That I have not really chatted with them about their actual numbers versus
ours.
>>: So they didn't report in their paper?
>> John John: Their paper didn't have the number of IP addresses. They were looking
more at the hosting of the spam campaign. So things like where are the web servers
hosted and those kinds of information. Not the actual number of e-mails that they
received. So one point here is that by combining these two feeds of information you're
going to get a lot more than any of these individual feeds, right? And the question was
how would you go about linking this? And what we observed from our data was that
spam subjects are reasonably special. So they are chosen quite carefully for two reasons.
First they have to escape your spam filters, and second they have to be interesting enough
for you to want to click on them. So as a result from nearly 6 months of our data, we
found that there was absolutely no overlap in the subjects between two bots. So this was
on an average of 500 subjects per day per bot and we found zero overlap across any two
botnets. So we decided that looking at spam subjects and comparing them would be a
good way of linking these two different streams of data.
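A minimal sketch of this linking step, under the assumption (borne out by the measurement just described) that spam subjects do not overlap across botnets: subjects observed from the captive bots label the botnet, and incoming spam with a matching subject is attributed to it. The data formats are illustrative.

```python
# captive_subjects: dict mapping botnet name -> set of subjects seen from that
# captive bot; incoming: iterable of (source_ip, subject) from the incoming feed.

def build_subject_index(captive_subjects):
    index = {}
    for botnet, subjects in captive_subjects.items():
        for subject in subjects:
            index[subject] = botnet       # safe because subjects don't overlap
    return index

def attribute_incoming(incoming, subject_index):
    """Yield (source_ip, botnet) for every incoming message we can attribute."""
    for source_ip, subject in incoming:
        botnet = subject_index.get(subject)
        if botnet is not None:
            yield source_ip, botnet
```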
>>: You're only looking at spam that gets caught by ESS.
>> John John: Yes. We do not have access to the other e-mails.
>>: It's because less than 1% of the e-mails, so 99% is [inaudible]
>> John John: 90%.
>>: Okay so the remaining 10%, 90% of that might be great spam that is escaping
your…?
>> John John: Yes. That is possible but we don't have access to that data.
>>: And those might be other botnets that are [inaudible] higher-level [inaudible]
>> John John: Absolutely. They might be different botnets but in terms of actual total
volume it would still be a smaller fraction, even though they might be more effective.
>>: Yes. They might be affecting PCs far more often.
>>: [inaudible] classified with those others there are millions and millions of [inaudible]
>> John John: That will come up in the next couple of slides, the number of botnets we
found. From this linking these two streams of information we decided to ask a couple of
questions. There are more questions in the papers but for now I am just going to look at
who is sending all the spam and what are some of the characteristics of these botnets. So
the first thing we found was that nearly 80% of all the spam came from just six different
botnets. And a single botnet called Srizbi was responsible for almost 35% of the spam.
So this kind of means that if you could knock a few of these botnets out, you are going to
significantly reduce the volume of spam on the internet.
The question now is how difficult is it to actually knock out a botnet? So for that let's
look at a couple of characteristics of these bots. So the first thing we observed is that
most of the botnets we ran contacted only a very small number of C&C servers, on the
order of a dozen. And in many cases the information about which IP address to contact
was hardcoded in the binary. So if you could essentially block access to this IP address
the bot would be headless and it would have no idea of what to do. And in fact in
November 2008, a hosting company McColo in California was taken down by researchers
and law enforcement, and overnight the volume of spam decreased by almost 80%. And
the largest botnet Srizbi was knocked off-line and never came back. So some of the other
interesting characteristics that we found about these botnets was that you could possibly
fingerprint which botnet was sending the mail based on the spam sending rates. So this
varies from 20 messages a minute to a crazy 2000 messages a minute. We also looked at
the mailing list overlap across botnets; if you are a spammer you would
preferably want to rent out multiple botnets in order to reach a wider audience, because
between any two botnets the overlap in mailing lists was only 30%.
And finally we also looked at the active sizes of these botnets and that varied from
16,000 to 130,000 with respect to how many of them are actively sending spam on a daily
basis.
>>: How did you get the [inaudible]
>> John John: Based on what we see, based on our incoming spam. So for 80% of the
spam that comes in, we can say which botnet it belongs to, and then we look at how many
different IP addresses sent it, so this is
definitely a lower bound on the number of spambots.
>>: These characteristics, so it seems that like just the first two of them are not
[inaudible] maybe because they are because they have such a large volume they don't
care if they are caught by anybody. So maybe be smarter if you don't have to contact if
it's set up you don't have to send this many again they can adjust their sending rate.
>> John John: So back in the day most of the botnets that we looked at had very simple
central control mechanisms. They would contact one fixed IP address or a bunch of IP
addresses. And that has kind of changed slowly over time. And it still has not moved
to a fully decentralized control network. It was only the Storm botnet that used a decentralized
network, but all the other botnets even today still use a simple HTTP central controller.
The way they access the controller has changed. So now instead of contacting a
particular IP address they use a DNS name which is algorithmically generated. So each
day it would generate a DNS name and look that up. And one of the problems with these
approaches is that if researchers reverse engineer this, what they do is they pick a
date which is say next month and they buy that domain name. So on that day at least you
get control of all the bots. So a simpler mechanism would ensure that there is no
infiltration, but they do have techniques to kind of spread out a bit.
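To illustrate the algorithmically generated DNS names and the pre-registration trick just described, here is a toy domain-generation sketch. The hash, seed, and TLD are invented for illustration; real botnets each use their own algorithm.

```python
import hashlib
from datetime import date, timedelta

def daily_domain(day, seed=b"example-botnet-seed"):
    """Derive a fresh rendezvous domain from the date (toy example)."""
    digest = hashlib.sha256(seed + day.isoformat().encode()).hexdigest()
    return digest[:12] + ".com"

# The bot looks up today's domain to find its C&C server:
today_cc = daily_domain(date.today())

# A researcher who reverse engineers the algorithm can buy a future day's
# domain in advance and, on that day, control the bots that look it up:
future_cc = daily_domain(date.today() + timedelta(days=30))
```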
>>: [inaudible] have this characteristic, the biggest
>> John John: These are the biggest botnets.
>>: Yeah I think we [inaudible] the smarter, this Storm botnet. They only generated a
small number of e-mails the small but it was hard [inaudible] may be more effective
[inaudible]
>> John John: I think Stefan's group actually did look at the Storm botnet and in terms
of actual delivery effectiveness it's the same as all the other botnets. It was not any more
effective than all the other bots. And they also had this additional problem that since
their control network was decentralized as part of a DHT, researchers could easily
infiltrate the DHT and then had control of these bots and could make them do what they
wanted. So it's kind of a trade-off between having full control and having reliability.
And so far it turns out that botnets have not needed to branch out and even today they still
use a central controller with a small number of servers. And the main reason for that is
that if the servers are not hosted in the US, if they are hosted in Eastern Europe or in
Russia, it's really hard to take them down legally. And so this is sufficient for their
current purposes.
>>: [inaudible] best botnets in different countries. So how do you [inaudible] solve a
known [inaudible] because you are saying that they are centralized but [inaudible] these
places but [inaudible]
>> John John: This is the way it was in 2008. And as I mentioned as of now things have
kind of diversified a bit. But they still only contact a small number of hosts. They do not
need to contact a large number of hosts yet, because they are hosted in different
countries. So there is no legal framework which makes it easy to take down these nodes
so a small number of nodes is sufficient for them to get by. But naturally they would
diversify and have a larger set as you go-- it's an arms race, which as you raise the stakes
they are going to try something different. And so this is the state of botnets as it was a
couple of years ago. Okay so what did we get with Botlab? So what Botlab gave us was
a real-time feed of malicious links and spam that was being sent out. This would be
useful for having safer browsing because you know in real time what bad links are currently
going out, before they have been caught and added to safe-browsing databases. And also better
spam filtering because now you have a pure feed of spam that is being produced and this
was used again by UCSD folks to generate a signature-based detection for botnet spam.
And the details of command-and-control servers and the communication is useful for
detecting network-level bot activity and also for C&C takedown. So the information
from Botlab was provided to law enforcement agents and antivirus companies so they can
go about doing their business.
So now we have seen kind of the end result of what botnets do. And the rest of the talk is
going to focus on how they propagate and how they work. Yeah?
>>: [inaudible] Focus on spam sending bots. Some of the detection mechanisms
[inaudible]
>> John John: So the other things bots do I mentioned in my initial slide, I said this is the
bad things that happened on the internet. You have got click fraud, bots that do click
fraud; you have got bots that send out phishing attacks, bots that do denial of service
attacks. The reason we picked spam was that it was easy to see both parts of the attack.
You get to see spam that is being sent out and you also get to receive spam. This is not
really true of click fraud. You need to be a search engine or a large advertising provider
to observe click fraud from the other side. And the same thing with denial of service
attacks. Unless you are a big company, you don't get to see denial of service attacks. So
the reason we picked spambots was so you can get both sides of the picture. And other
bots are also similarly operated. In fact a lot of the spambots they partition some portion
of their botnet to do various activities and these are rented out to people who want to do
different things. And we do see some click fraud in the outgoing fashion but we did not
focus on it because we could not get a complete picture. But some of the techniques we
used here could also be used for other kinds of botnets because they use similar control
mechanisms.
>>: So based on the information provided by botnets [inaudible] so what about just
directly use the spam feed where they come from [inaudible] and use our spam detection.
It seems that this approach could be very similar [inaudible] depending on which idea,
but what are [inaudible] does seem to be a much simpler approach so what are [inaudible]
>> John John: So here you kind of get attribution. So there you are going to see that
these are the botnet IP addresses out there but you don't really know which botnet they
belong to.
>>: [inaudible] based on e-mail subjects [inaudible] cluster them…
>> John John: You can cluster them; so that is something that we learned from Botlab.
So you don't have to keep running Botlab you could use the information you gained here
in order to continue without it. The fact that the spam subjects are in fact unique is
something that came out of our running of the actual bots.
>>: So you use honest signature to attack the [inaudible] subject [inaudible]
>> John John: We also did look at other signatures and their SMTP headers which kind
of suggested that they were the same thing.
>>: But SMTP headers are going to get the same information from, from the e-mail
spams directly as well, right? So what other significance [inaudible] botnets [inaudible]
>> John John: The only thing that we know is the set of subjects that are being currently
sent out by each botnet. And we found that these subjects are unique. And this
information can now be used even without the presence of Botlab; that is true. Botlab
was kind of a bootstrapping process which let you understand how some of these bots
operate.
>>: What I mean is I can extrapolate this same information about the botnet from just
directly getting the e-mails [inaudible] based on what emails I get in and then use a spam
filter [inaudible] same subject in same sender with the same headers so why do I still
need to [inaudible] and doing this kind of thing quite
>> John John: You get to see things like which of the command-and-control servers they contact. What is the control infrastructure like? So these are two
different aspects of the information that you would see. Okay so, let's sort of backtrack
and see how it all begins. So the question we asked was, one of the things we found was
that some of the [inaudible] which are used for self-propagation, the servers which host
malware, are in fact legitimate sites which had been compromised.
And the question was how do attackers find these vulnerabilities on the
internet? And how do you find anything on the internet? You search for
it. So here is an interesting thing that we found: search engines
are really good at crawling and indexing everything that is accessible. And in many cases
a poorly configured server might expose sensitive information that can be then used by
attackers so attackers can then craft malicious queries which would give them this
information. So let me give you a concrete example of what I mean. So here is a posted
exploit for a PHP-based content management system. So this is an application running on
top of your web server. The application is DataLife Engine; it's a content
management system.
And version 8.2 of this DataLife Engine has a remote file inclusion vulnerability which
means that any third party can store an arbitrary file onto this web server. And they
helpfully provide a search term which can be used to find such servers, which in this case
is "powered by DataLife Engine". So in all these web applications, in the case of DataLife
Engine, at the very bottom of the webpage that is generated you have the stamp "powered
by DataLife Engine" plus copyright, trademark and all the other things. You pop this into
a search engine, in this case Bing,
and you find hundreds of thousands of servers. And some fraction of the servers would
in fact be running version 8.2 that suffers from this vulnerability.
So now you no longer need to brute-force search the entire internet for all potential
vulnerabilities; you have used a search engine to shortlist your search to a narrow set
of things that you can now easily attack.
>>: Where do you cover this [inaudible]
>> John John: That is posted on hacker forums. So you've got lots of these underground
forums where hackers share their information and you can kind of bootstrap your system
from this. Overall search engines kind of make it easier for bad guys to go about their
business, right? And our goal here is, now that you know what kind of queries
attackers use in order to find vulnerable servers, the question is can we use this
information to essentially understand how attacks happen and possibly detect new attacks
before they are out in full force. So we want to essentially follow attackers’ trails and
have them be our guides. In order to do this we have access to a good data set which
happens to be the Bing data set. So we had three months of sample logs from Bing. This
is 1.2 TB of data containing billions of queries.
And so with SearchAudit we have two stages. First we have the identification phase,
wherein we try to detect malicious queries and this is an automated process where we
start with a known seed set, expand it and generate a list of all malicious queries. And
the second stage is the investigation phase where we can manually analyze these queries
and try to understand the intent of the attackers. Let me quickly walk through the
identification phase. So here we start with a small set of known malicious queries. And
these can be obtained from a variety of sources, in our case we kind of look at hacker
forums. So one of them was Hack Forums, another is milw0rm, where they post exploits
and the sort of queries you could use to find these vulnerable servers. We crawled
these forums and we start with a seed of 500 queries. And this was from a clear period.
So we started with a small set. And now we have on one hand we have the set of seed
queries which we know to be malicious. And we also have the search log which is a set
of all the queries that were issued to Bing.
And then it becomes reasonably straightforward to see which of these malicious queries
actually show up in the Bing search logs. And once you have this, you kind of know who
are the people issuing these queries. And one of the things we find about attackers is that
they don't issue one query and stop; they issue a bunch of queries, and so you kind of
now have a larger set of queries which you did not find in your seed set but are able to
find through the search logs. The next step is to generalize these queries. So one of the
observations we made was that attackers don't use the same queries very often. They
make changes to suit their needs. So one of the things they do is they sometimes say they
restrict the domain to which they want search results to [inaudible]. So they might be
only interested in .EDU domains which are running on a particular site because they have
a higher page rank and are more valuable for their time. And in some cases we find them
adding random keywords to the query string so that you get a different set of search
results.
An exact query match, an exact string match does not capture these variations, so we
decided to use regular expressions. In this case we feed all of these queries to our regular
expression generator. This was [inaudible] which was in SIGCOMM 2008, from folks at
SVC. This kind of helps you capture the structure, the underlying structure of the query.
And you get to match all queries that are roughly similar even though they are not an
exact match. And here is a regular expression tool, and once you have this set of regular
expressions you can run this on top of your search log and find all of the various queries
which are similar. And now once you have this you can think of this as your new seed
set and these are all malicious queries why don't you repeat this process. So we
essentially do that until we get a fixed point where we feed this back into our system and
look at the query expansion. And we find our final set of malicious queries. And
typically it converges in one or two iterations, so it wasn't a big deal.
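Here is a rough sketch of that identification loop, assuming the search log is available as (client_ip, query) pairs; the regex-generation step is abstracted behind a build_regexes() placeholder, since the talk only names the SIGCOMM 2008 tool it reused.

```python
def search_audit(seed_queries, search_log, build_regexes):
    """search_log: list of (client_ip, query); build_regexes: queries -> compiled regexes."""
    malicious = set(seed_queries)
    while True:
        # IPs that issued at least one known-malicious query
        bad_ips = {ip for ip, q in search_log if q in malicious}
        # other queries those IPs issued (a real system also filters out
        # popular benign queries before accepting them)
        expanded = {q for ip, q in search_log if ip in bad_ips}
        # generalize exact strings into regexes, then re-match the whole log
        patterns = build_regexes(expanded)
        matched = {q for _, q in search_log
                   if any(p.fullmatch(q) for p in patterns)}
        new = (expanded | matched) - malicious
        if not new:                       # fixed point, usually 1-2 iterations
            return malicious
        malicious |= new
```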
So this is some of the data we have from a week in February 2009. We found these sorts
of malicious queries from nearly 40,000 IP addresses. They issued 540,000 unique
queries, which are all different, for a total of 9 million queries. So this
number comes from the fact that many queries that are repeated multiple times in order to
get different pages of the search results. So they issue the same query and look at the
second page, issue the same query look at the third page and so on. And so in a week we
found nearly 9 million queries for these kinds of vulnerable sites.
And what kind of attacks did we find from these queries? So this is a part of our analysis
phase where we looked at these queries to figure out what they are looking for. And one
of the things that we found, naturally they were searching for vulnerable web servers and
we found that nearly 5% of the returned search results show up in blacklists at a future
point in time, and 12% of these returned servers were in fact vulnerable to SQL injection.
And we also find queries which are trying to find forums and blogs to post spam
comments on. So you would find queries of the form SueAnnesblog@ comment@blog
post. Yeah?
>>: Regarding the first one. So if my query is only for [inaudible] something. So that's
actually returning all of the websites that run that software, but it was different for this.
Some servers are vulnerable some servers are not. So in terms of just finding [inaudible]
new set of rules to identify how vulnerable web servers…
>> John John: You're going to see a bunch of these queries, and there are some other
queries which look at more than just the "powered by" string; they also have things that
are specific to the version, so a particular version of the application uses a particular
set of paths in the URL.
>>: I am looking for is do you think that you can generate a set of vulnerable web
servers just simply by analyzing the queries?
>> John John: Potentially vulnerable web servers. They are not all vulnerable.
>>: So when you say potentially, that means they could have a lot of this possible?
>> John John: Yes. In our case we found approximately 5% of them actually show up.
>>: So 5% is actually based on the future, some future event that you can cross check.
When you actually fine-tune for that, that means a large number of them are not
vulnerable.
>> John John: There are not necessarily vulnerable, but they are being targeted by the
attackers because they are issuing these queries with the intent of trying to compromise
them.
>>: Yeah. Okay. But basically you cannot.
>> John John: You cannot say for sure whether something is vulnerable purely on this.
>>: Just a quick question on this previous situation. So you have those [inaudible] dorks,
then you expand those dork sets, so how big was this set compared to those
initial dorks that you found on milw0rm for example?
>> John John: So the initial set of dorks was 500 queries and the final one was 540,000. But
these are kind of variations of similar dorks. In many cases they are other dorks which
the attackers issued, and in many cases there are variations where they add things at
the end and the beginning, and keywords.
>>: But you said that you also expanded it all based on what else they were searching
for, based on IP addresses. So essentially this would not terminate until you get the
whole Bing index, so how would you differentiate between this is a dork and this is just
an IP that looks up a string on Bing?
>> John John: Only the ones which started with these dorks, right? So we look at the IPs
that issued at least one of these dorks, and we look at the other dorks that they issue, or
other queries that they issue.
>>: But say, I am an attacker. And I Google for a dork and then I want to search, you
know, whatever else is on my mind today. So then I would get everything…
>> John John: Yes, so we also do look at whether a large number of people issued
similar queries. There are some overall criteria that have to be met. So we do a bit of
filtering; we don't blindly take all the other ones. We do filtering to make
sure that that doesn't happen.
>>: What is the baseline for your top [inaudible]?
>> John John: Baseline in the sense?
>>: You said 5% or in [inaudible] randomly sample web servers on the internet…
>> John John: .5%.
>>: So this is 10 times more likely to be on a Blacklist?
>> John John: Yes, 6 to 10 times. It varied from .5% to 1%. So that was the baseline,
and for SQL injection, we found 2% to be vulnerable if you search at random. And one
of the other attacks we found was actually an ongoing attack, an ongoing phishing attack
of live messenger user credentials. So you had attackers who would compromise a live
messenger user, and send out a link, a phishing link to the user who would then click on it
and be shown some things and his account would also be compromised. Looking at the
SearchAudit results we found nearly 1,000,000 such compromised accounts that would
have appeared over a year. And this was something, yes?
>>: [inaudible] how did you get hold of the compromised accounts maybe it's out of
scope but I'm…
>> John John: The way it worked was that when you click on one of these links what
happens is that it issues a query to Bing. The way they set it up, this was purely
incidental. It was not anything special to this attack. It just happened to make use of the
Bing search engine. It would issue a query to Bing with the referrer field containing the
username of the person who clicked on the link. And that's how we were able to see
which set of users had been compromised. And then we also later did some cross
analysis which showed that these accounts had in fact been accessed by the attacker from
an IP in Singapore and verified that these were in fact compromised.
As soon as an attacker starts this process of finding these vulnerable web servers, you
know which servers are in the crosshairs, so you can potentially proactively defend
against these attacks even before they are launched. And the search engine could try to
block such malicious queries and sanitize these results to make it harder for the attacker
to go about finding these things. So eventually we can use it to detect new attacks as they
come up and potentially also find the attackers.
Alright, so the next step: once you have this notion of which servers are being targeted,
what we wanted to know is what the attackers do next. What is their next step?
How do they actually go about compromising these machines? And in order to do that
we take a page from the attackers’ playbook. We create fake pages that look vulnerable
and these pages are now crawled by the search engine, and when attackers issue these
queries, they get our pages and when they try to attack us we get to observe their attempts
firsthand. So let me quickly run through the architecture of these heat-seeking honeypots
which kind of give you this information. First we have the malicious query feed which
we get from the SearchAudit, so we know the sort of queries the attackers are issuing to
the search engines. We issue the same query to Google and Bing and we get these pages.
We fetch these web pages and store them and set them up in our honeypots. These are
now crawled by all the various search engines. The next time an attacker issues a query
for a similar term, our pages get returned. And then they kind of try to attack us and we
get to see firsthand how they go about this.
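A sketch of that heat-seeking loop, with the search, fetching, publishing, and logging components left as placeholder callables since the talk describes them only at the block-diagram level.

```python
def run_heatseeking_honeypot(malicious_queries, web_search, fetch_page,
                             publish_on_honeypot, attack_log):
    # 1. See what the attackers would see for each malicious query.
    for query in malicious_queries:
        for result_url in web_search(query, top_k=20):
            page = fetch_page(result_url)
            # 2. Host a look-alike copy so crawlers index us under the same query.
            publish_on_honeypot(query, page)

    # 3. Later, every request against the honeypot pages is an attack attempt.
    for ip, method, path in attack_log:
        print("attack attempt:", ip, method, path)   # e.g. uploads, SQL injection
```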
>>: Your pages will be returned way down the road [inaudible] right?
>> John John: Yes.
>>: And so why [inaudible]
>>: Because they want 1 billion of them.
>>: Even though you might need 10 million as a target…
>> John John: Within 1000 pages we still get returned, and since we have pages on .edu and
Microsoft.com linking to our honeypot pages, the rank is [inaudible] higher than it should
otherwise be.
[laughter]
>> John John: So once they find these pages, we kind of install the actual software to see
how the compromise really happens. And the actual compromise turns out to be quite
straightforward. So osCommerce is web software for managing shopping carts.
So if you are running an Amazon-like website and you wanted to have a shopping cart, you
would use osCommerce. And if the site is hosted on example.com/store, the way you
would actually compromise the site is straightforward. You would visit this URL and
present the file to upload, and now it is hosted on their web server.
And this could be any file. It could be an executable file. It could be a PHP file which
can essentially run with the privilege of your web server. And what they do after this is
quite interesting. Most attackers typically host a PHP-based file management system. So
it is like a shell which gives you a graphical interface to delete files, upload files, change
permissions, perform a brute force attack on your /etc/passwd file and whatnot. So this
is one of the typical things that attackers do after they have compromised the
server. And they can host any number of malicious files on here and then send out links.
So from our honeypot we set up nearly a hundred…
>>: [inaudible] change computers or something so [inaudible]
>> John John: It does not, at least for the smaller web servers. And actually the ones
which are attacked by this are not the well-administered ones; these are free, open-source
software packages, usually run on smaller servers without a good security setup. So we ran our
honeypot for three months and with 100 pages set up we found nearly 6000 different IP
addresses [inaudible], so not as large as it would have been if we had had a highly ranked
site. We had nearly 55,000 attack attempts. And the honeypots saw all sorts of different
attack attempts such as trying to get admin access, brute force password attacks, SQL
injection, cross-site scripting and the whole gamut of things. Yes?
>>: [inaudible] attacks were result of the honeypot instead of just random [inaudible]. In
other words I ran a number of web servers that are constantly just getting scattershot…
>> John John: Right. So we do have a baseline case of just a [inaudible] running just a
plain web server. And we see four or five different attacks, whereas with the
honeypots we see a larger variety of attacks.
>>: How do you run them side by side?
>> John John: Oh, this was like before and after.
>>: But two different IP's, right, at different times?
>> John John: No, at the same time, two different IP's.
>>: Two different IP's, but the IP's are right next to each other or something?
>> John John: They are on the same domain; they're all in washington.edu. And now
for the last part, which should be pretty quick. So what happens after the
site is compromised is that they host malware pages on it and then they spread
these links through either e-mails or IMs or, in this case, through search engines. So let
me give you a video example of how this works. You type a benign query into Google,
in this case Flintstone pictures on MySpace, and it happily
helps you autocomplete and search for the results. Click on the very first link, and that turns
out to be a compromised link, because now it shows a big pop-up that says your computer
is infected. And you click okay. It scans your Windows drive, your C is infected, your B
is infected, you've got a whole bunch of things going on, and now you have a choice of
either protecting your PC or ignoring it. No matter what you do it tries to download a file,
and if you actually save and install the file you have now been compromised. So this is a
rather common form of social engineering, which today is called a scareware attack.
>>: You're running that on a Mac.
>> John John: No, this is actually Windows. This is a video.
>>: [inaudible]
>> John John: So this, it runs on a Mac too; you get the same thing.
[laughter]
>> John John: Full-screen doesn't look quite so realistic, so I had to run it on Windows to
show you. Now, is this really a problem? Well, it turns out nearly half of the popular
search terms contain at least one malicious link in the top results, and that is quite bad.
And just last year, this sort of scareware fraud cost nearly $150 million.
Now where is the money coming in? Well, once you install this fake antivirus it runs in
your taskbar and every 30 days it pops up a message saying your protection is running out,
please buy the full version for $30. And it turns out at least 5 million people did fall for
that and paid $30 to buy it.
>>: So Scareware doesn't take over your machine and scan all your files and blackmail
you? It just tries to sell you something?
>> John John: So this one does not. But you do have things that…
>>: But it's software.
[laughter]
>>: It does an update.
[laughter]
>> John John: There is a similar thing called ransomware, which would
essentially encrypt your C drive and then ask you to wire over so much money before it
gives you your password.
>>: [inaudible] you mean the top 10?
>> John John: Top 50, top 50 yeah. And this was mostly a problem with Google not
with Bing.
>>: I feel like [inaudible] getting an oil change or something. It's just weird.
>> John John: So I guess they could use it also as a dropper, because once you install a
piece of software, it doesn't have to be just this fake antivirus; it could be anything of
their choice. For now this seems to be a good [inaudible] approach that they have stuck
to. So our goal here is to understand how these search engines are getting poisoned.
How is it possible for them to [inaudible] poison a top news result and get it to the very top
of the search results? For this we looked at a sample attack; this was a real
attack that was in progress. And it contained nearly 5000 compromised
web servers. And these have a very strongly connected cross-domain link structure, so you have
each of them pointing to 200 other sites and so on. And once you click on any
of these results it redirects you to an actual exploit server which serves the malware
exploit. And these were hosted on nearly 400 domains in the US and in Russia.
And one of the things we observed was that the log files on these servers were not very
well protected, so we were able to access these logs and we could see the
different things that they were redirecting to and how many victims actually clicked
through and ordered. And over a 10-week period we found that over 100,000 users had actually
clicked through to the final scareware page.
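As a rough illustration of the kind of log analysis described above, the sketch below counts distinct victims that reached the final scareware page; the log format and field layout are assumptions for the example, since the attackers' real format is not described in the talk.

```python
# Count distinct victims in redirection-server logs (illustrative sketch).
# Assumed line format: "timestamp client_ip referrer target_url", one hit per line.
def count_clickthroughs(log_lines, scareware_hosts):
    """Return (number of distinct victim IPs, hit counts per redirect target)."""
    victims = set()
    hits_per_target = {}
    for line in log_lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip malformed lines
        _, client_ip, _, target = parts[:4]
        hits_per_target[target] = hits_per_target.get(target, 0) + 1
        if any(host in target for host in scareware_hosts):
            victims.add(client_ip)
    return len(victims), hits_per_target


# Example (hypothetical file and domain):
# count_clickthroughs(open("redirector.log"), {"fake-av.example"})
```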
>>: When you say they weren't well protected?
>> John John: They were not password protected. The log files were on the web server
and could be read. So we didn't have to do anything shady to access the files.
>>: [inaudible] is that part of the no crunch and no crawling like if [inaudible] was faulty
[inaudible] access [inaudible]
>> John John: This is a log stored by the attackers.
>>: The attackers decided to [inaudible] to make it a request where you don't have to
authenticate, so you just had to pretend to be them, you didn't have to do anything other
than just say may I have it?
>> John John: It was capability-based: if you know the name of the file, you can
access it.
>>: It seems a bit optimistic that the attacker happened to store a log file?
>> John John: Yes.
>>: [inaudible] everyone access with no protection.
>>: What do they care?
[laughter]
>>: So those logs are separate on those 500 servers, right?
>> John John: Actually the log files are on a redirection service, so all of these 5000
servers funnel into three servers which are responsible for redirecting to the final
server.
>>: And you verified those log files? Because the possibility is that you have
interfaces or log files basically producing random outputs. So if you go there today they
will say 100,000, and if you go there tomorrow from a different IP it was a…
>> John John: No. It was definitely verified [inaudible]. Every time I visited it my IP
and everything was [inaudible].
>>: Okay.
>> John John: So some of the prominent features of this attack were that nearly 20,000
keywords were poisoned. And where did the attackers find these keywords? They come
from Google Trends: essentially each day Google produces a list of the
top 10, top 20 search terms. And these hackers take these 20 search terms, they push them
into Bing, and for each of these terms they get another 10 related search terms. And they
collect all of these over several days and they have a huge set of keywords that
they can now poison. And more than 40 million pages were indexed, and all this
happened in just 10 weeks.
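A minimal sketch of the keyword-harvesting loop just described; the two fetch_* helpers are hypothetical placeholders for scraping Google Trends and a search engine's related-search suggestions, not real APIs.

```python
# Seed with daily trending terms, then expand each seed through related searches.
def fetch_trending_terms(day):
    """Hypothetical placeholder: the day's top ~20 trending search terms."""
    return ["trending term %d-%d" % (day, i) for i in range(20)]


def fetch_related_terms(term):
    """Hypothetical placeholder: ~10 related-search suggestions for a term."""
    return ["%s related %d" % (term, i) for i in range(10)]


def harvest_keywords(days):
    keywords = set()
    for day in days:
        for seed in fetch_trending_terms(day):
            keywords.add(seed)
            keywords.update(fetch_related_terms(seed))  # ~10x expansion per seed
    return keywords


# Roughly 70 days x 20 seeds x ~11 terms each is on the order of 15,000 keywords,
# the same ballpark as the ~20,000 poisoned keywords observed in the attack.
print(len(harvest_keywords(range(70))))
```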
So in order to detect these things we make use of the features that we found. One of the
features is that they have a very dense link structure across them, hundreds
of sites linking to each other. There are many popular search terms in the URLs; pretty
much anything related to Justin Bieber would show up there. Then there were a large
number of new pages. So once a server got compromised you will find that an attacker
suddenly hosts a thousand or 10,000 new pages on the server, and these new pages
are very similar across multiple domains, since hackers typically use scripts that go out,
attack, and host pages on these servers. They are very similar across these domains.
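One way to operationalize the dense-link-structure feature, sketched below as an assumption rather than the actual detection code: build a domain-level link graph and flag unusually large strongly connected components, since thousands of otherwise unrelated domains all linking to one another is exactly the signature described above.

```python
# Find large strongly connected components in a cross-domain link graph
# (Kosaraju's algorithm); a big mutually linking group is a poisoning signal.
from collections import defaultdict


def strongly_connected_components(edges):
    """`edges` maps a domain to the set of domains it links to."""
    graph, reverse, nodes = defaultdict(set), defaultdict(set), set()
    for src, dsts in edges.items():
        nodes.add(src)
        for dst in dsts:
            graph[src].add(dst)
            reverse[dst].add(src)
            nodes.add(dst)

    order, visited = [], set()
    for start in nodes:                   # first pass: record DFS finish order
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    assigned, components = set(), []
    for node in reversed(order):          # second pass: DFS on the reversed graph
        if node in assigned:
            continue
        comp, stack = [], [node]
        assigned.add(node)
        while stack:
            cur = stack.pop()
            comp.append(cur)
            for prev in reverse[cur]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(comp)
    return components


def suspicious_domain_groups(cross_domain_links, min_size=50):
    """Flag unusually large groups of domains that all link to one another."""
    return [c for c in strongly_connected_components(cross_domain_links)
            if len(c) >= min_size]
```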
>>: There is a long history of [inaudible]
>> John John: Yes. This is not necessarily generic SEO; this is focused only on the ones
which are fully compromised sites. SEO spam is typically done with sites that are set up for this
purpose, but with compromised sites you have a different model. They have a
particular behavior up to a certain time, and once they get compromised they completely
change phase and do something different.
>>: [inaudible] for a hacker to put on a compromising website into [inaudible] they will
use a similar attack [inaudible] so nonmalicious websites are kind of optimized
[inaudible] we use similar techniques, right? And then you can in that sense your
techniques can [inaudible] optimize compromised websites would be similar search
engine to the hacker as you optimize them [inaudible]
>> John John: Yes, but in our case it's probably a little easier, because there is a sudden
phase change after the compromise, so we can make use of that to determine which sites
have been compromised.
>>: I see you assume you have a before and after…
>> John John: So we have the web logs, the historical information about each
website.
>>: I see.
>> John John: So it's actually easier than the normal SEO detection which search engines
have to do.
>>: I see, I see.
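To make the phase-change intuition just discussed concrete, here is a minimal illustrative sketch; the thresholds and the idea of using daily counts of newly seen URLs from historical crawl data are assumptions, not the system's actual rules.

```python
# A site that has been quiet for months and suddenly publishes thousands of new
# URLs in a few days looks compromised (illustrative thresholds only).
def looks_compromised(daily_new_urls, window=7, spike_factor=50, min_pages=1000):
    """daily_new_urls: chronological list of counts of newly seen URLs per day."""
    if len(daily_new_urls) <= window:
        return False
    history, recent = daily_new_urls[:-window], daily_new_urls[-window:]
    baseline = max(1.0, sum(history) / float(len(history)))  # avoid divide-by-zero
    return (sum(recent) >= min_pages
            and sum(recent) / float(window) >= spike_factor * baseline)
```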
>> John John: And some quick results. We found nearly 15,000 URLs and 900 domains
corresponding to multiple compromised-site SEO campaigns. We picked 120
popular searches and found that 43 of the searches did in fact have compromised
results, and 163 result URLs were compromised. And to conclude, today's malware landscape is rather
complex and we need a multi-pronged strategy to address these various attacks.
So we use SearchAudit, deSEO, heat-seeking honeypots and Botlab as defensive tools.
And we found that monitoring attackers often reveals new attacks and that infiltration is a
rather effective mechanism, but it has to be done carefully. And with that, let me just
quickly mention I am also on a bunch of non-security projects: a couple of them are
Consensus Routing, which does consistent routing at the inter-domain level,
and also Hubble, which is a system for studying reachability problems in the internet, and more
recently Keypad, which is a file system for theft-prone devices. And if you want more
information on bots, you can go to Botlab.org. And that is it.
[applause]
>> John John: Yes?
>>: You alluded to it in the slide you had 120 searches on Google with 43 found with
malicious… What about Bing results?
>> John John: One. Bing had one as I recall.
>>: Are you going to publish that somewhere?
>> John John: Oh yeah, it was there in the…
>>: Yeah, because I mean the Bing guys have been working.
>> John John: It's there in the paper. For sure and yeah, Bing had only one malicious
[inaudible].
>>: What is the scale for [inaudible]
>> John John: For spam links. In our case we were looking only at this sort of compromised-site
SEO.
>>: I guess this is a follow-up to parts of the solutions, but [inaudible] features that you
detected identify cases of SEO [inaudible] seems to be necessary, because the fact that
there are a large number of pages, or a large number of new pages, or pages that are very
similar, if somebody knew that by doing that they were going to get figured out, they
would work around it. So it didn't seem necessary…
>> John John: A couple of conditions there are necessary, like the dense link structure,
because it finally depends on the PageRank algorithm.
>>: What if I just have some random links? To get rid of the dense link structure, even
that doesn't seem necessary.
>> John John: No, if you just look at the dense link structure across these things, if you
look at the completely connected or the strongly connected components, that is kind of
necessary in order to boost your [inaudible]; you need a lot of incoming links.
>>: It kind of depends. If you only have 100 maybe you need strong links. Maybe if
you have 10,000 or 20,000 you would still boost up without being strongly connected on
your setup. All you have to do is get it down to the level [inaudible]-based sample
[inaudible]
>> John John: So yes, in that case it is not necessary; it is possible for them to move
things around. But in terms of getting to the top very quickly they do need things
like relevant information in the page, right? If you just focus on the top search
terms, that is sufficient for you to look at a smaller set of web pages and then
do your analysis on those. So one of the things which they found necessary was to target
relevant search terms, because there's not much historical information about these things.
It's really hard to get Bank of America out of the top spot, but for something like
the tsunami or some other breaking event, the search engines won't have any historic
information about these things, and so it becomes easier to game the system for short-term
events.
>>: Another way, I don't know if this would work, but another way to distinguish
this kind of SEO from companies that are actually trying to boost their search rankings is
that in one case if you e-mail the site administrator they will say, oh my God, my site has been
hacked, and in the other [inaudible] but that's the way it's supposed to look, it's got a great
[inaudible] it's got all sorts of [inaudible]. Would that work or would the [inaudible]
>> John John: Most site owners do not respond, and one of the things we found is that once
you recognize a site has been compromised, any attempt to contact them could create
some sort of legal liability, because in many cases they would be like, oh my God,
Microsoft attacked me.
[laughter]
>> John John: So if you send an e-mail from Microsoft and their site has been
compromised, that is the kind of reaction they would have, and so the lawyers are like, if you find
something, just don't do anything.
>>: [inaudible]
[laughter]
>>: [inaudible] has like a crawler for websites [inaudible]
>> John John: Yes. So we may end up using the historical Bing web information.
>>: [inaudible] websites?
>>: That’s slander. [inaudible]
>>: So this is one for the audience, but I'm assuming that most people came here because
they are interested in botnets and research on that. Do we have an internal [inaudible] for
discussing this kind of thing? Should we? How many people would be up for this
kind of discussion? Not that many, okay. Okay, well, maybe we should just get together
whenever this is over, share some notes and get something set up.
>> Srikanth Kandula: Let's thank the speaker one more time.
[applause]