Classifying and Filtering Spam
Using Search Engines
Oleg Kolesnikov
College of Computing
Georgia Tech
9/2003
>50% of all e-mail today is
spam?
Source: brightmail.com
Scale
• IDC: of 31bn messages sent each day, 18%,
or 5.6bn were s[pc]am messages
• Brightmail decoy network stats:
6.7 bn spam messages sent in March, 2003, varying from
100 to ~100,000 identical e-mails sent at a time
Current techniques to deal with
SPAM/UCE:
• Blacklisting
• Signature-based Filtering
• Statistical/Bayesian Filtering
• Heuristic Filtering
• Challenge-Response Filtering
• Sender-pays
• Laws
Blacklisting
• MAPS (Mail Abuse Prevention System) RBL
catches only 24% of spam with 34% false
positives (the spam police article, gaudi/gaspar)
• Self-appointed sheriffs/vigilantes, legitimate
business increasingly caught in crossfire, e.g. iBill
was losing $100k/day during each of the four days
of blacklisting
• Only a first cut at the problem; never blacklists more
than 50% of the servers sending spam (Graham)
Sample and Signature-based
Filtering
• Set up a network of DECOY e-mail addresses.
Any message sent to these addresses must be
spam => if the same message is sent to a protected
address, that message must be SPAM, too (that’s
what Brightmail does)
• Not very flexible -- spammers take the lead in
coming up with tricks
• Example trick: make each spam message different
Brightmail (used by MS/Hotmail,
Earthlink, Verizon, eBay, etc.)
Basic Statistical Filtering
• Weakness: must be TRAINED. Strength: relatively low false positives
• Starts with two message corpora -- spam and legitimate
• Splits messages into TOKENs
• Assigns each token a probability, based on how often
it appears in the spam corpus,
e.g. ‘naked’ may have a 67% probability of appearing in
spam vs. ‘regards’ -- 10%
• When a new message arrives, the statistical filter takes the top N tokens
whose probabilities are farthest from the neutral
50% in either direction, applies Bayes’ theorem, and comes up
with a RANKING for the e-mail
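The steps above can be sketched in a few lines of Python. This is a minimal illustration in the spirit of Graham-style filters, not the code of any particular product: the tokenizer, the clamping to [0.01, 0.99], the default of 0.4 for unseen tokens, and the function names are all assumptions made for the example. Both training corpora are assumed to be non-empty.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def train(spam_msgs, ham_msgs):
    """Return a dict mapping token -> estimated probability that it indicates spam."""
    # Count each token at most once per message.
    spam_counts = Counter(t for m in spam_msgs for t in set(tokenize(m)))
    ham_counts = Counter(t for m in ham_msgs for t in set(tokenize(m)))
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        s = spam_counts[tok] / len(spam_msgs)  # share of spam containing tok
        h = ham_counts[tok] / len(ham_msgs)    # share of legitimate mail containing tok
        # Clamp so a single token can never be treated as absolute proof.
        probs[tok] = min(0.99, max(0.01, s / (s + h)))
    return probs


def spam_score(text, probs, top_n=15, unknown=0.4):
    """Combine the top_n most 'interesting' token probabilities into one score."""
    ps = sorted(
        (probs.get(t, unknown) for t in set(tokenize(text))),
        key=lambda p: abs(p - 0.5),  # farthest from the neutral 50% first
        reverse=True,
    )[:top_n]
    # Naive Bayes combination, prod(p) / (prod(p) + prod(1 - p)), in log space.
    log_spam = sum(math.log(p) for p in ps)
    log_ham = sum(math.log(1.0 - p) for p in ps)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))
```

A score near 1 suggests spam and a score near 0 suggests legitimate mail; the cut-off between the two is a policy choice left to the user.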
Heuristic Filtering
• What kind of filters can you come up with JUST
BY LOOKING at a spam e-mail?
• Sender name looks bogus?
• Header fields are missing?
• Lots of html?
• Take all these rules and heuristic observations,
assign weights/points, and put them into a
database
• You’ve got yourself an early version of
SPAMASSASSIN
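As a rough illustration of the weights/points idea, the sketch below scores a message against three rules modelled on the questions above. The rules, their weights, and the threshold are invented for this example and are not SpamAssassin's actual rule set.

```python
import re
from email import message_from_string


def body_text(msg):
    """Return the body of a non-multipart message, or '' otherwise."""
    payload = msg.get_payload()
    return payload if isinstance(payload, str) else ""


def bogus_sender(msg):
    # Hypothetical rule: a long run of digits right before '@' looks machine-generated.
    return re.search(r"\d{4,}@", msg.get("From", "")) is not None


def missing_headers(msg):
    # Hypothetical rule: normal mail clients set all of these headers.
    return any(msg.get(h) is None for h in ("Date", "Message-ID", "To"))


def lots_of_html(msg):
    # Hypothetical rule: more than 0.3 HTML tags per whitespace-separated word.
    body = body_text(msg)
    tags = len(re.findall(r"<[^>]+>", body))
    words = len(body.split())
    return words > 0 and tags / words > 0.3


# The "database" of rules: (rule function, points). All values are made up.
RULES = [(bogus_sender, 1.5), (missing_headers, 2.0), (lots_of_html, 1.0)]
THRESHOLD = 3.0


def heuristic_score(raw_message):
    """Sum the points of every rule that fires on the message."""
    msg = message_from_string(raw_message)
    return sum(points for rule, points in RULES if rule(msg))


def is_spam(raw_message):
    return heuristic_score(raw_message) >= THRESHOLD
```

Keeping the rules in a plain list makes it easy to add, remove, or re-weight them, which is the property the frequent rule updates depend on.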
SpamAssassin
• One way to make it work (let’s say with postfix):
1) perl -MCPAN -e 'install Mail::SpamAssassin'
2) train on a database of spam and legitimate e-mails using
sa-learn (part of SpamAssassin)
3) add a filter program that pipes all incoming mail through
spamc, a part of SpamAssassin:
/usr/bin/spamc | /usr/sbin/sendmail -i "$@"; exit $?
4) spamc adds headers, something like:
X-Spam-Flag: {YES|NO}, X-Spam-Level: ***
5) The headers are caught by a user’s procmail recipe and
mail is classified appropriately
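Step 5 is normally done by a procmail recipe; the Python sketch below only illustrates the same decision, using the two headers shown in step 4. The folder names and function names are hypothetical.

```python
from email import message_from_string


def spam_level(raw_message):
    """Count the asterisks in X-Spam-Level (0 if the header is absent)."""
    msg = message_from_string(raw_message)
    return (msg.get("X-Spam-Level") or "").count("*")


def route_by_spam_flag(raw_message, spam_folder="spam", inbox="inbox"):
    """Pick a destination folder from the X-Spam-Flag header added upstream."""
    msg = message_from_string(raw_message)
    flag = (msg.get("X-Spam-Flag") or "").strip().upper()
    return spam_folder if flag == "YES" else inbox
```

Mail without the header is delivered to the inbox here, on the assumption that an unscanned message should not be discarded.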
Heuristic Filtering Two
• Weakness: the heuristic rules database is public, which makes it
relatively easy for spammers to come up
with a way to bypass the system => the rules
database needs to be updated frequently
• May not be as effective today as other
methods, such as statistical filtering
Challenge-Response Filtering
• Whenever you receive an e-mail from someone
NOT on your whitelist, an automatic reply is sent
telling what steps the sender should take to be
considered for the whitelist (e.g. send you a
confirmation, make a donation, solve a puzzle,
etc.)
• Very effective at stopping spam BUT has a
number of drawbacks: valid mail delayed, kind of
harsh -- some may think of it as inconsiderate and
never reply, extra work for senders etc.
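The whitelist-plus-challenge flow can be sketched as a small state machine. This is an illustrative outline only: the class name, the random token used as the "challenge", and the in-memory storage are assumptions, and sending the automatic reply is left out.

```python
import secrets


class ChallengeResponseFilter:
    """Hold mail from unknown senders until they answer a challenge."""

    def __init__(self, whitelist):
        self.whitelist = {addr.lower() for addr in whitelist}
        self.pending = {}  # token -> (sender, held message)

    def receive(self, sender, message):
        """Deliver mail from whitelisted senders; otherwise hold it and issue a token."""
        sender = sender.lower()
        if sender in self.whitelist:
            return ("deliver", None)
        token = secrets.token_hex(8)  # would be included in the automatic reply
        self.pending[token] = (sender, message)
        return ("challenge", token)

    def confirm(self, token):
        """On a valid response, whitelist the sender and release the held message."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return None  # unknown or already used token
        sender, message = entry
        self.whitelist.add(sender)
        return message
```

The delay between `receive` and `confirm` is exactly the "valid mail delayed" drawback listed above.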
Stats for different approaches
(MessageLabs)
                  MAPS/RBL   Sample/Signature   Statistical   Heuristic and Rule-based
False negatives   40-100%    20%                ~1%*          5%
False positives   10%        2%                 0.1%*         0.5%
* See next slide
Problems with Statistical and other
keyword-dependent methods
• 1) Heavily dependent on effective parsing and the presence of “true”
tokens, e.g. spammers fooling parsers:
Examples:
– White text on a white background:
<font color=white>research data and other statistically strong
keywords that are present in legitimate e-mails</font>
– Splitting words:
ch<!-- valid -->eck this p<!-- news -->orn
– Adding extra characters and spaces to confuse parsers (F*R E-E)
and so forth (javascript, fake html tags, browser-specific tricks)
• 2) Spam may contain too little text and be TOO close to real e-mails in
keywords. This is a more serious problem. I’ll give an example later.
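To make the first two tricks concrete, the sketch below recovers roughly what a reader would see before any tokenizing happens. It is a partial illustration only: it assumes the message is rendered on a white background, handles just comment-splitting and white-font text, and does nothing about character-level obfuscation such as F*R E-E. The regular expressions and function name are choices made for this example.

```python
import re

# HTML comments, which a browser never displays.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Text wrapped in a white <font> tag (invisible on a white background).
HIDDEN_RE = re.compile(
    r"<font\s+color=[\"']?(?:white|#fff(?:fff)?)[\"']?\s*>.*?</font>",
    re.IGNORECASE | re.DOTALL,
)
# Any remaining tag.
TAG_RE = re.compile(r"<[^>]+>")


def visible_text(html):
    """Approximate the text a reader sees, so tokens match what is displayed."""
    html = COMMENT_RE.sub("", html)   # rejoin words split by comments
    html = HIDDEN_RE.sub(" ", html)   # drop white-on-white padding text
    html = TAG_RE.sub(" ", html)      # strip the rest of the markup
    return " ".join(html.split())     # normalize whitespace
```

A tokenizer that runs on the raw HTML instead would see the fragments `ch` and `eck` as separate tokens and would count the hidden keywords as if they were part of the message.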
My research
• Developed and implemented a system for
filtering of unwanted mail using Google
• Can be used WITHOUT training
Classification of current spam
Thoughts
• Some users must click on those ads or else there
would be no spam (somebody IS interested in it
after all)
• There may be more such users in the future as
new regulations appear and spam becomes less of
an annoyance and more of an ad
• Some users may like to receive SPAM-looking
messages, for instance, marketing reports, offers,
etc., that look very much like spam
Two main observations I use
• Spam is USER-SPECIFIC
• Most spammers expect users to TAKE some
ACTION upon reading spam; in other
words, there has to be a FEEDBACK
mechanism
Targeting the feedback
mechanism
• How effective would spam be without an
easy feedback mechanism?
URLs as a feedback mechanism
• Of ~1800 spam messages in the classical spam
corpora I have analyzed, ~95% of messages
contained URLs
• Of the remaining 5%, approximately 1/2 seemed
to be damaged submissions (i.e. MIME conversion
and other types of errors), the rest consisted of two
types of letters:
– Messages with 1-800 numbers and faxes
(including Nigerian scam)
– Religious letters
Basic Approach: URLSP
• The basic approach was to extract URLs,
apply a user-specific whitelist based on a
user’s mailbox (masks such as .edu,
cnn.com etc.) and classify everything else
as spam
• The first version I implemented has been in
use at Tech since December 2002
• Has actually been working quite well
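A minimal sketch of the basic approach follows. It is not the URLSP implementation: the URL regular expression, the mask-matching rule, the result labels, and the decision to treat a message as acceptable only when every URL is whitelisted are assumptions made for illustration. Messages without any URL are reported separately, since the slides do not say how they were handled.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"""(?:https?://|www\.)[^\s<>"']+""", re.IGNORECASE)


def extract_hosts(text):
    """Return the lowercase host name of every URL found in the text."""
    hosts = []
    for url in URL_RE.findall(text):
        url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if not url.lower().startswith(("http://", "https://")):
            url = "http://" + url    # bare www.* links need a scheme to parse
        try:
            host = urlparse(url).hostname
        except ValueError:           # malformed URL, e.g. a broken IPv6 literal
            continue
        if host:
            hosts.append(host.lower())
    return hosts


def host_matches(host, mask):
    """'.edu' matches any host ending in .edu; 'cnn.com' matches it and its subdomains."""
    mask = mask.lower()
    if mask.startswith("."):
        return host.endswith(mask)
    return host == mask or host.endswith("." + mask)


def classify_urlsp(text, whitelist_masks):
    """Return 'ok', 'spam', or 'no-urls' (labels are hypothetical)."""
    hosts = extract_hosts(text)
    if not hosts:
        return "no-urls"
    if all(any(host_matches(h, m) for m in whitelist_masks) for h in hosts):
        return "ok"
    return "spam"
```

With the masks `.edu` and `cnn.com`, a message linking only to a university page and a CNN story is accepted, while a single link to any other host marks it as spam, which is where the granularity and false-positive concerns come from.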
Effective but rather naive
• First version effective but rather naive
• Granularity and false positives can be a
problem
Next version: Classifying URLs
• CLASSIFY URLs using Google and Open
Directory
• Use whitelists/blacklists of categories and URLs
BASED on user mailbox and individual
preferences
DMOZ/ODP
Example
• Based on files automatically generated from
your mailbox, configure the system as
follows (the blacklist* files are omitted):
whitelist.url:
.edu, .mil, .gov, www.nmap.com, www.epic.org,
www.cypherpunks.to etc.
whitelist.cat:
Top/Computers/Security/Anti_Virus/Products
Top/Computers/Security/Products_and_Tools/Cryptography/PGP
Top/Computers/Security/Products_and_Tools/Password_Tools
...
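One way such a configuration could drive a decision is sketched below. The directory lookup is replaced by a small hard-coded table, because the actual Google/Open Directory query is not shown in the slides; the example hosts, the blacklist entry, the prefix-matching rule, and the result labels are all hypothetical.

```python
# Hypothetical stand-in for a Google/Open Directory category lookup.
DIRECTORY = {
    "www.example-av.com": "Top/Computers/Security/Anti_Virus/Products",
    "www.example-loans.com": "Top/Business/Financial_Services/Mortgages",
}

WHITELIST_URL = [".edu", ".mil", ".gov", "www.nmap.com", "www.epic.org",
                 "www.cypherpunks.to"]
WHITELIST_CAT = [
    "Top/Computers/Security/Anti_Virus/Products",
    "Top/Computers/Security/Products_and_Tools/Cryptography/PGP",
    "Top/Computers/Security/Products_and_Tools/Password_Tools",
]
# The blacklist files are omitted in the slides; this entry is invented.
BLACKLIST_CAT = ["Top/Business/Financial_Services/Mortgages"]


def host_matches(host, mask):
    """'.edu' matches any host ending in .edu; other masks match the host or its subdomains."""
    if mask.startswith("."):
        return host.endswith(mask)
    return host == mask or host.endswith("." + mask)


def category_matches(category, prefix):
    """A listed category also covers its subcategories (an assumption)."""
    return category == prefix or category.startswith(prefix + "/")


def classify_host(host, lookup=DIRECTORY.get):
    """Return 'whitelisted', 'blacklisted', or 'unknown' for one URL host."""
    host = host.lower()
    if any(host_matches(host, m) for m in WHITELIST_URL):
        return "whitelisted"
    category = lookup(host)
    if category is None:
        return "unknown"  # the directory has no entry for this host
    if any(category_matches(category, c) for c in WHITELIST_CAT):
        return "whitelisted"
    if any(category_matches(category, c) for c in BLACKLIST_CAT):
        return "blacklisted"
    return "unknown"
```

Passing the lookup as a parameter keeps the decision logic testable without network access; a real deployment would substitute a function that queries the directory.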
URL Classifier: Categories
Extracted from SPAM
• Examples of categories of URLs extracted from spam:
Top/Business/Consumer_Goods_and_Services/Beauty/Cosmetics
Top/Business/Employment/Careers
Top/Business/Financial_Services/Mortgages
Top/Business/Investing/Day_Trading/Brokerages
Top/Business/Investing/Day_Trading/Education_and_Training
Top/Business/Investing/News_and_Media/Newsletters/Stocks_and_Bonds
Top/Business/Marketing_and_Advertising/Direct_Marketing/Mailing_Lists/MLM
Top/Regional/North_America/Canada/Business_and_Economy/Employment/Job_Search
Top/Shopping/Gifts/Personalized
Top/Shopping/Home_and_Garden/Kitchen_and_Dining/Appliances/Parts
...
GTUC v1.0 (Basic)
• Register for a free account on a CoC-based
filtering server
• Forward your mail to the server
• The mail will be automatically classified
into three folders as it arrives
– Inbox, Unknown, spam-can
• Read your mail with IMAP
Spam of the future
• Innovative feedback mechanisms
• Appearance as close to legitimate e-mails as
possible, e.g.
>>>
From: rcarlos@legitimate.com
Hi, here is an interesting article. You should check it
out -- net::“terminator_25”
Roberto Carlos
Solution
• Current best: a combination of approaches
• Categorization and URL-based filtering can
help
• Uncategorized URLs? Similarity + retrieval
of html and categorization with token
stats/heuristics