Classifying and Filtering Spam Using Search Engines
Oleg Kolesnikov
College of Computing, Georgia Tech
9/2003

>50% of all e-mail today is spam?
Source: brightmail.com

Scale
• IDC: of 31bn messages sent each day, 18%, or 5.6bn, were s[pc]am messages
• Brightmail decoy network stats: 6.7bn spam messages sent in March 2003, varying from 100 to ~100,000 identical e-mails sent at a time

Current techniques to deal with SPAM/UCE:
• Blacklisting
• Signature-based Filtering
• Statistical/Bayesian Filtering
• Heuristic Filtering
• Challenge-Response Filtering
• Sender-pays
• Laws

Blacklisting
• MAPS (Mail Abuse Prevention System) RBL catches only 24% of spam with 34% false positives (the spam police article, gaudi/gaspar)
• Self-appointed sheriffs/vigilantes; legitimate businesses increasingly caught in the crossfire, e.g. iBill was losing $100k/day during each of the four days of blacklisting
• Only a first cut at the problem; never blacklists more than 50% of the servers sending spam (Graham)

Sample and Signature-based Filtering
• Set up a network of DECOY e-mail addresses. Any message sent to these addresses must be spam => if the same message is sent to a protected address, that message must be SPAM too (that’s what Brightmail does)
• Not very flexible -- spammers take the lead in coming up with tricks
• Typical trick: make each spam message different

Brightmail (used by MS/Hotmail, Earthlink, Verizon, eBay, etc.)

Basic Statistical Filtering
• Weakness (W): must be TRAINED. Strength (S): relatively low false positives
• Starts with two message corpora -- spam and legitimate
• Splits messages into TOKENs
• Assigns each token a probability, based on the probability of its appearance in the spam corpus, e.g. ‘naked’ may have a 67% probability of appearing in spam, say, vs.
‘regards’ -- 10%
• When a new message arrives, the stat filter takes the top N tokens whose probabilities are farthest from the middle (50%) in either direction, applies Bayes’ theorem, and comes up with a RANKING for the e-mail

Heuristic Filtering
• What kind of filters can you come up with JUST BY LOOKING at a spam e-mail?
• Sender name looks bogus?
• Header fields are missing?
• Lots of HTML?
• Take all these rules and heuristic observations, assign weights/points, and put them into a database
• You’ve got yourself an early version of SPAMASSASSIN

SpamAssassin
• The way you can make it work (let’s say with postfix):
  1) perl -MCPAN -e 'install Mail::SpamAssassin'
  2) learn on a database of spam and legitimate e-mails using sa-learn (part of SpamAssassin)
  3) add a filter program to filter all incoming mail through spamc, a part of SpamAssassin:
     /usr/bin/spamc | /usr/sbin/sendmail -i "$@"; exit $?
  4) spamc adds headers, something like: X-Spam-Flag: {YES|NO}, X-Spam-Level: ***
  5) the headers are caught by a user’s procmail recipe and mail is classified appropriately

Heuristic Filtering Two
• W (weakness): the heuristic rules database is public, which makes it relatively easy for spammers to come up with ways to bypass the system => the rules database needs to be updated frequently
• May not be as effective today as other methods, such as stat filtering

Challenge-Response Filtering
• Whenever you receive an e-mail from someone NOT on your whitelist, an automatic reply is sent telling the sender what steps to take to be considered for the whitelist (e.g. send you a confirmation, make a donation, solve a puzzle, etc.)
• Very effective at stopping spam BUT has a number of drawbacks: valid mail is delayed; kind of harsh -- some may think of it as inconsiderate and never reply; extra work for senders; etc.
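
To make the statistical filtering slide more concrete, here is a minimal sketch of the "top N tokens farthest from 50%, then Bayes" step. It is an illustration under assumptions, not the method of any particular filter: the function name, the default N of 15, the 0.4 default for unseen tokens, and the clamping bounds are all hypothetical choices, and the combination rule is the common naive-Bayes product form. Only the 67% and 10% token probabilities come from the slide.

```python
from math import prod

def spam_score(tokens, token_probs, top_n=15, unknown=0.4):
    """Combine per-token spam probabilities into one score in [0, 1].

    token_probs maps token -> probability that a message containing
    it is spam (assumed to come from training on two corpora).
    """
    # Look up each distinct token; unseen tokens get an assumed default.
    probs = [token_probs.get(t, unknown) for t in set(tokens)]
    # Clamp so that a single 0.0 or 1.0 cannot zero out both products.
    probs = [min(0.99, max(0.01, p)) for p in probs]
    # Keep the N tokens whose probability is farthest from 0.5 either way.
    probs = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    spam = prod(probs)                  # evidence for spam
    ham = prod(1.0 - p for p in probs)  # evidence for legitimate mail
    return spam / (spam + ham)

# Token probabilities from the slide's example.
probs = {"naked": 0.67, "regards": 0.10}
score = spam_score("naked regards".split(), probs)
```

A message with no tokens at all scores exactly 0.5 here, so a caller would still need to choose a threshold above which the e-mail is ranked as spam.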

Stats for different approaches (MessageLabs)

                   MAPS/RBL   Sample/Signature   Statistical   Heuristic and Rule-based
False negatives    40-100%    20%                ~1%*          5%
False positives    10%        2%                 0.1%*         0.5%

* See next slide

Problems with Statistical and other keyword-dependent methods
• 1) Heavily dependent on effective parsing and the presence of “true” tokens, e.g. spammers fooling parsers. Examples:
  – White background: <font color=white>research data and other statistically strong keywords that are present in legitimate e-mails</font>
  – Splitting words: check this porn
  – Adding extra characters and spaces to confuse parsers (F*R E-E) and so forth (javascript, fake html tags, browser-specific tricks)
• 2) Spam may contain too little text and be TOO close to real e-mails in keywords. This is a more serious problem. I’ll give an example later.

My research
• Developed and implemented a system for filtering unwanted mail using Google
• Can be used WITHOUT training

Classification of current spam

Thoughts
• Some users must click on those ads, or else there would be no spam (somebody IS interested in it after all)
• There may be more such users in the future as new regulations appear and spam becomes less of an annoyance and more of an ad
• Some users may like to receive SPAM-looking messages, for instance marketing reports, offers, etc., that look very much like spam

Two main observations I use
• Spam is USER-SPECIFIC
• Most spammers expect users to TAKE some ACTION upon reading spam; in other words, there has to be a FEEDBACK mechanism

Targeting the feedback mechanism
• How effective would spam be without an easy feedback mechanism?

URLs as a feedback mechanism
• Of ~1800 spam messages in the classical spam corpora I have analyzed, ~95% contained URLs
• Of the remaining 5%, approximately 1/2 seemed to be damaged submissions (i.e.
MIME conversion and other types of errors); the rest consisted of two types of letters:
  – Messages with 1-800 numbers and faxes (including Nigerian scam)
  – Religious letters

Basic Approach: URLSP
• The basic approach was to extract URLs, apply a user-specific whitelist based on the user’s mailbox (masks such as .edu, cnn.com, etc.), and classify everything else as spam
• The first version I implemented has been in use at Tech since December ’02
• Has actually been working quite well

Effective but rather naive
• First version effective but rather naive
• Granularity and false positives can be a problem

Next version: Classifying URLs
• CLASSIFY URLs using Google and Open Directory
• Use whitelists/blacklists of categories and URLs BASED on user mailbox and individual preferences

DMOZ/ODP

Example
• Based on files automatically generated from your mailbox, configure the system as follows (blacklist* files are omitted):
  whitelist.url:
    .edu, .mil, .gov, www.nmap.com, www.epic.org, www.cypherpunks.to etc.
  whitelist.cat:
    Top/Computers/Security/Anti_Virus/Products
    Top/Computers/Security/Products_and_Tools/Cryptography/PGP
    Top/Computers/Security/Products_and_Tools/Password_Tools
    ...

URL Classifier: Categories Extracted from SPAM
• Examples of categories of URLs extracted from spam:
  Top/Business/Consumer_Goods_and_Services/Beauty/Cosmetics
  Top/Business/Employment/Careers
  Top/Business/Financial_Services/Mortgages
  Top/Business/Investing/Day_Trading/Brokerages
  Top/Business/Investing/Day_Trading/Education_and_Training
  Top/Business/Investing/News_and_Media/Newsletters/Stocks_and_Bonds
  Top/Business/Marketing_and_Advertising/Direct_Marketing/Mailing_Lists/MLM
  Top/Regional/North_America/Canada/Business_and_Economy/Employment/Job_Search
  Top/Shopping/Gifts/Personalized
  Top/Shopping/Home_and_Garden/Kitchen_and_Dining/Appliances/Parts
  ...
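
The URL-based approach above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: the `CATEGORY_OF` table is a hypothetical stand-in for a Google / Open Directory lookup, its `example` hosts are made up, and the decision policy (any URL in a non-whitelisted category means spam; an uncategorized URL or a message without URLs means "unknown") is an assumed reading of the slides. The whitelist entries are taken from the example slide.

```python
import re

# Captures only the host part of http(s) URLs.
HOST_RE = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)

# Entries from the example slide (whitelist.url / whitelist.cat).
WHITELIST_URL = [".edu", ".mil", ".gov", "www.nmap.com", "www.epic.org"]
WHITELIST_CAT = ["Top/Computers/Security/Products_and_Tools/Cryptography/PGP"]

# Hypothetical stand-in for a directory lookup: host -> category.
CATEGORY_OF = {
    "pgp.example.org": "Top/Computers/Security/Products_and_Tools/Cryptography/PGP",
    "mortgage.example.com": "Top/Business/Financial_Services/Mortgages",
}

def host_whitelisted(host):
    # A mask starting with "." matches a domain suffix; others match exactly.
    return any(host.endswith(m) if m.startswith(".") else host == m
               for m in WHITELIST_URL)

def classify(message):
    """Return 'legitimate', 'spam', or 'unknown' for a message body."""
    hosts = {h.lower().rstrip(".") for h in HOST_RE.findall(message)}
    if not hosts:
        return "unknown"  # assumption: nothing to judge the message by
    result = "legitimate"
    for host in hosts:
        if host_whitelisted(host):
            continue
        category = CATEGORY_OF.get(host)
        if category is None:
            result = "unknown"  # uncategorized URL: cannot decide yet
        elif not any(category.startswith(w) for w in WHITELIST_CAT):
            return "spam"       # categorized, but not a whitelisted category
    return result
```

For example, `classify("Refinance now: http://mortgage.example.com/apply")` returns `"spam"` with the tables above, because the host maps to a category that is not on the category whitelist.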

GTUC v1.0 (Basic)
• Register for a free account on a CoC-based filtering server
• Forward your mail to the server
• The mail will be automatically classified into three folders as it arrives:
  – Inbox, Unknown, spam-can
• Read your mail with IMAP

Spam of the future
• Innovative feedback mechanisms
• Appearance as close to legitimate e-mails as possible, e.g.
  >>>
  From: rcarlos@legitimate.com
  Hi, here is an interesting article. You should check it out -- net::“terminator_25”
  Roberto Carlos

Solution
• Current best: a combination of approaches
• Categorization and URL-based filtering can help
• Uncategorized URLs? Similarity + retrieval of HTML and categorization with token stats/heuristics
