Filtering Spam With Justin Mason, SpamAssassin Project & Deersoft
http://SpamAssassin.org/

What Is Spam?
• Best description: "Unsolicited Bulk Email"
• In human terms: bulk e-mail you didn't want, and didn't ask for
• Mailing lists, newsletters, "latest offers": not spam, if you asked for them in the first place
• Name courtesy of Monty Python: “spam, spam, spam and spam”

Why Bother Filtering Spam?
• Seems to be about 30% to 60% of mail traffic, and increasing
• Users are forced to waste time wading through their inbox
  – costs their employers money
• Impossible to unsubscribe
  – “unsubscribe” addresses work only 37% of the time, according to the FTC
• Legal retaliation not possible, yet
• Just plain irritating!

Spam Volume Is Increasing
(chart; data from Brightmail.com)

Filtering: Homebrew Blacklists
• First round of "spam filters": internal blacklists, maintained by in-house admin staff
• Match addresses, and delete those from known spammers
• Later, match "bad words" (Viagra, porn)
• Quite hard to configure; centralised; lots of work to keep up to date

Filtering: DNS Blacklists
• Identify spam source computers by IP address
• Allow mail system to look up a public database on the internet as mail arrives
• Block the message, if its sender's address is blacklisted
• Now at least 20 DNS blacklists, with varying reliability
• Many false positives
  – eircom.net's main mail server!
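
A DNS blacklist lookup can be sketched roughly as follows. By convention, the sending host's IPv4 octets are reversed and prepended to the blacklist's zone name, and any answer to the query means "listed". The zone `dnsbl.example.org` and the function names are placeholders chosen for illustration, not a real blacklist or any particular mail system's API.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a blacklist zone."""
    # Validate the address, then reverse its octets: 192.0.2.99 -> 99.2.0.192
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone answers for this address (i.e. it is listed)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        # No such name, or the lookup failed: treat as "not listed"
        return False


# Name that would be looked up for a connecting host (hypothetical zone)
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Treating a failed lookup as "not listed" is a deliberate choice in this sketch: an unreachable blacklist should not cause mail to be blocked.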

SpamAssassin Concepts
• Zero-configuration where possible
• Lots of rules to determine if a mail is spam or not
  – "Fuzzy logic": rules are assigned scores, based on our confidence in their accuracy
  – These are combined to produce an overall score for each message
  – If over a user-defined threshold, the mail is judged as spam
• No one rule, alone, can mark a mail as spam

SpamAssassin Concepts, pt.2
• Combines many systems for a "broad-spectrum" approach:
  – Detect forged headers
  – Spam-tool signatures in headers
  – Text keyword scanner in the message body
  – DNS blacklists
  – Razor, DCC (Distributed Checksum Clearinghouse), Pyzor
• Spammers cannot aim to defeat just one system; the others will catch them out

Integration Into Mail Systems
• Wrote SpamAssassin with flexibility of integration in mind
• Many integrations have been written:
  – Integration into Mail Transfer Agents (sendmail, qmail, Exim, Postfix, Microsoft Exchange)
  – Integration into virus-scanner MTA plug-ins (MIMEDefang, amavisd-new)
  – IMAP/POP proxies and clients
  – Commercial plug-ins for Windows clients (Eudora, MS Outlook)
• And many more I don't know about!
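
The scoring scheme from the "SpamAssassin Concepts" slides can be illustrated with a minimal sketch. This is not SpamAssassin's code: the rule names, scores, the `X-Example-DNSBL` header, and the threshold of 5.0 are all made up for the example. Each matching rule contributes its score, and the message is judged spam only if the total is over the threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rule:
    name: str
    score: float                    # confidence-based weight (illustrative values)
    matches: Callable[[str], bool]  # test applied to the raw message text


# Hypothetical rules; no single score reaches the threshold on its own
RULES = [
    Rule("MENTIONS_VIAGRA", 2.0, lambda msg: "viagra" in msg.lower()),
    Rule("CLICK_HERE", 1.5, lambda msg: "click here" in msg.lower()),
    Rule("MANY_EXCLAMATIONS", 1.0, lambda msg: msg.count("!") >= 3),
    Rule("LISTED_ON_DNSBL", 2.5, lambda msg: "X-Example-DNSBL: listed" in msg),
]


def classify(msg: str, threshold: float = 5.0) -> Tuple[bool, float, List[str]]:
    """Sum the scores of all matching rules and compare with the threshold."""
    hits = [rule for rule in RULES if rule.matches(msg)]
    total = sum(rule.score for rule in hits)
    return total > threshold, total, [rule.name for rule in hits]


spam = "X-Example-DNSBL: listed\nSubject: offer\n\nCheap viagra!!! Click here now"
ham = "Subject: lunch\n\nClick here for the menu"
print(classify(spam))
print(classify(ham))
```

Because the largest single score here is below the threshold, one over-eager rule cannot mark a message as spam by itself; several independent signals have to agree.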

Accuracy and False Positives
• The big issue with filtering to date:
  – not just “how much spam does it catch?”
  – but “how many legitimate mails get caught, too?”
• Many systems do not pay attention to this problem
  – Some blacklists even use "false positives" as a weapon against service providers selling to spammers
• False positives (FPs) are much worse than spam getting through
  – much more inconvenient to the user

Evolving a Better Filter
• SpamAssassin assigns scores using a genetic algorithm
  – Given a big collection of human-classified mail, determine what tests each mail triggers
  – Use this to "evolve" an efficient score set
  – Exactly the kind of problem a genetic algorithm is good at
  – Allows "shotgun" rules to be scored low, where they cannot do damage

False Positive Rate
• SpamAssassin is 98.5% accurate on our test corpora, with default settings
  – 0.6% false positives
  – 91% of all spam caught correctly
  – with network tests on, spam hit-rate probably increases to about 93-95%
• Highest rate available among present tools
• Tunable by the user: raise the threshold to reduce FPs, or lower it to catch more spam

Effect of the Threshold Setting
(chart)

What To Do When You've Caught It
• Since classifiers are imperfect, blind deletion is bad
• Better to mark the mails, and allow the user to check over them infrequently
• Also good to mark for legal reasons
  – In the UK, it may be illegal to hold mail (even spam) for more than 3 days

Features For Large-Scale Use: "spamd"
• Client-server interface to SpamAssassin
• Pre-loads, so much faster for high volumes
• Can load user preferences from an SQL database
• Can be load-balanced, since it uses TCP/IP
• Deployed at several large organisations and ISPs: The Well, Salon.com, Panix, Transmeta, SourceForge, Stanford

Large-Scale Filtering For Your Network
• Different from filtering for yourself
• Many users get little spam
• Should use conservative settings
• Better to use “opt-out by default”
  – notify that spam filtering is available, and ask them if they want it

How Can Network Administrators Fight Spam?
• Scan for open relays and proxies on your network
• Block proxy ports at the firewall
• Audit web servers for “FormMail” or other insecure web-to-mail scripts
• Spam traps reporting to network blacklists: Razor, DCC, Pyzor
• Run SpamAssassin, or SpamAssassin Pro!

How Do The Spammers Feel?
• Already hurting, according to CBS:
  – “[I’ve gone through] unbelievable hardships [to keep spamming] ... My operating costs have gone up 1,000% this year, just so I can figure out how to get around all these filters”
• Spam relies on low overheads and extremely cheap delivery
• Disrupt the equation and they will give up!

Future Directions
• Learning filters (Bayesian probability etc.)
  – Learn automatically, to detect what "good" mail to your network looks like
• "Hash-cash"
  – Sending mail is currently more-or-less free
  – With hash-cash, the sender must spend CPU time for each recipient
  – SpamAssassin can provide "bonus points" for hash-cash users

Fin
• http://spamassassin.org/
  – SpamAssassin for UNIX
  – (free software)
• http://www.deersoft.com/
  – SpamAssassin Pro: MS Outlook, Exchange
  – (commercial version)
  – (my employers!)
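
As a rough illustration of the hash-cash idea from "Future Directions": the sender searches for a counter that makes the SHA-1 hash of a per-recipient string start with a given number of zero bits, which costs CPU time, while the recipient checks it with a single hash. The `recipient:counter` stamp layout and the 12-bit difficulty are simplifications assumed for this sketch; the real Hashcash stamp format carries more fields.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(recipient: str, bits: int = 12) -> str:
    """Sender side: try counters until the stamp's hash has enough zero bits."""
    for counter in count():
        stamp = f"{recipient}:{counter}"  # simplified, hypothetical stamp layout
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp: str, recipient: str, bits: int = 12) -> bool:
    """Recipient side: one hash confirms the work was done for this address."""
    digest = hashlib.sha1(stamp.encode()).digest()
    return stamp.startswith(recipient + ":") and leading_zero_bits(digest) >= bits


stamp = mint("user@example.org")
print(stamp, verify(stamp, "user@example.org"))
```

Each extra bit of difficulty doubles the sender's expected work but leaves verification at one hash, which is why the cost falls on bulk senders rather than on recipients.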