HEANet_2002

Filtering Spam With SpamAssassin
Justin Mason, SpamAssassin Project & Deersoft
http://SpamAssassin.org/
What Is Spam?
• Best description: "Unsolicited Bulk Email"
• In human terms: bulk e-mail you didn't
want, and didn't ask for
• Mailing lists, newsletters, "latest offers":
not spam, if you asked for them in the
first place
• Name courtesy of Monty Python: “spam,
spam, spam and spam”
Why Bother Filtering Spam?
• Seems to be about 30% to 60% of mail
traffic, and increasing
• Users are forced to waste time wading
through their inbox
– costs their employers money
• Impossible to unsubscribe
– “unsubscribe” addresses work only 37% of
the time, according to the FTC
• Legal retaliation not possible, yet
• Just plain irritating!
Spam Volume Is Increasing
(data from Brightmail.com)
Filtering: Homebrew Blacklists
• First round of "spam filters": internal
blacklists, maintained by in-house admin
staff
• Match addresses, and delete those from
known spammers
• Later, match "bad words" (Viagra, porn)
• Quite hard to configure; centralised; lots
of work to keep up to date
Filtering: DNS Blacklists
• Identify spam source computers by IP
address
• Allow mail system to look up a public
database on the internet as mail arrives
• Block the message, if its sender's
address is blacklisted
• Now at least 20 DNS blacklists, with
varying reliability
• Many false positives
– eircom.net's main mail server!
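A minimal sketch of the lookup described above, assuming the usual DNS blacklist convention: reverse the octets of the sender's IPv4 address, append the blacklist's zone, and treat any answer as "listed". The zone `dnsbl.example.org` is a placeholder, not a real blacklist.

```python
import ipaddress
import socket

def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets + blacklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist zone answers for this address."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name: the address is not on this list.
        return False

# Placeholder zone; substitute the blacklist you actually trust.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Because blacklists vary in reliability, a hit is probably better treated as one signal among several than as grounds for blocking outright.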
SpamAssassin Concepts
• Zero-configuration where possible
• Lots of rules to determine if a mail is
spam or not
– "Fuzzy logic": rules are assigned scores,
based on our confidence in their accuracy
– These are combined to produce an overall
score for each message
– If over a user-defined threshold, the mail
is judged as spam
• No one rule, alone, can mark a mail as
spam
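The scoring idea can be sketched in a few lines. This is an illustration only: the rule names, tests, scores, and the threshold of 5.0 are invented for the example and are not SpamAssassin's real rule set.

```python
# Hypothetical rules: (name, test, score). Higher score = more confidence.
RULES = [
    ("MENTIONS_VIAGRA", lambda m: "viagra" in m.lower(), 2.5),
    ("CLICK_HERE", lambda m: "click here" in m.lower(), 1.5),
    ("MANY_EXCLAMATIONS", lambda m: m.count("!") >= 3, 1.0),
    ("SAYS_FREE", lambda m: "free" in m.lower(), 0.8),
]
THRESHOLD = 5.0  # user-defined; no single rule above reaches it alone

def score_message(text):
    """Return (total score, names of the rules that fired)."""
    hits = [(name, score) for name, test, score in RULES if test(text)]
    return sum(score for _, score in hits), [name for name, _ in hits]

def is_spam(text, threshold=THRESHOLD):
    """A mail is judged spam once its combined score reaches the threshold."""
    total, _ = score_message(text)
    return total >= threshold

print(score_message("FREE Viagra!!! Click here now"))
print(score_message("Lunch is free on Friday"))
```

The second message trips one rule but stays far below the threshold, which is the point of combining many weak signals.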
SpamAssassin Concepts, pt.2
• Combines many systems for a "broad-spectrum" approach:
– Detect forged headers
– Spam-tool signatures in headers
– Text keyword scanner in the message body
– DNS blacklists
– Razor, DCC (Distributed Checksum
Clearinghouse), Pyzor
• Spammers cannot aim to defeat just one system;
the others will catch them out
Integration Into Mail Systems
• Wrote SpamAssassin with flexibility of
integration in mind
• Many integrations have been written:
– Integration into Mail Transfer Agents
(sendmail, qmail, Exim, Postfix, Microsoft
Exchange)
– Integration into virus-scanner MTA plug-ins
(MIMEDefang, amavisd-new)
– IMAP/POP proxies and clients
– Commercial plug-ins for Windows clients
(Eudora, MS Outlook)
• And many more I don't know about!
Accuracy and False Positives
• The big issue with filtering to date:
– not just “how much spam does it catch?”
– but “how many legitimate mails get caught,
too?”
• Many systems do not pay attention to this
problem
– Some blacklists even use "false positives" as a
weapon against service providers selling to
spammers
• FPs are much worse than spam getting
through
– much more inconvenient to user
Evolving a Better Filter
• SpamAssassin assigns scores using a
genetic algorithm
– Given a big collection of human-classified
mail, determine what tests each mail
triggers
– Use this to "evolve" an efficient score set
– Exactly the kind of problem a genetic
algorithm is good at
– Allows "shotgun" rules to be scored low,
where they cannot do damage
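A toy version of the idea, to show its shape rather than the project's actual tooling. The rule names, the eight-mail corpus, the cost weights, and the GA parameters are all made up for the sketch.

```python
import random

RULES = ["FORGED_HEADER", "BAD_WORD", "DNSBL_HIT", "SAYS_FREE"]
THRESHOLD = 5.0

# Tiny made-up corpus: (indices of the rules a mail triggered, is it spam?)
CORPUS = [
    ({0, 1, 2}, True), ({0, 2}, True), ({1, 2, 3}, True), ({0, 1}, True),
    ({3}, False), (set(), False), ({1}, False), ({3}, False),
]

def cost(scores):
    """Lower is better; a false positive costs more than a missed spam."""
    total = 0.0
    for hits, is_spam in CORPUS:
        flagged = sum(scores[i] for i in hits) >= THRESHOLD
        if flagged and not is_spam:
            total += 5.0
        elif is_spam and not flagged:
            total += 1.0
    return total

def evolve(generations=200, pop_size=30, seed=1):
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 5) for _ in RULES] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[: pop_size // 2]  # the fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(len(child))
            child[i] = max(0.0, child[i] + rng.gauss(0, 0.5))  # mutate one score
            children.append(child)
        pop = parents + children
    return min(pop, key=cost)

best = evolve()
print(dict(zip(RULES, (round(s, 2) for s in best))), "cost:", cost(best))
```

Weighting false positives more heavily in `cost` is what pushes a "shotgun" rule such as the hypothetical `SAYS_FREE`, which also fires on legitimate mail, below the threshold.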
False Positive Rate
• SpamAssassin is 98.5% accurate on our
test corpora, with default settings
– 0.6% false positives
– 91% of all spam caught correctly
– with network tests on, spam hit-rate
probably increases to about 93-95%
• Highest rate available among present
tools
• Tunable by the user: raise the threshold to
reduce FPs, or lower it to catch more spam
Effect of the Threshold Setting
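The trade-off can be illustrated with a small calculation. The scores below are invented stand-ins for a hand-classified corpus, not the project's measurements.

```python
# Made-up (score, is_spam) pairs standing in for a hand-classified corpus.
SCORED = [
    (0.3, False), (1.2, False), (2.8, False), (4.1, False), (5.6, False),
    (3.9, True), (5.2, True), (7.5, True), (9.8, True), (14.0, True),
]

def rates(threshold):
    """Return (false-positive rate, spam hit rate) at a given threshold."""
    ham = [s for s, is_spam in SCORED if not is_spam]
    spam = [s for s, is_spam in SCORED if is_spam]
    fp_rate = sum(s >= threshold for s in ham) / len(ham)
    hit_rate = sum(s >= threshold for s in spam) / len(spam)
    return fp_rate, hit_rate

for t in (3.0, 5.0, 8.0):
    fp, hit = rates(t)
    print(f"threshold {t:>4}: false positives {fp:.0%}, spam caught {hit:.0%}")
```

Raising the threshold lowers both numbers together, so the setting is a choice about which kind of error the user would rather live with.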
What To Do When You've
Caught It
• Since classifiers are imperfect, blind
deletion is bad
• Better to mark the mails, and allow user
to check over them infrequently
• Also good to mark for legal reasons
– In the UK, it may be illegal to hold mail
(even spam) for more than 3 days
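Marking rather than deleting might look like the sketch below, which uses Python's standard `email` package. `X-Spam-Flag` and `X-Spam-Status` are header names SpamAssassin is known for; the exact value format here is illustrative and `mark_as_spam` is a hypothetical helper.

```python
from email import message_from_string

def mark_as_spam(raw: str, score: float, threshold: float) -> str:
    """Tag a message as spam in its headers; the body is left untouched."""
    msg = message_from_string(raw)
    # Drop any copies a sender may have forged before adding our own.
    del msg["X-Spam-Flag"]
    del msg["X-Spam-Status"]
    msg["X-Spam-Flag"] = "YES"
    msg["X-Spam-Status"] = f"Yes, score={score:.1f} required={threshold:.1f}"
    return msg.as_string()

raw = "From: a@example.com\nSubject: hi\n\nbody\n"
print(mark_as_spam(raw, 7.2, 5.0))
```

The mail client can then file anything carrying the flag into a separate folder for the user to review.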
Features For Large-Scale Use:
"spamd"
• Client-server interface to SpamAssassin
• Pre-loads, so much faster for high
volumes
• Can load user preferences from an SQL
database
• Can load-balance -- uses TCP/IP
• Deployed at several large organisations
and ISPs: The Well, Salon.com, Panix,
Transmeta, SourceForge, Stanford
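To give a feel for the client-server exchange, here is a sketch that builds a request and parses a reply without opening a socket. The wire format (`CHECK SPAMC/1.2`, `Content-length`, and a `Spam: True ; score / threshold` reply line) is written from memory of spamd's protocol notes, so treat it as an assumption and check it against the version you run. Both function names are hypothetical.

```python
import re

def build_check_request(message: bytes, user: str = "") -> bytes:
    """Assumed request shape: command line, headers, blank line, message."""
    lines = ["CHECK SPAMC/1.2"]
    if user:
        lines.append(f"User: {user}")  # lets spamd pick per-user preferences
    lines.append(f"Content-length: {len(message)}")
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + message

def parse_check_response(response: bytes):
    """Return (is_spam, score, threshold) from an assumed spamd reply."""
    text = response.decode("ascii", "replace")
    status = text.split("\r\n", 1)[0]
    if not status.startswith("SPAMD/") or "EX_OK" not in status:
        raise ValueError(f"unexpected status line: {status!r}")
    match = re.search(r"^Spam: (True|False) ; (-?[\d.]+) / (-?[\d.]+)", text, re.M)
    if match is None:
        raise ValueError("no Spam: line in response")
    return match.group(1) == "True", float(match.group(2)), float(match.group(3))

print(build_check_request(b"Subject: hi\r\n\r\nhello\r\n", user="alice"))
print(parse_check_response(b"SPAMD/1.1 0 EX_OK\r\nSpam: True ; 15.0 / 5.0\r\n\r\n"))
```

Because the exchange is plain TCP, the same request can be sent to any of several spamd hosts, which is what makes load-balancing straightforward.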
Large-Scale Filtering For Your
Network
• Different from filtering for yourself
• Many users get little spam
• Should use conservative settings
• Better to use “opt-out by default”
– notify that spam filtering is available, and
ask them if they want it
How Can Network
Administrators Fight Spam?
• Scan for Open Relays & Proxies on your
network
• Block proxy ports at the firewall
• Audit web servers for “FormMail” or other
insecure web-to-mail scripts
• Spam traps reporting to network
blacklists: Razor, DCC, Pyzor
• Run SpamAssassin, or SpamAssassin
Pro!
How Do The Spammers Feel?
• Already hurting, according to CBS:
– “[I’ve gone through] unbelievable
hardships [to keep spamming] ... My
operating costs have gone up 1,000% this
year, just so I can figure out how to get
around all these filters”
• Spam relies on low overheads and
extremely cheap delivery
• Disrupt the equation and they will give up!
Future Directions
• Learning filters (Bayesian probability etc.)
– Learn automatically, to detect what "good"
mail to your network looks like
• "Hash-cash"
– Sending mail currently more-or-less free
– With hash-cash, each recipient requires
CPU time for the sender
– SpamAssassin can provide "bonus points"
for hash-cash users
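The hash-cash idea can be sketched as a small proof-of-work: the sender searches for a stamp whose hash starts with a run of zero bits, and the recipient checks it with a single hash. This is a simplification; the `recipient:counter` stamp layout and the function names are made up here and do not follow the real hash-cash stamp format.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def mint(recipient: str, bits: int = 16) -> str:
    """Sender's side: costs about 2**bits hash attempts on average."""
    for counter in count():
        stamp = f"{recipient}:{counter}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp

def verify(stamp: str, recipient: str, bits: int = 16) -> bool:
    """Recipient's side: one hash, so checking is cheap."""
    digest = hashlib.sha1(stamp.encode()).digest()
    return stamp.startswith(recipient + ":") and leading_zero_bits(digest) >= bits

stamp = mint("user@example.org", bits=12)
print(stamp, verify(stamp, "user@example.org", bits=12))
```

Since each stamp is tied to one recipient, the cost scales with the number of addresses mailed, which is negligible for ordinary senders and significant for bulk ones.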
Fin
• http://spamassassin.org/
– SpamAssassin for UNIX
– (free software)
• http://www.deersoft.com/
– SpamAssassin Pro: MS Outlook, Exchange
– (commercial version)
– (my employers!)