IMPROVING ANTISPAM TECHNIQUES BY EMBRACING PATTERN-BASED FILTERING HAIRUL ANUAR BIN MAT NOR

advertisement
IMPROVING ANTISPAM TECHNIQUES BY EMBRACING PATTERN-BASED
FILTERING
HAIRUL ANUAR BIN MAT NOR
UNIVERSITI TEKNOLOGI MALAYSIA
IMPROVING ANTISPAM TECHNIQUES BY EMBRACING PATTERN-BASED
FILTERING
HAIRUL ANUAR BIN MAT NOR
A project report submitted in partial fulfilment of the
requirements for the award of the degree of
Master of Science (Information Security)
Centre for Advanced Software Engineering
Faculty of Computer Science and Information System
Universiti Teknologi Malaysia
APRIL 2009
iii
Dedicated to my beloved wife and kids
iv
ACKNOWLEDGEMENT
In preparing this thesis, I received guidance and encouragement from my
supervisor, Dr. Rabiah Ahmad, without whom I might not be able to complete this
thesis.
I extended my appreciation to all of my colleagues and others who have
provided assistance and support.
Thank you to my beloved wife and kids for their support.
v
ABSTRACT
This study attempted to show that there are still away to improve antispam
system. The classical method of filtering spam is by inspecting content of an e-mail
and finding a matching pattern with a predefined ruleset. Each matched keyword or
sentence will produce a weight, also called score, which will be combined to produce
the final score and later to be used for identifying spam similarities in the message.
Spammers keep changing a style in generating a spam message to avoid being
filtered. Bayesian technique was found to be suitable to embed in the antispam in
order to recognize the characteristic of spam in a message eventhough the content
has been changed. However, the Bayesian introduced difficulties to the system as
spammers have changed the way they send the spam specifically to bypass Bayesian
filter. Thus, it is time to find a way to filter those spam. Normally an antispam works
on the content, but there is a possibility to filter spam based on its pattern of
delivering the spam at network level to reduce the congestion in the network. Filter
at network level is also benefiting the server as it has eliminated some spam before
they are received and processed for the content. A study were conducted to show
above statement is true. An Antispam with Pattern Based Filter (ASPBF) will be
tested and the result will be compared with the test for Antispam with Bayesian. The
comparison result will determine how much it has achieved its objectives. This study
will be able to stimulate more studies in the future to further improve antispam
solution in the fight against spam and to have a better e-mail communications.
vi
ABSTRAK
Kajian ini dilakukan untuk membuktikan bahawa masih terdapat peluang
untuk memperbaiki sistem antispam. Cara biasa mengatasi masalah spam adalah
dengan memeriksa kandungan emel dan mencari kesamaannya dengan himpunan
maklumat kandungan spam yang tersedia ada. Ini telah memaksa para pembuat spam
mengubah cara mereka mengeluarkan spam bagi memastikan ia tidak ditapis.
Susulan daripada itu, para penyelidik mendapati teknik yang digunakan Bayesian
adalah sesuai untuk menjejaki emel yang mengandungi ciri-ciri spam walaupun isi
kandungan e-mel tersebut diubah bagi mengaburi antispam. Tetapi, teknik Bayesian
didapati tidak lagi releven malah telah mengakibatkan keburukan yang lain terutama
sekali bila para pembuat spam mengubah kandungan emel semata-mata untuk
mengaburi teknik Bayesian. Sungguhpun begitu, spam masih boleh dikenal pasti
melalui cara ia dihantar, oleh itu mengecam corak spam dihantar pada peringkat
rangkaian mampu mengekang kepadatan rangkaian dan mampu mempertingkatkan
proses antispam kerana kebanyakan e-mel telah disekat dan tinggal sebilangan kecil
sahaja yang akan diproses oleh antispam yang mengkaji kandungan e-mail. Kajian
ini juga membangunkan satu aplikasi antispam, dinamakan Antispam with PatternBased Filter (ASPBF), berupaya mengecam corak penghantaran spam dan
memutuskan sambungan spam tersebut. Hasil ujian ini akan membuktikan ia telah
berjaya
memenuhi
objektifnya
dalam
mempertingkatkan
kelancaran
dan
keberkesanan antispam. Hasil ujian ke atas aplikasi tersebut dan aplikasi antispam
dengan Bayesian akan dibandingkan bagi menunjukkan keberkesanan aplikasi
tersebut. Kajian ini diharapkan bakal menjadi perangsang kepada kajian-kajian lain
dalam mempertingkatkan kemampuan antispam bagi memperoleh dunia komunikasi
e-mel yang lebih baik.
vii
TABLE OF CONTENTS
CHAPTER
1
TITLE
PAGE
DECLARATION
ii
DEDICATION
iii
ACKNOWLEDGEMENT
iv
ABSTRACT
v
ABSTRAK
vi
TABLE OF CONTENT
vii
LIST OF TABLES
x
LIST OF FIGURES
xi
LIST OF ABBREVIATIONS
xii
INTRODUCTION
1
1.1
Introduction
1
1.2
Evolution of Spam and Ways to Filter
1
1.3
Problem Statement
3
1.3.1
Limitations of Bayesian
3
1.3.1.1.
Image Spam
5
1.3.1.2.
Open Relay and Proxies
6
1.3.1.3.
Botnets
7
1.3.1.4.
Spam with Dynamic Contents
8
1.3.2
Available Solution in Resolving Issues with
8
Bayesian
1.4
Definition of Patterns
9
viii
2
3
1.5
Objective
10
1.6
Scope
12
1.7
Hardware Specification for Simulation
13
1.8
Summary
13
LITERATURE REVIEW
14
2.1
Introduction
14
2.2
Antispam Solutions without Bayesian
14
2.2.1
Image Spam
15
2.2.2
Botnets and Network Level Protection
17
2.2.3
Dynamic Contents
23
2.3
Antispam with Bayesian
26
2.4
Summary
27
METHODOLOGY
29
3.1
Introduction
29
3.2
Methodology
29
3.2.1
Traffic Shaping
30
3.2.2
Sender Reputation
31
3.2.3
Sender Policy Framework
31
3.2.4
DNS Blacklist (DNSBL)
33
3.2.5
Recipient Validation
34
3.3
Improvement To Expect
34
3.4
Sample Data
35
3.5
Requirement for the simulation
36
3.5.1
Development Tool
36
3.5.2
Prototyping
37
3.5.3
Implementation
54
3.5.4
Simulation Environment
54
3.6
Test Methodology
55
3.7
Summary
56
ix
4
5
6
RESULT AND ANALYSIS
60
4.1. Introduction
60
4.2. Result
60
4.3. Analysis
66
4.4. Conclusion
67
DISCUSSION
68
5.1
Introduction
68
5.2
Discussion
68
5.3
Conclusion
69
CONCLUSION
70
6.1
Introduction
70
6.2
Conclusion Remarks
70
6.3
Future Work
71
REFERENCES
73
x
LIST OF TABLES
TABLE NO.
TITLE
PAGE
1.1
Specification of The Computer For The Simulation
13
2.1
Comparison of Methods to Detect Image Spam
17
2.2
Comparison of Network-Level Protection Techniques
22
2.3
Comparison of Techniques to Cater For Dynamic
26
Contents
2.4
Recommended Combination for Pattern-Based
27
Filtering Technique
2.5
Recommended Combination for Pattern-Based
28
Filtering Technique (Continue)
4.1.
Result of ASPBF
64
4.2.
Result of Antispam without ASPBF
65
xi
LIST OF FIGURES
FIGURE NO.
TITLE
PAGE
1.1
Sample of Image Spam
5
1.2
Architecture of EMP
8
2.1
Sample of Distorted Text in Image
15
2.2
Examples of spam images
16
3.1
System Architecture (part 1)
32
3.2
System Architecture (part 2)
33
3.3
Function of Traffic Shaping (Part 1)
37
3.4
Function of Traffic Shaping (Part 2)
38
3.5
Function of Sender Reputation (Part 1)
38
3.6
Function of Sender Reputation (Part 2)
39
3.7
Function of Sender Reputation (Part 3)
40
3.8
Function of Sender Reputation (Part 4)
41
3.9
Function of Sender Reputation (Part 5)
42
3.10
Function of Sender Reputation (Part 6)
43
3.11
Function of Sender Reputation (Part 7)
44
3.12
Function of Sender Policy Framework (Part 1)
45
3.13
Function of Sender Policy Framework (Part 2)
46
3.14
Function of Sender Policy Framework (Part 3)
47
3.15
Function of Sender Policy Framework (Part 4)
48
3.16
Function of DNS Blacklist (Part 1)
49
3.17
Function of DNS Blacklist (Part 2)
50
3.18
Function of DNS Blacklist (Part 3)
51
xii
3.19
Function of DNS Blacklist (Part 4)
52
3.20
Function of DNS Blacklist (Part 1)
52
3.21
Function of DNS Blacklist (Part 2)
53
3.22
Function of DNS Blacklist (Part 3)
54
3.23
Placement of the test components
56
4.1
Logs For DNSBL Filter
59
4.2
Logs For SPF Filter
60
4.3
Logs For Recipient Validation Filter
61
4.4
Logs For Sender Reputation Filter
62
4.5
Logs For Traffic Shaping Filter
63
4.6
Result of Filter with ASPBF
64
4.7
Result of Filter Without ASPBF
65
5.1
Expected outcome from the combination of ASPBF
69
and Bayesian
xiii
LIST OF ABBREVIATIONS
ASPBF
-
Antispam with Pattern-Based Filter
Botnet
-
Robot Network
CNC
-
Cloudmark Network Classifier
CPU
-
Central Processing Unit
DCOM
-
Distributed Component Object Model
DKIM
-
DomainKeys Identified Mail
DNS
-
Domain Name Service
DNSBL
-
Domain Name Service Black Listing
E-Mail
-
Electronic Mail
EMP
-
Extensible Messagin Platform
IP
-
Internet Protocol
LDAP
-
Lightweight Directory Access Protocol
LSASS
-
Local Security Authentication Server
OCR
-
Optical Character Recognition
RAM
-
Random Access Memory
SMTP
-
Simple Mail Transfer Protocol
SPF
-
Sender Policy Framework
SVM
-
Support Vector Machine
CHAPTER 1
INTRODUCTION
1.1
Introduction
This chapter forms the introduction to the thesis. Discussions begin with
emphasis on the history of spam and ways taken to fight against spam, including the
use of Bayesian technique, which is later found to be causing another problem.
1.2
Evolution of Spam and Ways to Filter
The growth of technologies especially those related to internet usage has
gradually been producing side effect that are damaging one’s productivity and image
when a precaution steps are not taken seriously. One of it is spam, less internet users
aware the severity of damage from spam. In addition, there are many users do not
know anything about spam, yet faced it in daily business.
Zdziarski (2005) claimed that many people described spam as unwanted emails, while others defined it as “unsolicited commercial e-mail” which covered
advertisements for products and other types of solicitations. Eventually, the meaning
was leaning towards “unsolicited bulk e-mail” and Zdziarski (2005) suggested it was
more acceptable.
It is commonly known that spam is sent in bulk intended to reach as many
people as it could to deliver messages that are meant to solicit. Therefore, the
beginning of mass spam distribution could be traced back to year 1994.
2
Lawyers Laurence Canter and Martha Siegel implemented the first large-
scale spam in 1994, especially when they sent a message offering help with the
Immigration and Naturalization Service’s “green-card lottery” to thousands of
Usenet newsgroups. Thousands of angry recipients sent nasty messages to
NETCOM, Canter and Siegel’s Internet Service Provider, causing the provider’s
machine to overload and crash (Foxman and Schiano, 2000).
Since then, people started to protect their e-mail servers from spam, thus
creating a huge demand on antispam products. This had motivated companies to
produce antispam solutions in rapid mode by taking advantage from readily available
antispam software developed by open source community.
One of the popular open source antispams is called SpamAssassin which was
developed by using Perl language and the implementation was meant for UNIX or
Unix-like environment.
Schwartz (2004) described SpamAssassin as a software for analyzing e-mail
messages, determining how likely they were to be spam and reporting its
conclusions. It was a rule-based system that compared different parts of e-mail
messages with a large set of rules. Each rule would add or remove points from the
message’s spam score. A message with a high enough score would be reported as a
spam.
There are numbers of rules embedded within SpamAssassin which contribute
scores to the final decision. Each rule has its own weight where the most effective
rules will be given more weight. However, users are also given chances to modify
the score of each rule or to add new rule sets.
SpamAssassin combines message format validation, content-filtering and the
ability to consult network-based blacklists (Schwartz, 2004). These are the typical
ways to filter spam. In general as described by Mendez (2008), there are two kinds
of spam filtering techniques: (i) collaborative system and (ii) content-based
approaches. The former are based on sharing identifying information about spam
messages within a filtering community, while content-based approaches analyzed
message content in order to identify its class (usually using machine learning
techniques)
3
In the early age, spam could be found in a typical way. However, as time
goes by spammers were getting more creative and they produced spam in different
ways making the spam more dynamic and less chances to be filtered by antispam.
Thus, artificial intelligence was needed in order to countermeasure the dynamic
spam attacks. That was why Bayesian techniques were embedded with
SpamAssassin to improve the catch rate.
Bayesian Filter in SpamAssassin is one of the most effective techniques for
filtering spam. It is a mathematical statistical analysis. Bayesian analysis involves
teaching a system that a particular input gives a particular result. For spam filtering,
this teaching is repeated, many times over with many spam and ham (legitimate) emails. Once this is finished, a Bayesian system can be presented with a new e-mail
and will give a probability of the result being spam. However, for best results,
teaching should be a constant process (McDonald, 2004).
1.3
Problem Statement
As enhancements are continuously introduced to antispam solutions and Bayesian
was always the main focus for its capabilities to detect spam using probabilistic
approach. However, amount of traffic and amount of spam increased over time
causing Bayesian technique to undergo performance degradation. In relation to
Bayesian technique, this section will primarily discover the limitations of Bayesian
technique.
1.3.1
Limitation of Bayesian
Bayesian technique used in antispam is primarily meant for probability and
similarity of sentences and keywords with actual spam. Therefore, Bayesian can only
be used for spam that is readable by a machine, which means words are in correct
spelling sequence.
4
A problem appeared when spammers started to use non-readable words,
where other characters are introduced in the middle of the spelling. Machines will
face problem recognising the words, but human eye will be able to identify the
words as the words and the unknown characters are differentiated by colours or by
any other mean. This is where Bayesian is gradually failing its operations.
More unfortunate to Bayesian filter, spam attacks not only can slip through
the filter due to the unreadable form of message, latest research has shown that
readable form of message also can defeat the Bayesian filter.
Research by Karlberger et al. (2007) had shown that existing attack aiming at
Bayesian is to let spam mails be identified as ham mails. Currently existing attacks
aim to achieve this by adding words to the spam mails. The objective is that these
additional words are used in the classification of the mail in the same way as the
original words, thereby tampering with the classification process and reducing the
overall spam probability. When the additional words are randomly chosen from a
large set of words, this is called random word attack, also called “word salad”. The
objective is that the spam probabilities of words added to the spam message should
compensate for the original tokens’ high spam probabilities in the calculation of the
whole message’s combined spam probability.
This has shown that spammers has turned significant energy to building
software which cloaks their spam to hide the trigger words and phrases from
Bayesian filters, the resulting messages become progressively simpler for pattern
based scanners to correctly identify as spam. This is because pattern based scanners
can identify spam by actually targeting the cloaking techniques the spammers are
using, rather than worrying about identifying spam based on words and phrases.
There are significant patterns that can be recognised from a spam e-mail as
spam is normally sent in bulk targeting a group of people. Therefore, there are many
similar spams are flowing around in the internet. Thus, it is a good idea to take
advantage on this characteristic to enhance the filtering process.
However, not just pattern in the spam e-mails are always similar, but
techniques of sending the spam also found to be similar. This has increased
5
awareness among antispam developer to look at filtering the spam before the spam
actually reach the e-mail server, thus saving a lot of processing power.
These two types of pattern are not the centre of Bayesian’s focus even though it can
be of assistance to an antispam to perform better. In the event some spams will slip
through the pattern-based filter, Bayesian can be used as the last line of defence.
1.3.1.1 Image Spam
Image spam is another form of spam due to its effective ways to deceive
spam filters since they can only process text (Mehta et al. 2008). Images are hard to
process for identifying spam as they only contain non-readable characters, which can
only be interpreted by human’s eye. Figure 1.1 below show one example of image
spam.
Figure 1.1: Sample of Image Spam
Although Optical Character Recognition (OCR) was introduced to extract
letters and keywords from an image, but it has given a big impact on overall
processing load.
Mehta et al. (2008) also emphasis that images in spam e-mails contain text
messages conveying the intent of the spammer. This text is usually an advertisement
and often contains text, which has been blacklisted by spam filter.
6
Since legitimate e-mails may also contain images in the attachment, therefore
it is hard to differentiate legitimate e-mail and spam e-mail if the detection cannot
filter spam in an image. Besides, detecting characteristic of the e-mail with image
spam may not the best solution as the characteristics are almost similar to the
legitimate e-mails.
In further research by Baskhar et al. (2008), they found the sophisticated
techniques like Optical Character Recognition (OCR) can fail to recognize text in
images due to the distinction of every image spam, where more noises where added,
random background, colour of the fonts and many more.
1.3.1.2 Open relay and proxies
Open relay is a method to provide a convenience way to end users enabling
them to send e-mails without any authentication that require manual intervention
each time. This method also simplifies communication between internal server and
the upstream server that connect to the internet.
However, open relays have been exploited by spammers due to the
anonymity and amplification offered by the extra level of indirection. It appears that
the widespread deployment and use of blacklisting techniques have all but
extinguished the use of open relays and proxies to send spam (Ramachandran and
Feamster, 2006).
Although smtp authentication is widely used nowadays to prevent
unauthorized use of the e-mail server, but there are still a lot of e-mail providers that
are not taking serious action to overcome this issue as it does not really affect their
servers.
That is not the only problem, but spammers are developing their smtp server,
a tiny version, in order to find a simpler way to broadcast the e-mail rather than
finding an open relay servers which may take time to discover. This tiny smtp server
is then embedded into botnet, therefore they will have a large distribution of open
relay servers that are ready for use at anytime.
7
1.3.1.3 Botnets
Today many people claimed that most of the spams are coming from botnet,
where botnet is an interpretation of a collection of machines acting under one
centralized controller. Normally a machine might get affected by worms that
exploiting Windows vulnerabilities opening backdoor to allow remote attacker to
take control of the infected machines.
As mentioned earlier, the botnets might have a tiny smtp server application,
thus whenever the backdoor is opened to the attackers, they will have a direct access
to the smtp server to launch spam attacking everyone in the world. This will hide the
real spammers, but put the blame on the innocence.
But not necessarily the botnets to have its own smtp server, it can exploit
vulnerabilities in the windows machine to act as smtp server. As mentioned by
Ramachandran and Feamster (2006), worms exploiting DCOM and LSASS
vulnerabilities on windows systems allows infected hosts to be used as a mail relay,
and attempts to spread itself to other machines affected by aforementioned
vulnerabilities, as well as over e-mail.
1.3.1.4 Spam With Dynamic Contents
Content-based filters, such as those incorporated by popular spam filter like
SpamAssassin, successfully reduce the amount of spam that actually reaches a user’s
inbox. On the other hand, content-based filtering has drawbacks. Users and system
administrators must continually update their filtering rules and use large corpuses of
spam for training; in response, spammers is negligible, since spammers can easily
alter content to attempt to evade these filters (Ramachandran and Feamster, 2006).
Although Bayesian technique was initially found to show high performance
precision and recall, it has several problems. First, it has a cold start problem, that is
training phase has to be done before execution of the system, the system has to be
8
trained about spam and non-spam mail. Second, the cost of spam mail filtering is
higher than rule-based system. Third, if an e-mail has only few terms those represent
its contents, the filtering performance is fallen (Kim et al, 2004).
In that sense, the Bayesian works based on similarities to the previously
found spam and non-spam. However, the newly designed spam may not be
recognized by Bayesian, thus require a human intervention to provide an input for retraining and to get familiar with the latest content. But, while waiting for a human
input, damages might have been done and reputation of the e-mail providers might
get poorer.
1.3.2
Available Solution in Resolving Issues with Bayesian
This study was inspired by a solution commercially made available, namely
Extensible Messaging Platform (EMP), which is focusing on recognising spam from
its pattern and signature. Although the solution is ready off-the-shelf, but there are
some characteristics that may affect performance and reliability.
Figure 1.2 below shows the architecture of the EMP solutions:
Figure 1.2: Architecture of EMP
9
From the given architecture, there are some factors that may lead to a
decreasing performance:
i.
The technique used to do sender reputation is based on blacklist database
which information are collected and shared from various input all over the
world, thus requesting the information in real-time requires a reliable
connections. This is a very resource intensive process especially when the
amount of the traffic is too huge.
ii.
The EMP is acting as an smtp software which create more load to the server
as it needs to run two instances of different smtp software.
iii.
The signature of spam shall be downloaded from the central repository in
order to make the pattern-based filter to function properly, it would be better
if the machine is self-dependent where it shall develop its own database of
signature from the observation and analysis on the incoming traffic.
Although a commercial product is available to provide pattern-based filter,
this study will look deeper into ensuring the product to maintain its performance for
a longer period of time even though the amount of traffic increases.
Besides, this is an academic study to appreciate researches by earlier
researchers for their contribution in improving antispam solution. This study shall be
more valuable and acceptable when each of the proposed technique in the past
research is put into a test.
1.4
Definition of Pattern
The use of ‘pattern’ is referring to the following definitions which depend on
which protection level it may be used:
i.
At network level
At this level, the pattern referring to behaviour of an IP address in requesting
for SMTP connections. Some of the patterns are:
ii.
10
•
The large amount of traffic in a short period of time.
•
Number of invalid recipient from an IP address in a short time.
•
Invalid IP address or IP address without specific DNS record.
At e-mail content level
At this level, the some characteristic, especially those listed below will be
categorized as pattern:
•
Invalid timestamp.
•
Invalid header information.
•
Keywords or mix of keywords matched with database of spam
keywords.
iii.
At e-mail attachment level
The following items are the pattern to look for:
•
Image containing spam keywords. The pattern can be figured out by
recognising the properties of the image and compare it with template
of image spam.
1.5
Objective
This study will be focusing on three main objectives:
i.
To study other solutions to improve the percentage of catch rate and to reduce
processing load.
11
This study will try to look for other antispam solutions which are avoiding
Bayesian in its implementation. Bayesian is known with its probabilistic
approach which requires a lot of sample data and yet intensive calculation for
a higher accuracy. With a large data and calculation it will require more
processing power which reduces overall performance of an antispam product.
Thus, a simple way known as pattern-based filtering will serve a better
performance as its functionality does not need to thoroughly look into the
content of an e-mail.
ii.
To analyse combinations of existing solutions to form the best pattern-based
filtering technique.
There are studies done by researchers that tried to improve anti spam filtering
without using Bayesian filter technique in the implementations. However, in
order to prove they had improved the solutions, they had to compare their
results with the result from antispam product with Bayesian filter. However,
each of these techniques was only focusing on specific function and
parameter. Therefore, this study will help to determine the right combination
of techniques that could potentially improve the antispam solution without
the need to include Bayesian filter.
iii.
One antispam system will be developed at the end of this study. The said
system will be formed by a combination of selected techniques.
The said system will be developed without Bayesian. But, for a comparison
purposes, result from an existing product that is using Bayesian will be used
to compare with the result of the newly developed system to prove that an
improvement has been made.
1.6
12
Scope
This study is introducing techniques to improve antispam, especially to
reduce the erroneous and the cost of performance generated by Bayesian filter.
Although there are numbers of techniques focusing on other areas of improvement,
this paper will not study the foundation of those techniques.
This study will further study on the suitability and compatibility of each of
mentioned techniques in order to form the pattern-based filtering technique to be
used in an antispam solution or product. Having the right tool and a controlled
environment, the said antispam system will be developed by using pattern-based
filtering, and will be tested for a comparison study against other antispam solution
that is using Bayesian.
The proposed antispam system shall have three levels of protection where the
pattern-based filtering can be applied:
i.
Network Level
ii.
E-mail Content Level
iii.
E-mail Attachments Level
However, as time is a major constraint factor, the scope of this study will be
limited to focusing on implementing the protection at network level. Although the
focus is small, but it shall meet its objective as filtering at network level will reduce
spam in the server, thus improving accuracy of spam blocking. It will also eliminate
extra load in the server when number of e-mail to process is getting less.
For the mentioned test, a collection of spam will be fed into the mentioned
antispam system, where these samples are actually collected from actual spam.
However, the samples are limited to spam written in English characters only and the
image spam will not the scanned for optical character recognition (OCR).
13
1.7
Hardware Specification for simulation
An existing computer with the following specification will be used to
develop the mentioned antispam system:
Table 1.1: Specification of the Computer for the Simulation.
No. Type
Specifications
5.
CPU
AMD Phenom (Quad Core)
6.
RAM
2 GB
7.
Hard Disk
500GB
8.
Operating System
OpenSolaris
A justification on choosing a unix machine will be in Chapter 3.
1.8
Summary
Problems as described in this chapter have shown that there is a need to
improve the existing techniques and methods for a better result includes improving
the utilization of system resources. As the problems are narrowing into catch rate and
performance issues, therefore this chapter focuses on three objectives in resolving
the issues. Firstly, study on other solution. Second objective is to form a patternbased filtering technique based on combination of solutions. Third, to developed an
antispam solution containing the said solution and perform a performance evaluation
for a comparison with Antispam with Bayesian filter.
14
CHAPTER 2
LITERATURE REVIEW
In this chapter, some preliminary studies on antispam are reviewed,
especially on spam filtering techniques that do not require Bayesian to be in place.
This chapter discusses the approaches taken by other studies and its potential to be
part of the new system.
2.1
Introduction
In order to improve the efficiency of an antispam product, a few researches
have been carried out, but each of those focused only on specific way to improve
specific task that had not been resolved by Bayesian for example image spam and
network level protection. Pattern-based filtering is a combination of techniques that
use pattern rather than statistical method as introduced by Bayesian in recognizing
spam in an e-mail. Nevertheless, pattern-based alone will not enough to filter all
spam but it is enough to eliminate problem caused by Bayesian.
2.2
Antispam Solutions without Bayesian
This section describes types of spam and measures to prevent it from
attacking a network as a result from studies all over the world. The importance of
these methods and techniques will help this study to find a better solution.
15
2.2.1 Image Spam
As spam evolved into a new look, researchers were looking for ways to
protect one network from potential threats caused by spam. Image spam is the term
used where spam is not in plain text form anymore, but it is an image that carries the
text. Hence, Optical Character Recognition (OCR) was developed to fulfil the needs.
A possible way to detect image-based spam is using a pipeline of an optical
character recognition (OCR) system that extracts and recognizes embedded text,
followed by a text classifier that separates advertising text from legitimate content.
While this solution promises to detect spam with a certain level of accuracy, the
existing OCR algorithms are computationally expensive and thus cannot operate on
heavily loaded e-mail servers. (Nhung & Phuong, 2007)
Thus, overall process taken by OCR is not efficient where it will need higher
processing load and time. The deficiency continues with the inability to recognize
distorted text in the image. Figure 2.1 is an example of distorted text.
Figure 2.1: Sample of Distorted Text in Image
Works by Mehta et al. (2008) has shown that they have found a way to detect
image spam based on Near-Duplicate detection algorithm which is based on the
intuition that they can recognize a lot of images similar to an identified spam images,
where image spam usually generated from a template. Given enough training data,
they should be able to detect large volumes of image spam, while being open for
further training.
However, that technique is based on a trained data, it needs to store the
characteristics of each image spam in order to use it to recognize similar image spam
in the future. The disadvantage of this technique is that it will slow down the overall
process when the database getting huge.
16
However, Nhung and Phuong (2007) used a slightly different technique,
where the method was also not to extract text from an image. Instead, it uses an
edge-based feature vector, which can be computed efficiently, to represent major
shape properties of the image. Since most of spam images contain large proportions
of text as of Figure 2.2 below, they must have shape representation similar to that of
the other text intensive images.
(a) Image containing only text
(b) Image with photographic elements
Figure 2.2: Examples of spam images
Nhung and Phuong (2007) claimed that they have described a new method for
detecting spam e-mail with content embedded in images. Given an image, their
method does the following:
i.
Firstly, extracts an edge-based feature, which summarizes the global
information of the image.
ii.
Then, it computes a vector of similarity scores between the image and a set of
templates that contain only text.
iii.
The similarity vectors then serve as input for Support Vector Machine (SVM)
training and classification.
The use of edge-based feature allows capturing regularities in shapes of text
intensive spam images while does not require costly computations. The method is
fast because it does not use computationally expensive image processing and text
recognition steps. It was claimed that, the method separates spam images from non-
17
spam ones with accuracy of 80% and higher. Their method uses SVM combined
with vector representation.
As a comparison, table below will show the disadvantages and disadvantages
of those three methods as mentioned above:
Table 2.1: Comparison of Methods to Detect Image Spam
Solution
OCR
Advantages
Good at text recognition
Disadvantages
Fail to detect right text in
distorted form
Near-Duplicate detection
Good at finding
algorithm
similarities in to stop mass and require more hard disk
spread of image spam
Require to train the data
space and intensive
calculation
SVM with vector
Good at differentiating
The accuracy is at 80%,
representation
image spam and non-
which means a lot of
image spam by measuring
image spam will slip
the properties of the image through.
for comparison with
templates
2.2.2 Botnets and Network Level Protection
There was another study trying to move the focus out of content filtering but
targeting to recognizing the spam from network-level characteristics. Ramachandran
and Feamster (2006) had stressed out that much attention has been devoted to
studying the content of spam, but comparatively little attention has been paid to
spam’s network-level properties. Conventional wisdom often asserts that most of
today’s spam comes from botnets. However, based on the study on a botnet named
Bobax, they found that during the 17-month observation, Bobax did not send a large
amount of spam regardless of how long the botnet was active. More than 65% of
18
infected hosts sent spam only once, and nearly 75% of IP addresses send spam for
less than two minutes, although many of them send several e-mails during their
existence.
In that sense, paying attention to filtering spam based on the characteristics of
botnet is not worth awhile. The resources collected from all over the world will be
left unutilised, even though having such resource is a good practice for protecting the
network from future threat caused by botnets.
However, some antispam products already introduced the following
techniques in order to cater for network level protection:
i.
Sender reputation, enable a system to collect information of each e-mail
address and IP address and learn the behaviour of those sources in sending
spam including percentage of spam over time. Senders with good reputation
will be given a smooth connection, while the senders with bad reputation will
be given limited access to send e-mail.
However, there are two types of sender reputation technique:
a)
Rely on remote database
Cloudmark Network Classifier (CNC) was the first to be developed for
this functionality by Prakash and O’Donnell (2005). It was prototyped
using Vipul’s Razor technique and was released as an open source
project in 1998. Along with major update of Razor2, they founded the
product.
CNC is a community-based filter training system. It does not rely upon
any one semantic analysis scheme but upon a large set of orthogonal
signature schemes that are trained by the community (Prakash and
O’Donnell, 2005). The CNC consists of 4 components: agents,
nomination servers, catalog servers, and reputation system (known as
TeS – Trust evaluation System).
The agents is the application to report a feedback, either spam or not, to
the nomination server. The feedback take the form of a set of small
fingerprints that are generated from the message rather than
19
transmitting the entire content. The CNC require the submitted
feedback to be corroborated by multiple trusted members of the
community. Once the fingerprint is considered as spam, then the
fingerprint will be added to the catalog servers. The fingerprints will be
queried from this catalog server giving information to any e-mail
system about the likelihood of spam or not.
b)
Local developed database
CNC was good as it store shared information gathered from the world.
But, too much dependency on the catalog server will give extra cost in
maintaining the connection. Unresponsive query will slowdown the
performance. Therefore, sender reputation in local database is a more
convenient approach. The first product using this technique is
TurnTide, which was bought over by Symantec Corp.
By using routing technology to stop spam at the network layer,
TurnTide wants to make it technologically can economically infeasible
for spammers to target corporate e-mail servers (Schultz, 2004).
TurnTide incorporates TCP traffic shaping, a basic engineering
technique
for
inserting
quality-of-service
and
error-handling
mechanism into a network.
TCP traffic shaping is a way to limit access to bandwidth. Over time,
TurnTide will learn the behaviour of each incoming IP address. IP
addresses with bad reputation will be given a small pipe of bandwidth,
while the good ones will be given a bigger pipe. This will make the
spammers frustrated for not being able to spread the spam as fast as
they want.
The advantage of this technique as compared to CNC is that it does not
rely on remote database, as its own database will be developed from the
incoming connections.
ii.
20
Sender Policy Framework (SPF) and DomainKeys Identified Mail (DKIM),
these two approaches try to let the original e-mail provider to tell the world
from where or which IP address would they send their e-mail from.
The e-mail providers are required to register their IP addresses of SMTP
server into DNS. Therefore, the receiving e-mail server will only accept
connection from registered IP address, thus eliminating connections created
by spammers at their own customizable smtp server. For example, domain
Hotmail.com is registered with SPF of IP address 10.10.10.10, so spammers
who normally do e-mail forgery by sending e-mail as a hotmail’s user but
from an unknown IP address will be rejected.
Advantage of this method is it will block most of the spam sources because
pretending as a genuine e-mail domain is the best technique to attract one’s
attention.
iii.
Recipient validation, which focusing on terminating the e-mail connection
whenever the recipient is found to be invalid, thus reducing the needs to
further process the e-mail content.
Most e-mail server actually will do this automatically. But, there are some email servers hiding this technique but received all incoming e-mail as to
protect the information whether one recipient is valid or not. Such
information if exposed to spammers will be useful for them to start
spamming the user. But the disadvantages of that protect is that the server
need to process and store the incoming e-mail, which is not good at saving
the processing load.
iv.
E-mail harvesting, where an IP address is seen to try to make a connection to
send e-mail but it close the connection when the server acknowledge whether
the recipient exist or not in the system. Normally, the attacker will try to use a
dictionary to launch the attack, it does not seem to be harmful to the system,
but the list, which was gathered by this attacker, can be exploited by
spammers for their spam activities.
v.
21
DNS Blacklist (DNSBL), it contains a collection of IP addresses which were
reported to send spam. Any e-mail system can make use of this DNS record
to filter IP addresses that are known as spammers. But the list was not
accurate and not real-time, thus some IP addresses were still blacklisted even
though the real spammer has moved to another IP address. Example of the
provider of this DNSBL is spamhaus, which was recently sued for false
blacklist. Although an IP address will stay in the blacklist just for a few days,
but that will be enough to cause losses to any company that rely on internet
and e-mail as the main source of business.
Despite of the pros and cons, Ramachandran and Feamster (2006) concluded
that the mitigation method based on network-level properties, such as IP address of
the last remote mail relay, was a reasonable way to filter the spam before it is
processed for content filtering. This would reduce the need for content processing
power, and yet eliminate the threat before it reaches the system. The research had
also shown the importance of network-level properties, which has two important
properties that could potentially lead to more robust filtering:
i.
Network-level properties are less malleable than those based on an e-mail’s
contents
ii.
Network-level properties may be observable in the middle of the network, or
closer to the source of the spam, which may allow spam to be quarantined or
disposed of before it ever reaches a destination mail server.
However, the study done by Ramachandran and Feamster (2006) was
focusing on understanding the network level behaviour of spammer which was not
proposing any solution to the findings. But, the finding was good to bring up
awareness on e-mail providers to have network level protection for spam issues.
Table 2.3 below will show the pros and cons of the mentioned network-level
protection for spam:
22
Table 2.2: Comparison of Network-Level Protection Techniques
Solution
Advantages
Disadvantages
CNC
Valid information as it
Dependencies to a remote
was contributed by trusted
server will need a
source all over the world
sustained connection.
TurnTide
Local cache and self learn. Limit of IP address can be
stored is up to a few
million. However, IP
address with less activity
will be removed. There is
no need to store all IP
address for a long time.
SPF and DKIM
Able to block incoming
Depends on remote DNS,
connection from forgery
but genuine domain will
sources.
normally provide a reliable
DNS service.
Recipient Validation
Able to prevent extra load Will be resource intensive
to process the content.
as the query to the
database will be done each
time a connection is
established.
Solution
Advantages
Disadvantages
E-mail Harvesting
Able to stop connection
It requires an intelligence
Protection
that is wasting the
to detect the activity.
network resources.
DNSBL
Good at preventing known Depends on remote
spam
sources
from resource.
entering the network.
23
2.2.3 Dynamic Contents
In early days, antispam was developed to look for spam by keyword
matching. If an e-mail contains a lot of text, the possibility to find keywords or
sentences that match with those in the database will be high. But, that had made
spammers to be more creative in composing spam, they put less text, and sometimes
only a link is embedded in the e-mail.
Semantic enrichment is another method to improve the effectiveness of
antispam filtering system which was introduced by Kim et al. (2004). It usually done
by remodelling database schemas in a higher data model in order to explicitly
express semantics that is implicit in the data. Kim et al. (2004) used the technique to
enrich semantically useful terms to an e-mail that contains only few terms. The
semantic enrichment process is divided into two, structural information and
semantical concepts.
The method introduced by Kim et al. (2004) consists of two phases, training
and testing. The training phase will create classes of spam and non-spam where each
class will contain sub-classes. These classes will be kept in DAML+OIL form.
DAML is an acronym for DARPA Agent Markup Language. Meanwhile,
DARPA is an acronym for Defence Advanced Research Project Agency, which is
the central research and development organization for the Department of Defense of
US. The goal of DAML is to develop a language and tools to facilitate the concept of
the Semantic Web. OIL is formed by Ontology Inference Layer, which is a proposal
for a web-based representation an inference layer for ontology and combines the
widely used modelling primitives from framed-based languages with the formal
semantics and reasoning services provided by description logics.
The Test Phase has two steps, Structural Information and Semantic Concepts.
The method used by Kim et al. (2004) will convert e-mail into RDF (DAML+OIL)
format, therefore e-mail can be mapped easily with the ontology.
But, one of the disadvantages of that method, it will need to convert each email into RDF format before it can proceed with the comparison. This method is
efficient if the contents of the e-mail are small, but not efficient for e-mail with large
24
content and not efficient for an e-mail system that receives a large volume of emails.
Liang et al. (2006) has introduced a Novel Approach for Spam Classification
(NASC) which experimentally proven to raise both the recall ratio and precision
ratio, and enhance the ability to self-adaptation and diversity for the spam.
NASC is an immune-based model to cater for dynamic spam detection. This
model is composed of three parts, first is Mail Character Presenting Module
(MCPM), second, training module and third, filtering module (Liang et al., 2006).
They emphasised that MCPM will use vector spam model (VSM) and present the
received mail in discrete words. VSM has been widely used in the traditional
information retrieval field. Most search engines also use similar measures based on
this model to rank web documents. Documents are mapped into term vector space. A
document is conceptually represented by a vector of keywords extracted from the
document, with associated weights representing the importance of the keywords in
the document and within the whole document. The form of the vector is listed below.
V = (t1,w1,… ,tn,wn)
Where t1 represent keyword and wi is the associated weights representing the
importance of the keywords in the document.
The training module consists of four elements:
i.
Gene Library
According to the structure of a document, the gene library is divided into two
subsets: mail list gene library and content gene library. This library actually is
a collection of keywords where mail list gene library consist of suspicious email addresses or domains which are issued by the authoritative institution.
This kind of library can be updated from internet and from manual insertion.
Meanwhile, the content gene library consists of keywords which are picked
up from any contributors from the internet
ii.
25
Immature Detector
Immature detector is the term vector which is organized in a pre-defined
structure. Every element of the term vector is all from gene library.
iii.
Affinity
Affinity is the match strength between two documents
iv.
Mature Detector
If the match strength between an immature detector and one of the training
mails is below the pre-defined threshold, the detector is considered as a
mature detector.
The Filtering Module consists of:
i.
Memory Detector
It is considered as reliable detector. A mature can become a memory detector
if the matching threshold is over active threshold.
ii.
Co-stimulation.
It is the signal comes from the administrator to the system, which is used in
the following situations:
a)
When mature detectors detect a suspicious e-mail, system need this
signal to judge the suspicious e-mail whether it is a spam or not.
b)
When the count field of one mature detector is over active threshold,
system will need the co-stimulation signal. If an affirmative costimulation signal writes back to system from network administrator
in fixed period, this mature detector will become memory detector.
Otherwise, it will be recognized as invalid detector and be deleted
from mature detector set.
26
Comparison between NASC and Semantic Enrichment is as per Table 2.4
below
Table 2.3: Comparison of Techniques to Cater For Dynamic Contents
Solutions
Advantages
Disadvantages
Semantic Enrichment
Able to catch spam with
Not efficient if the e-mail
minimal contents
system is receiving big email or large volume of email traffic.
NASC
Able to utilise computer
May require a bigger
memory for a faster
RAM size to cater for the
matching process and
memory detectors if the
terminate the idle and
volume of the traffic is
unnecessary detectors
huge. It also requires a
training set.
2.3
Antispam with Bayesian
In normal operation, an antispam system will require a content filter, and
Bayesian filter. The content filter will check content of e-mail and find a match with
an inbuilt database through regular expression methods. Every matched expression
will contribute a score to the message, where the total score that over the preset
threshold will classify a message as spam.
Here are some examples of regular expression that are used by antispam,
Schwartz (2004):
/CLICK\s.{0,30}(?:HERE|BELOW)/s
/click\s.{0,30}(?:here|below)/is
27
The meaning of above examples are: if there is a sentence either all letters are
capital letters or not of “click here” or “click below” or “clicks here” or “clicks
below” or “click … here” or “click … below”
The examples above are pre-defined by the administrator, but can be
collected from the internet from the suggestions by the internet users.
At the meantime, Bayesian works on each word without the need for its
dependency and sequence. The real work requires statistical functionality with a set
of trained data. According to Bekman (2006), we need to feed it with good mail and
with undesired e-mail, from that point on a Bayesian filter will try to decide what is
spam and what is not and sort the e-mail to different folders, but a constant is still
needed on both.
One of the drawbacks is it require training, which will give weightage to each
word and enable Bayesian to use it statistical method to measure the message.
2.4
Summary
The studied techniques in this chapter have given enough input to the thesis
to move forward with the next agenda. The following Table 2.5, which consists of
techniques with most advantages in each problem area, will show the possible
combination for the pattern-based filtering to be developed.
Table 2.4: Recommended Combination for Pattern-Based Filtering Technique
Solutions
Advantages
Extra Requirements
SVM with Vector
To solve issues with
None
Representation for Image
image spam
Spam
Sender Reputation – local
To shape the traffic based
Local database to cache all
database (TurnTide)
on behaviour, thus
IP address
limiting bandwidth access
to spammers
28
Table 2.5: Recommended Combination for Pattern-Based Filtering Technique
(Continue)
Solutions
Advantages
Sender Policy Framework Will only allow valid IP
Extra Requirements
None
address representing a
valid domain to send emails
Recipient Validation
Will drop connections
None
with invalid users thus
preventing the CPU from
being used to scan the email contents
DNSBL
Good at preventing a
Local DNS cache to
known spammer sources
eliminate dependencies
from sending e-mails
with remote DNSBL
server
NASC
Good for content based
Require bigger RAM size
checking with better
performance
Therefore, the next chapter is introducing the methodologies to use in
developing the suggested solution and the solution to be tested in a control
environment in order to the objective is met. However, the thesis is focusing on the
filtering mechanism at network level. It is undeniable the network level filtering only
compliment the content filter, but it will also help to reduce the processor’s load thus
giving a more power to the content filter to work efficiently.
CHAPTER 3
METHODOLOGY
This chapter puts the proposal into realization by forming a methodology to
develop an antispam system by using pattern-based filtering technique.
3.1
Introduction
This chapter discusses the proposed methodology in developing an antispam
system by using pattern-based filtering technique. Each component that was selected
in previous chapter will be mapped into the system architecture depending on the
functionality and its requirement. This AntiSpam with Pattern-Based Filter (ASPBF)
is expected to open an opportunity to a better antispam solution.
3.2
Methodology
The method to be used in this study is to implement antispam with patternbased filtering technique, which will protect the system at network level, e-mail
header level and content level. Figure 3.1 and figure 3.2 show flow of process and
architecture of the proposed system.
30
3.2.1 Traffic Shaping
As we can see from figure 3.1, each incoming connection will be analysed for
its historical behaviour and record of rejection to classify them as trusted or not
trusted sources, where the trusted sources will be allowed to make the connection,
while the untrusted sources will be rejected for a period of time as determined. When
other filters in the same system reject an IP address, the IP address will be recorded
in the traffic-shaping database. Every rejection will increase the counter of the IP
address in the database, if the counter is above the preset threshold, the IP address
will be rejected. The counter will be reset if the duration of rejection has elapsed.
3.2.1.1 Functionalities and its preset conditions requirements.
i.
maxSMTPipDuration, is the duration to allow connection per IP address.
ii.
maxSMTPipConnects, is The maximum number of SMTP connections an IP
Address can make within the maxSMTPipDuration.
iii.
maxSMTPipExpiration, is the duration to keep IP addresses rejected by
maxSMTPipConnect before they are allow to create new connections.
iv.
Record each connecting IP address
v.
IPNumTries, is the counter for the occurrence of the IP address within
maxSMTPipDuration.
vi.
Reject connection if IPNumTries is more than maxSMTPipConnects within
maxSMTPipDuration.
vii.
Reject IP address detected by item (iii) until the time reaches
maxSMTPipExpiration.
viii.
Reset the IPNumTries if an ip reconnect after maxSMTPipExpiration.
ix.
Each IP address with bad history in the next filters will be kept in the same
cache and be rejected until the expiration time.
31
3.2.2 Sender Reputation
If an IP address is not rejected by the traffic shaping process, it will then be
processed and validated against its reputation. The database of sender reputation is
maintained by SenderBase (www.senderbase.org), it is a collection of IP addresses
as collected by the sensors located every in the world. The sensors collected IP
addressed that are found to send spam or being a zombie machines and control by
others through a botnet technique. The result of an IP address can be resolved
through a DNS query that is made to the SenderBase’s DNS servers. However, in
order to make the query faster, each result will be cached locally and will be
removed once the duration to keep it has elapsed the preset threshold.
Records kept by SenderBase contain blacklist and whitelist with weightage,
therefore it is recommended to monitor the accuracy of the records before we can
decide to block or allow the traffic. Eventhough it will not block the connection, but
adding a score to the final result will also help to filter spam.
The followings describe the functionality of the method:
i.
given an IP address of the sender, Address, a dns record is queried from the
provider of Sender Reputation, Host. This is done by function
SenderBaseOK, which will return a value of 0 or 1.
ii.
A value returned by the Host will identify the level of reputation the Address
has and shall be treated accordingly, either to block or to let it pass through.
iii.
The Address and its reputation value will be kept in cache until its value of
expiration is exceeded. The function keep in the cache is SBCacheAdd, whilst
to remove it is cleanCacheSB. This will ensure the next query will be faster
and less dependent on the external sources.
3.2.3 Sender Policy Framework
The next filter is to verify the SPF record of the matching IP address and
Domain Name. Each domain owner has the capability to register in the DNS their
32
trusted sources to deliver e-mails of their domain. Therefore if an IP address is
matched with its sender’s domain, the connection will be granted to pass through.
The followings are the functionality of this method:
i.
There are two items required to verified the validity of the Incoming IP
address, first is the IP address, second is the domain name of the sender.
ii.
If the SPF record of the domain has verified that the incoming IP address is a
trusted source, the connection will be granted. The function to do verification
is SPFok, which will return 0 or 1.
iii.
All result from the SPF request will be cached by function SPFCacheAdd and
a record will be kept in the cache untill its expiry time and to be removed by
function cleanCacheSPF.
Figure 3.1: System Architecture (Part 1)
33
3.2.4 DNS BlackList (DNSBL)
Figure 3.2 show the next filters as a continuity of the Figure 3.1. DNS Black
List (DNSBL), also known as RealTime BlackList, is almost similar to SenderBase,
but it will only keep the blacklisted IP addresses, while SenderBase keep both
blacklist and whitelist of IP addresses with weightage. The sources of the blacklist
are maintained by various parties and up to the users to choose from. But the popular
sources are spamhaus.org and spamcop.net, where most Internet users recommend
using due to its reliability. Volunteer runs this kind of organizations, thus the
reliability and accuracy of the data are still questionable even though it was said to
be more then 80% accurate.
The followings describe the functionality of the method:
i.
given an IP address of the incoming connection, it is checked against a DNS
record provided by the RBL provider. There are currently three (3) provider
used for this application: (1) bl.spamcop.net, (2) list.dsbl.org, (3)
zen.spamhaus.org.
ii.
the function to do the checking is RBLok which will return 0 or 1, 0 for a bad
status and 1 for a good status.
Figure 3.2: System Architecture (Part 2)
34
3.2.5 Recipient Validation
The last filter as describe in Figure 3.2 is the Recipient Validation. Each email system should have its own database of users, thus this database can be utilized
by the proposed antispam in ensuring only e-mail with valid recipient is processed,
avoiding unnecessary load to the server.
The following is the functionality of this method:
i.
Each incoming e-mail will be checked against LDAP for the validity of the
recipient, if the recipient cannot be found in LDAP, the e-mail is rejected.
This is to ensure there is no need for a further processing and to save CPU
load.
ii.
Function to do the LDAP query is LDAPquery, which will return 0 for a
failure and 1 if succeeded.
iii.
Each result returned by the LDAP server will be kept in a memory cache by
an array LDAPlist.
3.3
Improvement To Expect
As per figures above, we can identify which part to improve from old-fashion
antispam, and they are:
i.
Number of e-mails to scan based on content will reduce as more e-mails are
eliminated at network level.
ii.
It is not just to reduce the total messages to scan but also reduce the CPU
load by eliminating the untrusted sources from sending messages at network
level.
iii.
35
In normal practice, 100% of incoming messages will be processed and
scanned for spam, however using this ASPBF it is expected to reduce the
number of messages to be scanned.
iv.
The ultimate expectation of this simulation is to prove the techniques
embraced by ASPBF will help to improve overall process of filtering spam.
3.4
Sample Data
Data to be used in this test will be categorized into the following
characteristics:
i.
Spam coming from IP addresses that are recognised as spam sources.
This type of spam should be eliminated at network level, preventing further
processing load.
ii.
Spam or e-mail with invalid recipient.
The connection will be terminated as soon as the recipient in the e-mail
header is found to be invalid.
iii.
Spam in plain text
The plaintext spam is the basic form of spam. This can be used to measure
the performance of the content-based scanning.
iv.
Spam in mix of text and image
This will require both content scanning and image scanning, where Bayesian
is possibly to give a negative result.
v.
36
Spam in image
Bayesian filter will apparently fail to capture the spam.
3.5
Requirements for the simulation
The proposed antispam system will be developed using a Perl software and
the programming language is Perl language, as it is the language used by
SpamAssassin to integrate with Bayesian filter. Therefore, it will be easier to
develop the antispam with Bayesian for benchmarking with the new antispam
system. Furthermore, this system which is developed in Perl can easily be integrated
with any existing e-mail software.
A unix machine is recommended in the development of the said system as
well as to run the benchmark due to:
i.
Other unnecessary application that is running in the Operating System can be
closed as to avoid it from affecting the performance benchmarking process.
ii.
The antispam system will be developed using Perl which is more suitable to
run in Unix-based system. Perl also being used to develop the previous
antispam (SpamAssassin), therefore using the same platform will provide a
fair competition.
3.5.1 Development Tools
As to ensure the Antispam with Pattern-Based Filter (ASPBF) is comparable
to existing SpamAssassin with Bayesian from the view of processing load and
performance, Perl programming language will be used to reduce biasness when
comparing it with SpamAssassin which is also developed using the same
programming language.
37
Although applications developed in Perl are not performing better that those
developed in C in term of speed of execution, but the objective of this development
is to show results from both systems, one with ASPBF another one with Bayesian.
Comparing those results will potentially prove that an improvement can be made to
existing antispam solution. In this development, multi-threaded programming is used
to ensure the application is able to handle more concurrent connections.
3.5.2 Prototyping
The ASPBF was developed using Perl programming language with all the
necessary modules were pre-loaded and included in the main program. The logics of
each of the filter will be described below.
i.
Traffic Shaping
The records of IP address and their history are kept in local cache, each
record will have its expiry time. Exceeding the expiry time will cause the
record to be removed from the cache.
sub PBOK {
my($fh,$myip,$lvl)=@_;
my $this=$Con{$fh};
$myip = $this->{cip} if $this->{ispip} && $this->{cip};
d('PBOK');
$this->{prepend}="";
my $t=time;
my $ip=ipNetwork($myip,($PenaltyUseNetblocks ? 24 : 32));
return 1 if (!exists $PBBlack{$ip});
my($ct,$ut,$level,$totalscore,$sip,$reason)=split("
",$PBBlack{$ip});
Figure 3.3: Function of Traffic Shaping (Part 1)
38
$this->{mypbreason}="totalscore for $this->{ip} is $totalscore,
last penalty was '$reason'";
return 1 if $totalscore<$PenaltyLimit;
$this->{prepend}="[PenaltyBox]";
mlog( $fh, "[monitoring] $this->{mypbreason}" ) if $DoPenalty ==
2 || $DoPenalty == 3;
return 1 if $DoPenalty==2 || $DoPenalty==3;
return 0;
}
Figure 3.4: Function of Traffic Shaping (Part 2)
ii.
Sender Reputation
The record of each IP address will be tested with a SenderBase’s provider, in
this study, the provider is test.senderbase.org. Provided with an IP address:
•
Check the cache for a record of that IP address
•
If not found in the cache, do DNS query to test.senderbase.org
•
Receive the value, if the value falls under whitelist, record the
information in Penalty Box but as a whitelist IP address. Otherwise if
a blacklist value is received, record it in the Penalty Box as
blacklisted.
#queries the SenderBase service
sub SenderBaseOK {
my ( $fh, $ip, $domain ) = @_;
my $this = $Con{$fh};
$ip = $this->{cip} if $this->{ispip} && $this->{cip};
my $this = $Con{$fh};
d('SenderBaseOK');
my $results;
my $query;
my $cache;
Figure 3.5: Function of Sender Reputation (Part 1)
39
my $skip;
my $tlit;
return 1 if !$CanUseSenderBase;
return 1 if $this->{ip} =~ /$IPprivate/;
#return 1 if $this->{contentonly};
my ( $ipcountry, $orgname, $domainname ) = split( "-",
SBCacheFind($ip) );
my $blacklistscore;
my $fortune1000;
my $domainrating;
my $bondedsender;
my $domainname;
my $resultip;
if ( !SBCacheFind($ip) ) {
eval {
$query = Net::SenderBase::Query->new(
Transport => 'dns',
Address
=> $ip,
Host
=> 'test.senderbase.org',
Timeout
=> 10,
);
$results = $query->results;
};
if ($@) {
&sigon();
return 1;
}
$bondedsender
= $results->ip_in_bonded_sender;
$blacklistscore = $results->ip_blacklist_score;
$orgname = $results->org_name;
$resultip
= $results->ip;
$fortune1000
= $results->org_fortune_1000;
$domainname
= $results->domain_name;
$domainrating = $results->domain_rating;
Figure 3.6: Function of Sender Reputation (Part 2)
40
$ipcountry = $results->ip_country;
SBCacheAdd( $ip, 1, "$ipcountry-$orgname-$domainname" );
&sigon();
} else {
$cache = 1;
}
if ($DoOrgBlocking) {
$this->{prepend} = "[Organization]";
$tlit = "[monitoring]" if $DoOrgBlocking == 2;
$tlit = "[scoring]" if $DoOrgBlocking == 3;
if ($orgname =~ ( '(' . $whiteSenderBaseRE . ')' ) ||
$domainname =~ ( '(' . $whiteSenderBaseRE . ')' ))
{
mlogRe( $fh, $1, "WhiteOrg" );
$this->{noprocessing} = 1;
$this->{noprocessingreason}
= "senderbase
regex";
pbWhiteAdd( $fh, $ip, "WhiteSenderBase:$1" );
my $nsborgValencePB = 0 - $sborgValencePB if
$sborgValencePB;
pbAdd( $fh, $ip, $nsborgValencePB,
"WhiteSenderBase:$1" ) if $DoOrgBlocking != 2;
mlog( $fh, "$tlit White
Organization in SenderBase:$1", 1 ) if $SenderBaseLog == 2;
return 1;
}
if ($orgname =~ ( '(' . $blackSenderBaseRE . ')' ) ||
$domainname =~ ( '(' . $blackSenderBaseRE . ')' ))
{
mlogRe( $fh, $1, "BlackOrg" );
pbWhiteDelete( $fh, $ip, "BlackSenderBase:$1" );
pbAdd( $fh, $ip, $sborgValencePB,
"BlackSenderBase:$1" ) if $DoOrgBlocking != 2;
$this->{messagereason} = "Black
Organization in SenderBase:$1";
Figure 3.7: Function of Sender Reputation (Part 3)
41
$ipcountry = $results->ip_country;
SBCacheAdd( $ip, 1, "$ipcountry-$orgname-$domainname" );
&sigon();
} else {
$cache = 1;
}
if ($DoOrgBlocking) {
$this->{prepend} = "[Organization]";
$tlit = "[monitoring]" if $DoOrgBlocking == 2;
$tlit = "[scoring]" if $DoOrgBlocking == 3;
if ($orgname =~ ( '(' . $whiteSenderBaseRE . ')' ) ||
$domainname =~ ( '(' . $whiteSenderBaseRE . ')' ))
{
mlogRe( $fh, $1, "WhiteOrg" );
$this->{noprocessing} = 1;
$this->{noprocessingreason}
= "senderbase
regex";
pbWhiteAdd( $fh, $ip, "WhiteSenderBase:$1" );
my $nsborgValencePB = 0 - $sborgValencePB if
$sborgValencePB;
pbAdd( $fh, $ip, $nsborgValencePB,
"WhiteSenderBase:$1" ) if $DoOrgBlocking != 2;
mlog( $fh, "$tlit White
Organization in SenderBase:$1", 1 ) if $SenderBaseLog == 2;
return 1;
}
if ($orgname =~ ( '(' . $blackSenderBaseRE . ')' ) ||
$domainname =~ ( '(' . $blackSenderBaseRE . ')' ))
{
mlogRe( $fh, $1, "BlackOrg" );
pbWhiteDelete( $fh, $ip, "BlackSenderBase:$1" );
pbAdd( $fh, $ip, $sborgValencePB,
"BlackSenderBase:$1" ) if $DoOrgBlocking != 2;
$this->{messagereason} = "Black
Organization in SenderBase:$1";
Function 3.8: Function of Sender Reputation (Part 4)
42
return 0 if $DoOrgBlocking == 1;
mlog( $fh, "$tlit $this->{messagereason}", 1 ) if
$SenderBaseLog == 2;
}
}
return 1 if !$DoCountryBlocking;
return 1 if $ipcountry =~ $NoCountryCodeReRE;
$this->{mycountry} = 0;
if ( $ipcountry && $DoCountryBlocking
&& ($ipcountry =~ $CountryCodeBlockedReRE ||
($CountryCodeBlockedReRE =~ /all|\*/i && $ipcountry !~
$MyCountryCodeReRE && $ipcountry !~ $CountryCode
ReRE))){
$this->{messagereason} = "Blocked Country $ipcountry $orgname";
$this->{prepend} = "[CountryCode]";
$tlit = "[monitoring]" if $DoCountryBlocking == 2;
$tlit = "[scoring]" if $DoCountryBlocking == 3;
pbAdd( $fh, $ip, $bccValencePB, "BlockedCountry$ipcountry") if $DoCountryBlocking != 2;
mlog( $fh, "$tlit $this->{messagereason}", 1 ) if
$DoCountryBlocking == 2 || $DoCountryBlocking == 3;
SBCacheAdd( $ip, 1, "$ipcountry-$orgname-$domainname" );
return 0 if $DoCountryBlocking == 1 || $DoCountryBlocking
== 4;
return 1;
}
return 1 if !$DoSenderBase;
return 1 if !$CountryCodeRe && !$MyCountryCodeReRE;
$tlit = "[monitoring]" if $DoSenderBase == 2;
$tlit = "[scoring]" if $DoSenderBase == 3;
Function 3.9: Function of Sender Reputation (Part 5)
43
$sbhccValencePB = 0 - $sbhccValencePB if $sbhccValencePB;
if (
$sbhccValencePB < 0
&& $ipcountry
&& $MyCountryCodeReRE
&& $ipcountry =~ $MyCountryCodeReRE ) {
$this->{prepend}
= "[CountryCode]";
$this->{mycountry}
= 1;
$this->{messagereason} = "Home Country Bonus $ipcountry $orgname";
mlog( $fh, "$tlit $this->{messagereason}", 1 ) if
$SenderBaseLog == 2;
pbAdd( $fh, $ip, $sbhccValencePB, "HomeCountry-$ipcountry"
)
if $DoSenderBase != 2;
return 1;
}
if (
$sbfccValencePB
&& $ipcountry
&& $MyCountryCodeReRE
&& $ipcountry !~ $MyCountryCodeReRE
&& !( $CountryCodeReRE && $ipcountry =~ $CountryCodeReRE )
) {
$this->{messagereason} = "Foreign Country $ipcountry $orgname";
pbAdd( $fh, $ip, $sbfccValencePB, "CountryCode-$ipcountry",
1 )
if $DoSenderBase != 2;
$this->{prepend} = "[CountryCode]";
mlog( $fh, "$tlit $this->{messagereason}", 1 ) if
$SenderBaseLog == 2;
return 1;
}
if (
$sbfccValencePB
&& $ipcountry
&& $MyCountryCodeReRE
Function 3.10: Function of Sender Reputation (Part 6)
44
&& $ipcountry !~ $MyCountryCodeReRE
&& ( $CountryCodeReRE && $ipcountry =~ $CountryCodeReRE ) )
{
$this->{messagereason} = "Foreign & Suspicious Country
$ipcountry - $orgname";
pbAdd( $fh, $ip, $sbfccValencePB + $sbsccValencePB,
"CountryCode-$ipcountry", 1 )
if $DoSenderBase != 2;
$this->{prepend} = "[CountryCode]";
mlog( $fh, "$tlit $this->{messagereason}", 1 ) if
$SenderBaseLog;
return 1;
}
if (
$sbsccValencePB
&& $ipcountry
&& $CountryCodeReRE
&& $ipcountry =~ $CountryCodeReRE ) {
$this->{messagereason} = "Suspicious Country $ipcountry -
$orgname";
pbAdd( $fh, $ip, $sbsccValencePB, "CountryCode-$ipcountry",
1 )
if $DoSenderBase != 2;
$this->{prepend} = "[CountryCode]";
mlog( $fh, "$tlit $this->{messagereason}", 1 ) if
$SenderBaseLog;
return 1;
}
mlog( $fh, "SenderBase showing:
$ipcountry - $orgname", 1 )
if $SenderBaseLog == 2 && $ipcountry;
return 1;
}
Function 3.11: Function of Sender Reputation (Part 7)
45
iii.
Sender Policy Framework
This will require a SPF module version 2 to ensure SPF records can be
extracted regardless of which version of SPF records are used.
Provided with an IP address and a domain name of the sender
•
Check the matching record with local cache
•
If not found, do DNS query from the known DNS server that keep the
records of the domain
•
Receive the SPF record
•
Record the information into the cache for future use
•
If a fail value is received, record in Penalty Box Cache for later use by
Traffic Shaping.
Return the value of the SPF record to the calling function
•
# uses Mail::SPF v2.005
sub SPF2ok {
my $fh = shift;
my $this = $Con{$fh};
my $ip
= $this->{ip};
my $strict;
my $block;
$ip = $this->{cip} if $this->{ispip} && $this->{cip};
my $helo = $this->{helo};
$helo = $this->{ciphelo} if $this->{ispip} && $this->{cip};
undef $helo if $this->{ispip} && $this->{cip} && !$this>{ciphelo};
&sigon();
my $mValidateSPF = $ValidateSPF;
$mValidateSPF = 3
if $switchTestToScoring
&& $DoPenaltyMessage
&& ( $spfTestMode || $allTestMode );
my $tlit = "";
$tlit = "[monitoring]" if $mValidateSPF == 2;
$tlit = "[scoring]"
if $mValidateSPF == 3;
$this->{prepend} = "[SPF]";
my ( $spf_result, $local_exp, $authority_exp, $spf_record,
$spf_fail, $received_spf );
my $mf = lc $this->{mailfrom};
my $mfd = $1 if $mf =~ /\@(.*)/;
my ( $cachetime, $cresult, $cdomain, $chelo ) =
SPFCacheFind($ip);
Function 3.12: Function of Sender Policy Framework (Part 1)
46
if ($cachetime) {
#check for valid cache hit
#
mlog( $fh, "SPF $tlit: (cache) $cresult $cdomain
$chelo", 1, 1 ) if $SPFLog == 2;
if ( ( $mfd eq $cdomain ) && ( lc $helo eq $chelo or $ip eq
$this->{cip}) ) {
$spf_result = $cresult;
}
}
if ( !$spf_result ) {
$cresult = "";
my $query;
eval {
my ( $identity,
if ($mfd) {
$identity =
$scope
=
} else {
$identity =
$scope
=
}
$scope );
$this->{mailfrom};
'mfrom';
$helo;
'helo';
my $res = Net::DNS::Resolver->new(
nameservers => \@nameservers,
tcp_timeout => $DNStimeout,
udp_timeout => $DNStimeout,
retrans
=> $DNSretrans,
retry
=> $DNSretry
);
my $spf_server = Mail::SPF::Server->new(
hostname
=> $myName,
dns_resolver => $res
);
my $request = Mail::SPF::Request->new(
versions
=> [ 1, 2 ],
scope
=> $scope,
identity
=> $identity,
ip_address
=> $ip,
helo_identity => $helo
);
my $result = $spf_server->process($request);
$spf_record = $request->record;
$spf_result
= $result->code;
$local_exp
= $result->local_explanation;
$authority_exp = $result->authority_explanation
if $result->is_code('fail');
$received_spf = $result->received_spf_header;
};
Function 3.13: Function of Sender Policy Framework (Part 2)
47
#exception check
if ($@) {
mlog( $fh, "SPF2OK: $@", 1, 1 ) if $ExceptionLogging;
&sigon();
return 1;
}
&sigon();
if ( $SPFCacheInterval > 0 && ( $spf_result !~ /error/ ) )
{
SPFCacheAdd( $ip, $spf_result, $mfd, $helo );
}
}
if (
$spf_result eq 'fail'
|| $spf_result eq 'softfail' && $SPFsoftfail
|| $spf_result eq 'softfail'
&& $strict
|| $spf_result eq 'neutral' && $SPFneutral
|| $spf_result eq 'neutral'
&& $strict
|| $spf_result
|| $spf_result
|| $spf_result
) {
$spf_fail = 1;
pbWhiteDelete(
eq 'none' && $SPFnone
=~ /unknown/ && $SPFunknown
=~ /error/ && $SPFqueryerror
$fh, $ip );
} elsif ( $spf_result eq 'pass' ) {
$spf_fail = 0;
$this->{spfok} = 1;
}
$received_spf = "SPF: $spf_result";
$received_spf .= " (cache)" if $cresult;
$received_spf .= " ip=$ip";
$received_spf .= " mailfrom=$this->{mailfrom}"
if ( defined( $this->{mailfrom} ) );
$received_spf .= " helo=$helo" if ( defined($helo) );
mlog( $fh, "$tlit $received_spf", 0, 1 ) if $SPFLog;
return 1 if $mValidateSPF == 2;
$this->{messagereason} = "SPF $spf_result";
Function 3.14: Function of Sender Policy Framework (Part 3)
48
if ( $spf_result eq 'neutral' && $spfnValencePB ) {
pbAdd( $fh, $ip, $spfnValencePB, "SPF$spf_result" );
} elsif ( $spf_result eq 'softfail' && $spfsValencePB ) {
pbAdd( $fh, $ip, $spfsValencePB, "SPF$spf_result" );
} elsif ( $spf_result eq 'none' && $spfnonValencePB ) {
pbAdd( $fh, $ip, $spfnonValencePB, "SPF$spf_result" );
} elsif ( $spf_result =~ /unknown/ && $spfuValencePB ) {
pbAdd( $fh, $ip, $spfuValencePB, "SPF$spf_result" );
} elsif ( $spf_result =~ /error/ && $spfeValencePB ) {
pbAdd( $fh, $ip, $spfeValencePB, "SPF$spf_result" );
} elsif ( $spf_fail == 1 && $spfValencePB ) {
pbAdd( $fh, $ip, $spfValencePB, "SPF$spf_result" );
}
$this->{messagereason}.$strict if $strict;
return 1 if $mValidateSPF == 3 && !$block;
if ( $spf_fail == 1 ) {
# SPF fail (by our local rules)
my $reply = $SPFError;
$reply =~ s/SPFRESULT/$local_exp/g;
$Stats{spffails}++ unless $slok;
$this->{prepend} = "[SPF]";
thisIsSpam( $fh, "SPF $spf_result", $SPFFailLog, $reply,
$spfTestMode, $slok, 0 );
return 0;
}
return 1;
}
Function 3.15: Function of Sender Policy Framework (Part 4)
iv.
DNS Blacklist (DNSBL)
DNSBL also known as Realtime Blacklist (RBL), provided an IP address:
•
Check the IP in the cache
•
If not found, do DNS query from a list of DNSBL providers
•
Receive the value, if any value is received, the IP address is
considered as a blacklisted IP address. If no value is received after a
timeout, the IP is considered as clean.
•
Keep records of the information in local cache
•
Keep the blacklisted IP address into Penalty Box cache
•
Return the value to the calling function
49
sub RBLok {
my ($fh) = @_;
my $this = $Con{$fh};
my $reason;
my $rblweighttotal;
my $ip = $this->{ip};
$ip = $this->{cip} if $this->{ispip} && $this->{cip};
return 1 if $this->{rbldone};
$this->{rbldone} = 1;
d('RBLok');
return 1 if !$ValidateRBL;
return 1 if !$CanUseRBL;
##return 1 if $this->{contentonly};
return 1 if $this->{whitelisted} && !$RBLWL;
my $slok
= $this->{allLoveRBLSpam} == 1;
my $mValidateRBL = $ValidateRBL;
$this->{testmode} = $rblTestMode || $allTestMode;
$mValidateRBL = 3
if $ValidateRBL==1 && $switchTestToScoring &&
$DoPenaltyMessage && $this->{testmode};
my $tlit;
$tlit = ""
if $mValidateRBL == 1;
$tlit = "[monitoring]" if $mValidateRBL == 2;
$tlit = "[scoring]"
if $mValidateRBL == 3;
$this->{prepend} = "[DNSBL]";
my $rbl = eval {
RBL->new(
lists
=> [@rbllist],
server
=> $nameservers[0],
max_hits
=> $RBLmaxhits,
max_replies => $RBLmaxreplies,
query_txt
=> 1,
max_time
=> $RBLmaxtime,
timeout
=> $RBLsocktime
);
};
Function 3.16: Function of DNS Blacklist (Part 1)
50
# add exception check
if ($@) {&sigon();return 1; }
my ( $received_rbl, $rbl_result, $lookup_return );
$lookup_return = $rbl->lookup( $ip, "RBL" );
&sigon();
mlog($fh,"error: RBL check failed : $lookup_return") if
($lookup_return ne 1);
return 1 if ($lookup_return ne 1);
my @listed_by = $rbl->listed_by();
my $rbls_returned = $#listed_by + 1;
if ( $rbls_returned > 0 ) {
foreach (@listed_by) {
$rblweighttotal += weight($rblweight{$_}) if
$rblweight{$_};
}
my $rblweight = $rblValencePB;
my $rblweightn = $rblnValencePB;
$rblweight = $rblweightn = $rblweighttotal if
$rblweighttotal;
$reason = $this->{messagereason} = "";
if ( $rbls_returned >= $RBLmaxhits && !$rblweighttotal ||
$rblweighttotal >= $RBLmaxweight) {
pbWhiteDelete( $fh, $ip );
$this->{messagereason} = "DNSBL: failed, $ip listed in
@listed_by";
pbAdd( $fh, $ip, $rblweight, "DNSBLfailed" )
if $mValidateRBL != 2;
$received_rbl = "DNSBL: failed, $ip listed in (";
RBLCacheAdd( $ip,
"1", "@listed_by" ) if $RBLCacheExp
> 0;
Function 3.17: Function of DNS Blacklist (Part 2)
51
} else {
pbWhiteDelete( $fh, $ip );
$this->{messagereason} = "DNSBL: neutral, $ip listed in
@listed_by";
$this->{prepend}
= "[DNSBL]";
mlog( $fh, "[scoring] DNSBL: neutral, $ip listed in
@listed_by" )
if ( $RBLLog && $mValidateRBL == 1 );
pbAdd( $fh, $ip, $rblweightn, "DNSBLneutral" )
if $mValidateRBL != 2;
$this->{rblneutral} = 1;
RBLCacheAdd( $ip,
"1", "@listed_by" ) if $RBLCacheExp
> 0;
$received_rbl = "DNSBL: neutral, $ip listed in (";
}
foreach (@listed_by) {
$received_rbl .= "$_<-" . $rbl->{results}->{$_} . "; ";
}
$received_rbl .= ")";
} else {
$received_rbl = "DNSBL: pass";
RBLCacheAdd( $ip,
"2") if $RBLCacheExp > 0;
return 1;
}
mlog( $fh, "$tlit ($received_rbl)" ) if $received_rbl ne
"DNSBL: pass" && ($RBLLog == 2 || $RBLLog && $mValidateRBL >= 2 );
return 1 if $mValidateRBL == 2;
# add to our header; merge later, when client sent own headers
$this->{myheader} .= "X-Assp-$received_rbl\r\n"
if $AddRBLHeader && $received_rbl ne "DNSBL: pass";
if ( $rbls_returned >= $RBLmaxhits && !$rblweighttotal ||
$rblweighttotal >= $RBLmaxweight) {
my $slok = $this->{allLoveRBLSpam} == 1;
$Stats{rblfails}++;
Function 3.18: Function of DNS Blacklist (Part 3)
52
return 1 if $mValidateRBL == 3;
my $reply = $RBLError;
$reply =~ s/RBLLISTED/@listed_by/g;
$this->{prepend} = "[DNSBL]";
thisIsSpam( $fh, "DNSBL, $ip listed in @listed_by",
$RBLFailLog, "$reply", $rblTestMode, $slok, ( $slok ||
$rblTestMode || $this->{spamlover} ) );
return 0;
}
return 1;
}
Function 3.19: Function of DNS Blacklist (Part 4)
v.
Recipient Validation
Since the information of users in the local domain is kept in LDAP server, an
LDAP query will be done. Provided with an e-mail address of the recipient:
•
Check with local cache for the existance of the e-mail address.
•
If not found, do query to LDAP server
•
If found return value 1. If no record found, return value 0 to the
calling function
•
Keep the e-mail address in the cache
sub LDAPQuery {
my ($ldapflt, $ldaproot, $current_email) = @_;
my $retcode;
my $retmsg;
my @ldaplist;
my $ldaplist;
my $ldap;
my $mesg;
my $entry_count;
$current_email = lc($current_email);
if (exists $LDAPlist{$current_email}) {
mlog(0,"LDAP - found $current_email in LDAPlist") if $LDAPLog;
d("found $current_email in LDAP-cache");
{lock($lockLDAPlist);
my ($vt,$vl) = split(/ /,$LDAPlist{$current_email});
Function 3.20: Function of DNS Blacklist (Part 1)
53
if ($vl) {
$LDAPlist{$current_email}=time." $vl";
} else {
$LDAPlist{$current_email}=time;
}
}
return 1;
}
d("doing LDAP lookup with $ldapflt in $ldaproot");
@ldaplist = split(/\|/,$LDAPHost);
$ldaplist = \@ldaplist;
my $scheme = $DoLDAPSSL? 'ldaps' : 'ldap';
$ldap = Net::LDAP->new( $ldaplist, timeout => $LDAPtimeout,
scheme => $scheme );
if(! $ldap) {
mlog(0,"Couldn't contact LDAP server at $LDAPHost -- check
ignored");
# seterror($fh,"451 Could not check recipient, try later\r\n",1);
&sigon();
return !$LDAPFail;
}
# bind to a directory anonymous or with dn and password
if ($LDAPLogin) {
$mesg = $ldap->bind($LDAPLogin, password => $LDAPPassword,
version => $LDAPVersion);
} else {
# mlog($fh,"LDAP anonymous bind");
$mesg = $ldap->bind( version => $LDAPVersion );
}
$retcode = $mesg->code;
if ($retcode) {
mlog(0,"LDAP bind error: $retcode -- check ignored",1);
$ldap->unbind;
&sigon();
return !$LDAPFail;
}
# perform a search
$mesg = $ldap->search(base
=> $ldaproot,
filter => $ldapflt,
attrs => ['cn'],
sizelimit => 1
);
$retcode = $mesg->code;
&sigon();
# mlog($fh,"LDAP search: $retcode");
if($retcode > 0 && $retcode != 4) {
mlog( 0, "LDAP search error: $retcode -- '$ldapflt' check
ignored", 1 ) ;
#seterror($fh,"451 Could not check recipient, try later\r\n",1);
$ldap->unbind;
return !$LDAPFail;
}
Function 3.21: Function of DNS Blacklist (Part 2)
54
$entry_count = $mesg->count;
$retmsg = $mesg->entry(1);
mlog(0,"LDAP Results $ldapflt: $entry_count : $retmsg") if
$LDAPLog;
d("got $entry_count result(s) from LDAP lookup");
$mesg = $ldap->unbind; # take down session
if($entry_count && $ldaplistdb) {
lock($lockLDAPlist);
$LDAPlist{$current_email}=time;
mlog(0,"LDAP added $current_email to LDAPlist") if $LDAPLog;
d("added $current_email to LDAP-cache");
}
return $entry_count;
}
Function 3.22: Function of DNS Blacklist (Part 3)
3.5.3
Implementation
ASPBF was installed in a company that provided e-mail services. However,
the simulation was carried out in a control environment where a small percentage of
traffics are diverted from the internet to the machine that host ASPBF and another
percentage to be redirected to another machine hosting the antispam without ASPBF.
This was a better simulation compared to a lab simulation where the real traffic
playing the roles and all filtering components would work as they were expected,
while lab simulation may not provide more accurate result.
3.5.4 Simulation Environment
The simulation was conducted in a corporate environment, whose business is
providing email services to internet users. They receives about one million e-mails a
day with estimated about 70% of the total are spam. They have offered to host the
simulation in their environment but they wanted to keep themselves anonymous.
Having this simulation run in the real corporate environment had given
opportunities to have a result that was nearly accurate and helped to reduce efforts
required in preparing a laboratory environment for the same simulation.
55
It was run in a Solaris machine with a 4-core Sparc CPU. The advantage of
using this model was it can provide enough cpu power to host a multi-threaded
application. Each core of the 4-core CPU contains 4 thread processors. Thus,
altogether are 16 CPUs available to the application.
During the simulation, there are two machines used, first for ASPBF and
second is for antispam with Bayesian. The first machine has received about 90,807
e-mail connections in a day equivalent to 10% of total traffic. The second machine
recive about 140,000 e-mail connection equivalent to 20% of total traffic.
As this test was conducted in real enterprise environment and in order to
reduce the risk of being flooded with spam, the test for ASPBF was reduced to only
accept 10% in ensuring less spam to go through as result from no scanning for the
content. At the meantime the second machine will receive 20% due to it will survive
the incoming spam as it will process the content for spam and eliminate them.
Due to heavy load, the application was set to run in multithreaded mode with
a total of 64 threads and this would enable the application to receive a total of 64
concurrent connection. The actual machines are capable to receive about 120
concurrent connections. However, due to lacking of memory management, this test
was limited to 64 concurrent connection only.
3.6
Test Methodology
The test will be carried out by generating traffics with various spam
characteristics, which include:
i.
IP addresses that are rejected by Traffic Shaping
ii.
IP addresses that are rejected by Sender Reputation
iii.
IP addresses that can generate a number of traffic in a short period of time
iv.
IP addresses and domain names that are rejected due to invalid SPF record
v.
E-mails with Invalid recipients
56
Figure 3.23: Placement of the test components
Figure 3.23 shows the placement of the test data, the ASPBF and the
expected outcome. The total of e-mail filter by ASPBF and total e-mail pass through
the filter are crucial in identifying the effectiveness of the filter.
Although the test can be carried out in a lab environment, but the best way to
do it is through real traffic. This will reduce the time to develop the tester program
that able to change IP addresses, change e-mail recipients, change e-mail sender,
populate database with users, populate domain with SPF records and many more that
require more attention than developing the solution.
The total number of e-mails that can pass through will determine the amount
of load required to process the e-mail based on its contents. At the same time, the
total connections dropped will determine the number of unnecessary connection
being dropped to save the processor’s load.
3.7
Summary
Method as described in this chapter was designed in order to meet the
objectives as mentioned in Chapter 1. Having the right structure, the method might
be able to improve overall performance as it introduced elimination process at
network level, less CPU load but faster processing due to less intensive calculation
and in-memory scanning, and to improve catch rate as Bayesian filter was known to
have problem processing image spam. Combination of the mentioned improvements
will lead to a better result as it reduces potential spam and better spam catch rate.
57
The elimination before content processing will also improve performance as it
utilises less CPU.
CHAPTER 4
RESULTS AND ANALYSIS
The focus of this chapter is to discuss briefly the result of the simulation and
analysis of the result in identifying the necessity of the solution in the real world to
fight against spam. The result would then be useful in studying further improvement
in the future.
4.1.
Introduction
The test was carried out in a real corporate environment as described in Chapter 3,
and has given a result as expected. The result consists of outcome from two different
antispam, first was from ASPBF and the second was from an antispam with
Bayesian. Both results will be analysed and compared in order to show that an
improvement has been made from ASPBF implementation. That information will be
useful in finding a proof that it has made improvement to existing antispam and this
will be useful to improve antispam as a whole in the future.
4.2.
Result
The test would capture each of its activity into a log file. Every filter that was
used in identifying spam would be captured as well as to make it easier to know
which filter had done blocking the most.
59
Log of activity of each component of ASPBF will be shown below:
i)
DNSBL
Mar-5-09 00:16:32 83392-05790 [Worker_2] [DNSBL] 24.232.0.248
<apiarg@fibertel.com.ar> [scoring] DNSBLcache: neutral,
24.232.0.248 listed in blackholes.five-ten-sg.com
Mar-5-09 00:16:34 83394-02630 [Worker_11] [DNSBL] 116.12.52.167
<Muse00261d43001b61@email1.bdmg.museondemand.com> [scoring]
DNSBLcache: neutral, 116.12.52.167 listed in ix.dnsbl.manitu.net
Mar-5-09 00:16:38 83398-01115 [Worker_15] 209.173.98.220
<bsgrainy@chibardun.net> Message-Score: total for this message is
77, added 77 for DNSBLcache: failed, 209.173.98.220 listed in
bl.spamcop.net, psbl.surriel.com
Mar-5-09 00:16:50 83410-06242 [Worker_43] 209.133.56.154 <bounceOffers-0a.033n.4.vf08k4qas6.0@Tradepubs.nl00.net> Message-Score:
total for this message is 52, added 52 for DNSBLcache: failed,
209.133.56.154 listed in dnsbl-2.uceprotect.net, dnsbl3.uceprotect.net
Figure 4.1: Logs For DNSBL Filter
Figure 4.1 above shows which IP address was checked against DNSBL’s
records. A “neutral” result showed the IP address was a moderately
dangerous, but a “failed” status showed the IP address was a confirmed
source of spam. The log also showed which source of DNSBL was used.
Some of the reliable sources are spamcop.net, surriel.com, uceprotect.net.
60
ii)
SPF
Mar-5-09 00:16:32 83391-01219 [Worker_5] [SPF] 203.115.234.24
<newsletters@mystar.com.my> to: haiyent@LOCALDOMAIN
SPF: none
(cache) ip=203.115.234.24 mailfrom=newsletters@mystar.com.my
helo=idcsmtp.arc.net.my
Mar-5-09 00:16:39 83397-03276 [Worker_13] [SPF] 209.85.218.180
<botakchun@gmail.com> to: botak@LOCALDOMAIN
SPF: pass (cache)
ip=209.85.218.180 mailfrom=botakchun@gmail.com helo=mail-bw0f180.google.com
Mar-5-09 00:16:42 83398-03745 [Worker_19] 209.85.219.158
<mail355@webairsa.info> to: effis@LOCALDOMAIN Message-Score: total
for this message is 15, added 5 for SPF temperror
Mar-5-09 00:17:05 83417-02279 [Worker_51] 88.231.11.116
<alor@LOCALDOMAIN> to: alor@LOCALDOMAIN Message-Score: total for
this message is 37, added 10 for SPF fail
Mar-5-09 00:17:06 83408-00173 [Worker_42] 209.62.76.34
<globalsteelweb@gmail.com> to: alumac@LOCALDOMAIN Message-Score:
total for this message is 15, added 5 for SPF neutral
Mar-5-09 00:17:06 83408-00173 [Worker_42] [SPF] 209.62.76.34
<globalsteelweb@gmail.com> to: alumac@LOCALDOMAIN [spam found] (SPF
neutral) [News 04 March 09];
Figure 4.2: Logs For SPF Filter
In Figure 4.2 above, all local domain named are replaced with
“LOCALDOMAIN” as to ensure the anonymity of the corporate
environment is kept as required.
In the above example, there were attempts made by spammer to pretend to be
a local user to send spam intended for local user, but local domain is
protected by its SPF records, thus those fake senders are rejected.
61
When “SPF fail” was issued, the connection was immediately rejected as
ASPBF was certain with the SPF records of that senders’ domain and no one
else can pretend to come from that domain.
iii)
Recipient Validation
Mar-5-09 00:16:43 [Worker_20] LDAP Results (&(mail=yatie37@
LOCALDOMAIN)(mailUserStatus=active)): 0 :
Mar-5-09 00:16:43 [Worker_30] LDAP Results (&(mail=geecha@
LOCALDOMAIN)(mailUserStatus=active)): 0 :
Mar-5-09 00:16:43 [Worker_6] LDAP Results (&(mail=kamikaze125@
LOCALDOMAIN)(mailUserStatus=active)): 1 :
Mar-5-09 00:16:43 [Worker_27] LDAP Results (&(mail=matnor@
LOCALDOMAIN)(mailUserStatus=active)): 1 :
Mar-5-09 00:16:43 [Worker_30] LDAP Results (&(mail=geesee@
LOCALDOMAIN)(mailUserStatus=active)): 0 :
Mar-5-09 00:16:43 [Worker_11] LDAP Results (&(mail=chilux@
LOCALDOMAIN)(mailUserStatus=active)): 1 :
Mar-5-09 00:16:45 [Worker_37] LDAP Results
(&(mail=keemin@LOCALDOMAIN)(mailUserStatus=active)): 0 :
Mar-5-09 00:16:46 [Worker_8] LDAP Results (&(mail=paranth@
LOCALDOMAIN)(mailUserStatus=active)): 0 :
Mar-5-09 00:16:46 [Worker_16] LDAP Results (&(mail=greenhub@
LOCALDOMAIN)(mailUserStatus=active)): 0 :
Figure 4.3: Logs For Recipient Validation Filter
Recipient Validation will show either “0” or “1” as shown in Figure 4.3
above. A value of 0 mean invalid recipient while value of 1 means valid
recipient.
62
iv)
Sender Reputation
Mar-5-09 00:40:36 84833-00657 [Worker_23] 201.230.23.225
<thereseq@armkb.com> to: arbayah@tm.net.my [scoring] Blocked
Country PE - Telefonica del Peru
Mar-5-09 00:40:37 84835-09317 [Worker_20] 202.221.246.203 to:
breeranna@bluehyppo.com [scoring] Blocked Country JP - Sumitomo
Mitsui Banking Corporation
Mar-5-09 00:40:39 84837-08322 [Worker_13] 86.54.102.131
<mail.ckxkigkhzrep@interiorsuae.msgfocus.com> to: benel_m@tm.net.my
[scoring] Blocked Country GB - Adestra
Mar-5-09 00:40:39 84836-04453 [Worker_9] 201.77.114.1
<marie79@msn.com> to: rezzack@bluehyppo.com [scoring] Blocked
Country BR - Desktop Online Inform<E1>tica Ltda
Mar-5-09 00:40:40 84837-01719 [Worker_41] 77.253.151.110
<tinaw@dropscout.com> to: ara@streamyx.com [scoring] Blocked
Country PL - Netia SA
Mar-5-09 00:40:40 84838-04337 [Worker_3] 189.41.78.29
<texi@malwaredomains.com> to: arcc@tm.net.my [scoring] Blocked
Country BR Mar-5-09 00:40:43 84841-01628 [Worker_43] 81.222.243.196
<zapolar@km.ru> to: ykcheng@streamyx.com [scoring] Blocked Country
RU - JSC FP Network
Mar-5-09 00:40:43 84840-04668 [Worker_2] 92.124.181.82
<berthaJacob64@msn.com> to: pitt@bluehyppo.com [scoring] Blocked
Country RU - OJSC Sibirtelecom
Mar-5-09 00:40:44 84842-06900 [Worker_1] 203.216.226.119
<annahrichard005@yahoo.co.jp> to: ngadan@tm.net.my [scoring]
Blocked Country JP - Yahoo
Mar-5-09 00:40:46 84835-03436 [Worker_7] 220.231.207.235
<cgfjgjgjvcgdfb@126.com> to: ja3son8t@tm.net.my [scoring] Blocked
Country CN - China Motion Network Communication
Figure 4.4: Logs For Sender Reputation Filter
Figure 4.4 above shows there was some “Blocked Country” but there was no
blocking done but only doing scoring to the message. The score contributed
by this filter will be sum with scores from other filters in the final process to
classifying the message as spam or not.
63
v)
Traffic Shaping
Mar-5-09 02:25:47 91129-05313 [Worker_23] [PenaltyBox]
203.211.131.90 to: yeongwl@tm.net.my [monitoring] totalscore for
203.211.131.90 is 77, last penalty was 'DNSBLfailed'
Mar-5-09 01:21:24 87277-02230 [Worker_48] [PenaltyBox]
207.58.183.173 <machines@myliberta.info> to: wellgrow@tm.net.my
[monitoring] totalscore for 207.58.183.173 is 575, last penalty was
'IPfrequency'
Mar-5-09 06:02:12 04114-07888 [Worker_13] [PenaltyBox]
66.114.120.16 <ibfx4@interbankfx.com> to: znsgroup@streamyx.com
[monitoring] totalscore for 66.114.12
0.16 is 140, last penalty was 'SPFfail'
Mar-5-09 00:16:37 83396-01564 [Worker_2] [MessageLimit]
24.232.0.248 <apiarg@fibertel.com.ar> to: lianhh@tm.net.my [spam
found] (MessageScore 52, limit 50) [BATCH NUMBER 56T DTH78 SAR79];
Mar-5-09 00:16:40 83399-06116 [Worker_10] [MessageLimit]
122.217.103.9 <performerinfo@angel-live.com> to:
saka0805@streamyx.com [spam found] (MessageScore 5
7, limit 50) [B j J h A c s BT C F 9 B];
Mar-5-09 00:16:41 83398-09333 [Worker_17] [MessageLimit]
89.6.178.75 <elemental@beahr.com> to: zahir@bluehyppo.com [spam
found] (MessageScore 53, limit 50)
[More orggasms];
Figure 4.5: Logs For Traffic Shaping Filter
Figure 4.5 shows there were IP addresses filtered due to their previous bad
records in sending e-mails to local server. Besides of that, each IP address
was only allowed to send a limited number of e-mails within certain period of
time. Spammers will always try to send out e-mails as much as possible
within a short period. Therefore, this pattern can be recognized easily and
limiting the number of e-mails to receive from one source will stop that type
of spam.
64
Table 4.1 below shows the detail outcome of the test with using ASPBF as the filter:
Table 4.1: Result of ASPBF
Filter
Total
Traffic Shaping
8182
Sender Reputation
9142 (scoring only)
SPF
540
DNSBL
17,772
Recipient Validation
41,898
Total
68,392 (without Sender Reputation)
However, in the simulation, Sender Reputation was not used to block any
connection but instead only used to contribute some scores to the message which
later to be used for identifying spam similarities.
Out of the total connection which was 90807, about 68392 connections were
dropped or blocked. That means, about 75% of connections were dropped and not
processed for their contents. Without taking Sender Reputation into consideration,
the following Figure 4.6 is showing the effectiveness of the blocking:
Figure 4.6: Result of Filter with ASPBF
65
The figure 4.6 shows that ASPBF will only allow 25% of total connections to
go through with full body content of the messages. This is also showing that
whatever filters that are placed behind ASPBF will require a lot less processing
power as compared to filter without ASPBF.
The next is a test with Bayesian without the ASPBF is described below. The
test was conducted in a different machine. The filters only scanned messages smaller
than 128Kb size. The bigger e-mails than that will pass through without any
processing.
Table 4.2: Result of Antispam without ASPBF
Filter
Total
Pass
55979
No Processing
2250
Recipient Validation
26112
Spam (Bayesian)
42238
Spam (Content)
18775
Total
145354
Figure 4.7: Result of Filter Without ASPBF
Based on Figure 4.7, although it scan only e-mail smaller than 128Kb, but
those e-mail bigger than that are actually just about 1.5% out of total connection.
66
There are about 40% e-mails passed the filter, and 29% were filtered by Bayesian.
Although only 29% is filtered by Bayesian, but the filter was actually scanning all
incoming messages. Therefore the filter using Bayesian technique has actually
worked on 83% out of total connection.
4.3.
Analysis
Figure 4.1 also prove that ASPBF was able to identify spam connections that
could be dropped or rejected by detecting the pattern of the incoming traffic. It
would be then leaving only legitimate connections to proceed with sending
messages.
Although ASPBF filtered about 75% connections, recipient validation was
found to be an important component as it had contributed about half of the blocked
connections. The traffic shaping also received input from this component to filter IP
addresses that were doing a directory harvesting attack. That attack was to gain a list
of users available in local e-mail server.
Sender Policy Framework (SPF) is a new method and not yet implemented
by most of e-mail providers, that is why Figure 4.6 only able to show about 1% of
the traffic can be analysed for SPF records. Although SPF can be avoided in the
filter, but it has also shown that there are e-mail providers using it to take care of
their e-mail servers and to avoid forged e-mails.
But Domain Name Service Black Listing (DNSBL) really showed that it was
able to filter spam sources. Although the accuracy of the data can be questionable,
but that data is a collection from various sources such as individual that receive
spam, spam honeypot and more, in other words, some data is accurate and some
sources that are blacklisted because they have previously sent spam. Therefore
limiting connection from those blacklisted IP addresses is a preventive measures in
blocking spam.
Meanwhile, Figure 4.7 has shown that recipient validation does play an
important role even though it can only block about 17% out of total connections. The
67
rest of the connection will be scanned by Content and Bayesian Filter, which only
about 29% is detected by Bayesian.
4.4.
Conclusions
The test has given a result that shows the importance of each component of
ASPBF and their contribution towards the success of ASPBF. Nevertheless,
Bayesian filter is expected to require less processing load as the unnecessary traffics
are dropped or rejected before they are fed to Bayesian filter for scanning.
CHAPTER 5
DISCUSSION
The development of ASPBF and its simulation have proven the theory to improve
antispam filter. However, there are a lot more angle can be taken into consideration to further
its effectiveness and robustness
5.1.
Introduction
This chapter is discussing the potential of ASPBF as a compliment component to
existing antispam solution for a better of e-mail communication.
5.2.
Discussion
From the result, we can see there was only 25% of the traffic would be accepted and
processed by Bayesian filter. This had shown by implementing ASPBF, it would help the
other filters to work more efficiently and save processing load from the servers.
However, if the antispam is only working at content level, the server might suffer the
processing load due to each scanning and analysis will be required for each e-mail content. In
addition, the server will be congested with concurrent connections due to its limitation in
handling concurrent connections. This will end up with rejection to other incoming
connections and might affect connections without spam.
69
It was proven that ASPBF was improving overall antispam process and
functionalities, its existence compliments the other antispam filters. It also can run on its own
and yet able to reduce spam from the network, eventhough it is not good to end users who
might still receive spam due to no content filters.
Figure 5.1 below show the potential outcome of the combination of ASPBF and
Bayesian, taking into consideration that all connections that pass through ASPBF still contain
about 40% spam.
Figure 5.1: Expected outcome from the combination of ASPBF and Bayesian
Figure 5.1 shows that if ASPBF is combined with other antispam filter, such as
content and Bayesian filter, it will help to eliminate more spam and allow only a small
amount of e-mails and this would help to make sure the existing e-mail storage is able to
utilize its capacity for legitimate e-mails.
As the spam e-mails were also terminated at network level, it helped to reduce
congestion in the network which host the e-mail servers, thus the e-mail servers would accept
more legitimate e-mails and prevent themselves from delaying e-mails. The lesser e-mails
received will also save the processing load to scan each e-mail contents. All of these have
shown the overall process to scan and eliminate spam has improved.
5.3.
Conclusion
The result shown in this chapter has proven that ASPBF has met its objectives. Although
combining both ASPBF and Bayesian might give a better result, but performance and
robustness of the application shall be taken into consideration. The way the solution is
developed shall be studied further to have a more robust and reliable solution.
CHAPTER 6
CONCLUSION
The antispam has a potential to be improved by adding more components to
its main system. But choosing the right components is crucial in ensuring the success
of the solution. ASPBF has shown the potential choices to make in focusing only for
network level protection.
6.1.
Introduction
This chapter concludes the study by showing the right direction of antispam
in the future and the alternatives currently available in improving the antispam.
6.2.
Conclusions Remarks
The spam is still evolving and so the antispam solutions, but both are for
different agenda, one is to reach users the other one is to prevent users from
receiving spam. This competition will never end and will never satisfy everyone.
The spammers will also look into the evolution of antispams in order to
understand them and to avoid the detection and blocking. On the other hand, the
antispam developer will work on the existing spam techniques in order to improve
the filter to cater for the latest methods. The development of antispam is more
towards a reactive measure.
71
However, it is hard to anticipate the next generation of techniques to send
spam as those are only created based on the weaknesses of antispam. Therefore
continuous efforts in improving antispam is the only way to fight the spam even
though it will only be able to reduce spam rather than eliminate them in total.
6.3.
Future Work
ASPBF has a potential to be improved further especially by adding more
techniques and filters to scan an e-mail or a message at header level. The objective of
ASPBF is to filter at network level, but filtering messages at their header will also
help in identifying the untrusted sources. This is another compliment component to
ASPBF as well as to Bayesian, but requires further study to ensure its functionality
will meet its objective without jeopardizing the performance.
Here are several examples of techniques that can be used in filtering more
connections through scanning the header, such as :
i.
validating the HELO command
HELO or EHLO command is a common handshake mechanisme done by
both sending and receiving servers. The genuine e-mail server provide its
genuine hostname in the command. While, a spammer’s server will never
send its genuine hostname to hide themselves from being recognized.
Therefore, taking advantage of this command would lead to a better
antispam.
ii.
validating the Message-ID
Each message will be tagged with a message-ID by the sending server for
recording and tracking purposes. But, a spammer will normally used a
modified e-mail software to send spam directly to the receiving servers
bypassing the sending servers. Thus, the legitimate e-mail will have a
72
message-ID in a common format, while the spam will not have it or have it
but with wrong format.
iii.
validating the Forged-HELO.
Another ways to provide a hostname in the HELO command is by using the
receiving server’s domain name as the hostname. This can considered as
forging the domain, thus detecting the forged-HELO can be considered as
another solution.
iv.
DKIM verification
Domain-Key Identification Management is another way to sign the e-mail
and verify it has come from a trusted source. This method is new but already
being implemented by Yahoo! and google. Further study is required to make
use of this information to verify the genuinity of the message and source.
Those techniques will work at the header level of the messages before the full
content is accepted but it has passed all filters at network level. If the header fails the
test, the connection is dropped and no data will be received and therefore no needs to
process using Content and Bayesian filters.
Although adding more component to the filter will improve overall
processes, but one must bear in mind fighting spam is a continous effort. An obstacle
should not stop the effort but shall be considered as a motivation for a better
antispam in the future.
73
REFERENCES
Bekman, S. (2006). Anti-SPAM Techniques: Bayesian Content Filtering. Retrieved
on Mayh 15th 2006, from http://stason.org/articles/technology/e-mail/junkmail/bayesian_content_filtering.html
Foxman, E.R, and Schiano, W.T. (2000). Inspecting Spam Unsolicited
Communications on the Internet. Challenges of Information Technology
Management in the 21st Century: 2000 Information Resources Management
association International Conference. May 21-24, 2000. Anchorage, Alaska.
552.
Karlberger, C., Bayler, G., Kruegel, C., and Kirda, E. (2007). Exploting Redundancy
in Natural Language to Penetrate Bayesian Spam Filters. FWF Austrian Science
Fund. 2007.
Kim, H.J, Kim, H.N, Jung, J.J, and Jo, G.S (2004). Spam Mail Filtering System
Using Semantic Enrichment. WISE 2004, LNCS 3306. November 22-24, 2004.
Brisbane, Australia. 619-628.
Liang, G., Li, T., Gong, X., Jiang, Y., Yang, J., and Ni, J. (2006). NASC: A Novel
Approach for Spam Classification. ICIC 2006, LNBI 4115. August 16-19, 2006.
Kunming, China. 672-681.
Mehta, B., Nangia, S., Gupta, M., Nejdl, W. (2008). Detecting Image Spam using
Visual Features and Near Duplicate Detection. International World Wide Web
Conference Committee (IW3C2). April 21-25 2008, Beijing China. 497-506.
McDonald, A. (2004). SpamAssassin: A Practical Guide to Integration and
Configuration. Packt Publishing Ltd.
Nhung, N.P., and Phuong, T.M. (2007). An Efficient Method for Filtering ImageBased Spam E-mail. Computer Analysis of Images and Patterns: 12th
International Conference, CAIP 2007. August 27-29, 2007. Vienna, Autria. 945953.
Prakash, V.V., and O’Donnel, A. (2005). Fighting Spam with Reputation Systems.
Social Computing. Vol. 3 (Issue No 9). ACM.
74
Ramachandran, A. and Feamster, N. (2006). Understanding the Network-Level
Behaviour of Spammers. SIGCOMM’06. September 11-15 2006, Pisa Italy.
291-302.
Schwartz, A. (2004). SpamAssassin. California. O’Reily.
Schultz, B. (2004). TurnTide Stopping Spammers at Their Own Servers. Network
World. 2004. Retrieved on April, 26, 2004. From
http://www.networkworld.com/nw200/2004/0426watch9.html
Xiaochun Cheng, Xiaoqi Ma, Long Wang, and Shaochun Zhong (2005). A Mobile
Agent Based Spam Filter System. Computational Intelligence and Security:
International Conference, CIS 2005. December 15-19, 2005. Xi’an, China. 422427.
Zdziarski, J.A. (2005). Ending Spam: Bayesian Content Filtering and the Art of
Statistical Language Classification. San Francisco. No Starch Press, Inc.
.
Download