IMPROVING ANTISPAM TECHNIQUES BY EMBRACING PATTERN-BASED FILTERING

HAIRUL ANUAR BIN MAT NOR

A project report submitted in partial fulfilment of the requirements for the award of the degree of Master of Science (Information Security)

Centre for Advanced Software Engineering
Faculty of Computer Science and Information System
Universiti Teknologi Malaysia

APRIL 2009

Dedicated to my beloved wife and kids

ACKNOWLEDGEMENT

In preparing this thesis, I received guidance and encouragement from my supervisor, Dr. Rabiah Ahmad, without whom I might not have been able to complete it. I extend my appreciation to all of my colleagues and others who provided assistance and support. Thank you to my beloved wife and kids for their support.

ABSTRACT

This study attempts to show that there are still ways to improve antispam systems. The classical method of filtering spam is to inspect the content of an e-mail and match it against a predefined ruleset. Each matched keyword or sentence produces a weight, also called a score; the weights are combined into a final score that is used to decide how closely the message resembles spam. Spammers keep changing the style of their messages to avoid being filtered. The Bayesian technique was found suitable for embedding in an antispam system because it can recognise the characteristics of spam in a message even though the content has been changed. However, the Bayesian technique has introduced difficulties of its own, as spammers have changed the way they send spam specifically to bypass Bayesian filters. Thus, it is time to find another way to filter such spam. Normally an antispam system works on the content, but it is also possible to filter spam at the network level, based on the pattern in which it is delivered, and so reduce congestion in the network.
Filtering at the network level also benefits the server, as some spam is eliminated before it is received and processed for content. A study was conducted to test this claim. An Antispam with Pattern-Based Filter (ASPBF) is tested and the result is compared with that of an antispam system using Bayesian filtering. The comparison determines how far the ASPBF has achieved its objectives. This study is expected to stimulate further studies to improve antispam solutions in the fight against spam and to achieve better e-mail communication.

ABSTRAK (English translation)

This study was carried out to show that there is still room to improve antispam systems. The usual way of dealing with spam is to inspect the content of an e-mail and look for similarities with an existing collection of spam content information. This has forced spammers to change the way they produce spam to ensure that it is not filtered. Following that, researchers found the Bayesian technique suitable for tracking e-mail that carries the characteristics of spam even when its content has been altered to confuse the antispam system. However, the Bayesian technique was found to be no longer relevant and has even caused other drawbacks, especially when spammers alter e-mail content solely to confuse the Bayesian technique. Even so, spam can still be identified by the way it is sent. Recognising the pattern in which spam is sent at the network level can therefore curb network congestion and improve antispam processing, because most e-mails have already been blocked and only a small number remain to be processed by the content-inspecting antispam system. This study also develops an antispam application, named Antispam with Pattern-Based Filter (ASPBF), which is able to recognise spam delivery patterns and drop the spam connections.
The test results will show that it has succeeded in meeting its objective of improving the smoothness and effectiveness of the antispam system. The test results for this application and for an antispam application with Bayesian filtering will be compared to demonstrate its effectiveness. It is hoped that this study will stimulate other studies on improving antispam capability, towards a better world of e-mail communication.

TABLE OF CONTENTS

DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
TABLE OF CONTENT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

CHAPTER 1 INTRODUCTION
1.1 Introduction
1.2 Evolution of Spam and Ways to Filter
1.3 Problem Statement
1.3.1 Limitations of Bayesian
1.3.1.1 Image Spam
1.3.1.2 Open Relay and Proxies
1.3.1.3 Botnets
1.3.1.4 Spam with Dynamic Contents
1.3.2 Available Solution in Resolving Issues with Bayesian
1.4 Definition of Patterns
1.5 Objective
1.6 Scope
1.7 Hardware Specification for Simulation
1.8 Summary

CHAPTER 2 LITERATURE REVIEW
2.1 Introduction
2.2 Antispam Solutions without Bayesian
2.2.1 Image Spam
2.2.2 Botnets and Network Level Protection
2.2.3 Dynamic Contents
2.3 Antispam with Bayesian
2.4 Summary

CHAPTER 3 METHODOLOGY
3.1 Introduction
3.2 Methodology
3.2.1 Traffic Shaping
3.2.2 Sender Reputation
3.2.3 Sender Policy Framework
3.2.4 DNS Blacklist (DNSBL)
3.2.5 Recipient Validation
3.3 Improvement To Expect
3.4 Sample Data
3.5 Requirement for the simulation
3.5.1 Development Tool
3.5.2 Prototyping
3.5.3 Implementation
3.5.4 Simulation Environment
3.6 Test Methodology
3.7 Summary

CHAPTER 4 RESULT AND ANALYSIS
4.1 Introduction
4.2 Result
4.3 Analysis
4.4 Conclusion

CHAPTER 5 DISCUSSION
5.1 Introduction
5.2 Discussion
5.3 Conclusion

CHAPTER 6 CONCLUSION
6.1 Introduction
6.2 Conclusion Remarks
6.3 Future Work

REFERENCES

LIST OF TABLES

1.1 Specification of The Computer For The Simulation
2.1 Comparison of Methods to Detect Image Spam
2.2 Comparison of Network-Level Protection Techniques
2.3 Comparison of Techniques to Cater For Dynamic Contents
2.4 Recommended Combination for Pattern-Based Filtering Technique
2.5 Recommended Combination for Pattern-Based Filtering Technique (Continued)
4.1 Result of ASPBF
4.2 Result of Antispam without ASPBF

LIST OF FIGURES

1.1 Sample of Image Spam
1.2 Architecture of EMP
2.1 Sample of Distorted Text in Image
2.2 Examples of spam images
3.1 System Architecture (part 1)
3.2 System Architecture (part 2)
3.3 Function of Traffic Shaping (Part 1)
3.4 Function of Traffic Shaping (Part 2)
3.5 Function of Sender Reputation (Part 1)
3.6 Function of Sender Reputation (Part 2)
3.7 Function of Sender Reputation (Part 3)
3.8 Function of Sender Reputation (Part 4)
3.9 Function of Sender Reputation (Part 5)
3.10 Function of Sender Reputation (Part 6)
3.11 Function of Sender Reputation (Part 7)
3.12 Function of Sender Policy Framework (Part 1)
3.13 Function of Sender Policy Framework (Part 2)
3.14 Function of Sender Policy Framework (Part 3)
3.15 Function of Sender Policy Framework (Part 4)
3.16 Function of DNS Blacklist (Part 1)
3.17 Function of DNS Blacklist (Part 2)
3.18 Function of DNS Blacklist (Part 3)
3.19 Function of DNS Blacklist (Part 4)
3.20 Function of DNS Blacklist (Part 1)
3.21 Function of DNS Blacklist (Part 2)
3.22 Function of DNS Blacklist (Part 3)
3.23 Placement of the test components
4.1 Logs For DNSBL Filter
4.2 Logs For SPF Filter
4.3 Logs For Recipient Validation Filter
4.4 Logs For Sender Reputation Filter
4.5 Logs For Traffic Shaping Filter
4.6 Result of Filter with ASPBF
4.7 Result of Filter Without ASPBF
5.1 Expected outcome from the combination of ASPBF and Bayesian

LIST OF ABBREVIATIONS

ASPBF - Antispam with Pattern-Based Filter
Botnet - Robot Network
CNC - Cloudmark Network Classifier
CPU - Central Processing Unit
DCOM - Distributed Component Object Model
DKIM - DomainKeys Identified Mail
DNS - Domain Name System
DNSBL - Domain Name System Blacklist
E-Mail - Electronic Mail
EMP - Extensible Messaging Platform
IP - Internet Protocol
LDAP - Lightweight Directory Access Protocol
LSASS - Local Security Authority Subsystem Service
OCR - Optical Character Recognition
RAM - Random Access Memory
SMTP - Simple Mail Transfer Protocol
SPF - Sender Policy Framework
SVM - Support Vector Machine

CHAPTER 1
INTRODUCTION

1.1 Introduction
This chapter introduces the thesis. It begins with the history of spam and the measures taken to fight it, including the Bayesian technique, which was later found to cause problems of its own.

1.2 Evolution of Spam and Ways to Filter
The growth of technologies, especially those related to internet usage, has gradually produced side effects that damage one’s productivity and image when precautions are not taken seriously. One of these is spam. Few internet users are aware of the severity of the damage spam can cause, and many users know nothing about spam yet face it in their daily business. Zdziarski (2005) noted that many people describe spam as unwanted e-mail, while others define it as “unsolicited commercial e-mail”, which covers advertisements for products and other types of solicitation. Eventually the meaning leaned towards “unsolicited bulk e-mail”, which Zdziarski (2005) suggested was the more acceptable definition. It is commonly understood that spam is sent in bulk, intended to reach as many people as possible with messages that are meant to solicit.
The beginning of mass spam distribution can therefore be traced back to 1994. Lawyers Laurence Canter and Martha Siegel carried out the first large-scale spam that year, when they sent a message offering help with the Immigration and Naturalization Service’s “green-card lottery” to thousands of Usenet newsgroups. Thousands of angry recipients sent nasty messages to NETCOM, Canter and Siegel’s Internet Service Provider, causing the provider’s machine to overload and crash (Foxman and Schiano, 2000).

Since then, people have sought to protect their e-mail servers from spam, creating a huge demand for antispam products. This motivated companies to produce antispam solutions rapidly by taking advantage of readily available antispam software developed by the open source community. One popular open source antispam product is SpamAssassin, which is written in Perl and intended for UNIX or Unix-like environments. Schwartz (2004) described SpamAssassin as software for analysing e-mail messages, determining how likely they are to be spam, and reporting its conclusions. It is a rule-based system that compares different parts of an e-mail message with a large set of rules. Each rule adds or removes points from the message’s spam score, and a message with a high enough score is reported as spam.

SpamAssassin embeds a number of rules that contribute scores to the final decision. Each rule has its own weight, with the most effective rules given more weight. Users can also modify the score of each rule or add new rule sets. SpamAssassin combines message format validation, content filtering and the ability to consult network-based blacklists (Schwartz, 2004). These are the typical ways to filter spam.
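The rule-based scoring described above can be illustrated with a minimal sketch. It is not SpamAssassin's code or rule set: the rule names, patterns, weights and the threshold below are all hypothetical, chosen only to show how weighted rule hits add up to a final score that is compared with a cut-off.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    score: float  # positive adds spam evidence, negative removes it


# Hypothetical rules and weights, for illustration only.
RULES = [
    Rule("SUBJECT_ALL_CAPS", re.compile(r"^Subject: [^a-z]*$", re.MULTILINE), 1.5),
    Rule("MENTIONS_LOTTERY", re.compile(r"\blottery\b", re.IGNORECASE), 2.0),
    Rule("URGENT_MONEY", re.compile(r"\b(act now|free money)\b", re.IGNORECASE), 2.5),
    Rule("HAS_REPLY_HEADER", re.compile(r"^In-Reply-To: ", re.MULTILINE), -1.0),
]

THRESHOLD = 5.0  # assumed cut-off: messages scoring at or above it are flagged


def score_message(raw, rules=RULES):
    """Return the total score and the names of the rules that matched."""
    hits = [rule for rule in rules if rule.pattern.search(raw)]
    return sum(rule.score for rule in hits), [rule.name for rule in hits]


def is_spam(raw, threshold=THRESHOLD):
    total, _ = score_message(raw)
    return total >= threshold


spam_sample = "Subject: WIN THE GREEN CARD LOTTERY\n\nAct now for free money!"
ham_sample = "Subject: Re: lottery pool at work\nIn-Reply-To: <x@y>\n\nSee you there."
```

Raising or lowering a rule's weight, or adding a rule to the list, changes the final score in the same way that the text describes users adjusting rule scores.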
In general, as described by Mendez (2008), there are two kinds of spam filtering techniques: (i) collaborative systems and (ii) content-based approaches. The former are based on sharing identifying information about spam messages within a filtering community, while content-based approaches analyse message content in order to identify its class (usually using machine learning techniques).

In the early days, spam took a typical, predictable form. As time went by, however, spammers became more creative and produced spam in different ways, making it more dynamic and less likely to be caught by antispam filters. Artificial intelligence was therefore needed to counter dynamic spam attacks, which is why Bayesian techniques were embedded in SpamAssassin to improve the catch rate.

The Bayesian filter in SpamAssassin is one of the most effective techniques for filtering spam. It is a form of statistical analysis. Bayesian analysis involves teaching a system that a particular input gives a particular result. For spam filtering, this teaching is repeated many times over with many spam and ham (legitimate) e-mails. Once this is done, a Bayesian system can be presented with a new e-mail and will give a probability that it is spam. For best results, however, teaching should be a constant process (McDonald, 2004).

1.3 Problem Statement
As enhancements were continuously introduced to antispam solutions, Bayesian filtering was always the main focus because of its ability to detect spam using a probabilistic approach. However, the amount of traffic and the amount of spam have increased over time, causing the performance of the Bayesian technique to degrade. This section therefore examines the limitations of the Bayesian technique.

1.3.1 Limitations of Bayesian
The Bayesian technique used in antispam is primarily concerned with the probability that sentences and keywords resemble those of actual spam.
Therefore, Bayesian filtering can only be used on spam that is readable by a machine, which means the words must be correctly spelled. A problem appeared when spammers started to use non-readable words, in which other characters are inserted in the middle of the spelling. Machines have trouble recognising such words, but the human eye can still identify them because the words and the inserted characters are differentiated by colour or by other means. This is where Bayesian filtering gradually fails.

Worse for the Bayesian filter, spam can slip through not only because of unreadable messages; recent research has shown that readable messages can also defeat it. Karlberger et al. (2007) showed that existing attacks on Bayesian filters aim to let spam mails be identified as ham mails. They achieve this by adding words to the spam mails. The objective is that these additional words are used in the classification of the mail in the same way as the original words, thereby tampering with the classification process and reducing the overall spam probability. When the additional words are randomly chosen from a large set of words, this is called a random word attack, also known as “word salad”. The objective is that the spam probabilities of the added words compensate for the original tokens’ high spam probabilities in the calculation of the whole message’s combined spam probability.

This shows that spammers have turned significant energy to building software that cloaks their spam to hide the trigger words and phrases from Bayesian filters. As a result, the messages become progressively simpler for pattern-based scanners to identify correctly as spam, because pattern-based scanners can target the cloaking techniques the spammers are using rather than trying to identify spam from words and phrases.
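The effect of a word salad attack on the combined spam probability can be seen in a small sketch. It assumes one common naive-Bayes style combination of per-token probabilities, P = Πp / (Πp + Π(1 - p)); this is not claimed to be the exact formula of SpamAssassin or of Karlberger et al. The token table and the value used for unknown tokens are hypothetical.

```python
import math

# Hypothetical per-token spam probabilities, as a trained filter might hold them.
TOKEN_SPAM_PROB = {
    "viagra": 0.99, "cheap": 0.90, "pills": 0.95,                       # spammy tokens
    "meeting": 0.05, "agenda": 0.04, "thanks": 0.10, "report": 0.08,    # hammy tokens
}
UNKNOWN_TOKEN_PROB = 0.4  # assumed value for tokens never seen in training


def combined_spam_probability(tokens):
    """Combine token probabilities as prod(p) / (prod(p) + prod(1 - p)).

    The products are accumulated in log space to avoid underflow on long messages.
    """
    log_spam = 0.0
    log_ham = 0.0
    for token in tokens:
        p = TOKEN_SPAM_PROB.get(token.lower(), UNKNOWN_TOKEN_PROB)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


original = ["cheap", "viagra", "pills"]
# The same message padded with innocuous words, as in a word salad attack.
padded = original + ["meeting", "agenda", "thanks", "report"]
```

With these assumed values the original tokens alone score very close to 1, while the padded message falls below 0.5: the added hammy tokens outweigh the spammy ones in the combination even though the spam payload is unchanged.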
There are significant patterns that can be recognised in spam e-mail, as spam is normally sent in bulk to a group of people. Many similar spam messages are therefore circulating on the internet, and it is a good idea to take advantage of this characteristic to enhance the filtering process. Not only are the patterns within spam e-mails similar; the techniques used to send the spam are also found to be similar. This has increased awareness among antispam developers of the value of filtering spam before it actually reaches the e-mail server, thus saving a great deal of processing power. These two types of pattern are not the centre of Bayesian filtering’s focus, even though they can help an antispam system perform better. Where some spam slips through the pattern-based filter, the Bayesian filter can be used as the last line of defence.

1.3.1.1 Image Spam
Image spam is another form of spam, effective at deceiving spam filters because they can only process text (Mehta et al., 2008). Images are hard to process for spam identification because they contain no machine-readable characters; the text in them can only be interpreted by the human eye. Figure 1.1 shows one example of image spam.

Figure 1.1: Sample of Image Spam

Although Optical Character Recognition (OCR) was introduced to extract letters and keywords from an image, it has a big impact on overall processing load. Mehta et al. (2008) also emphasise that images in spam e-mails contain text messages conveying the intent of the spammer. This text is usually an advertisement and often contains text that has been blacklisted by spam filters.

Since legitimate e-mails may also contain images as attachments, it is hard to differentiate legitimate e-mail from spam e-mail if the filter cannot detect spam inside an image. Moreover, detecting the characteristics of an e-mail carrying image spam may not be the best solution, as those characteristics are almost the same as those of legitimate e-mails.
In further research, Baskhar et al. (2008) found that even sophisticated techniques like Optical Character Recognition (OCR) can fail to recognise text in images because every spam image is made distinct: more noise is added, backgrounds are randomised, font colours are varied, and so on.

1.3.1.2 Open Relay and Proxies
An open relay is a convenience that lets end users send e-mail without authentication that would require manual intervention each time. It also simplifies communication between an internal server and the upstream server that connects to the internet. However, open relays have been exploited by spammers because of the anonymity and amplification offered by the extra level of indirection. It appears that the widespread deployment and use of blacklisting techniques have all but extinguished the use of open relays and proxies to send spam (Ramachandran and Feamster, 2006).

Although SMTP authentication is widely used nowadays to prevent unauthorised use of e-mail servers, many e-mail providers still do not take serious action on this issue because it does not really affect their own servers. Moreover, spammers are developing their own tiny SMTP servers in order to broadcast e-mail more simply than by searching for open relay servers, which may take time to discover. This tiny SMTP server is then embedded into a botnet, giving spammers a large distribution of open relay servers that are ready for use at any time.

1.3.1.3 Botnets
Today many people claim that most spam comes from botnets, where a botnet is a collection of machines acting under one centralised controller. Typically a machine is infected by a worm that exploits Windows vulnerabilities and opens a backdoor, allowing a remote attacker to take control of the infected machine.
As mentioned earlier, a bot may carry a tiny SMTP server application, so once the backdoor is open the attackers have direct access to an SMTP server from which to launch spam at anyone in the world. This hides the real spammers and puts the blame on innocent machine owners. A bot does not necessarily need its own SMTP server, however; it can exploit vulnerabilities in the Windows machine to make it act as one. As mentioned by Ramachandran and Feamster (2006), worms exploiting DCOM and LSASS vulnerabilities on Windows systems allow infected hosts to be used as mail relays, and attempt to spread to other machines affected by the same vulnerabilities, as well as over e-mail.

1.3.1.4 Spam With Dynamic Contents
Content-based filters, such as those incorporated in popular spam filters like SpamAssassin, successfully reduce the amount of spam that actually reaches a user’s inbox. On the other hand, content-based filtering has drawbacks. Users and system administrators must continually update their filtering rules and use large corpuses of spam for training, whereas the cost to spammers of responding is negligible, since they can easily alter content to attempt to evade these filters (Ramachandran and Feamster, 2006).

Although the Bayesian technique was initially found to show high precision and recall, it has several problems. First, it has a cold-start problem: a training phase must be completed before the system is put to use, in which the system is trained on spam and non-spam mail. Second, the cost of filtering is higher than that of a rule-based system. Third, if an e-mail has only a few terms that represent its content, filtering performance falls (Kim et al., 2004). In that sense, the Bayesian filter works on similarity to previously seen spam and non-spam.
However, newly designed spam may not be recognised by the Bayesian filter, so human intervention is required to provide input for retraining and to familiarise the filter with the latest content. While waiting for human input, damage may already have been done and the reputation of the e-mail provider may have suffered.

1.3.2 Available Solution in Resolving Issues with Bayesian
This study was inspired by a commercially available solution, the Extensible Messaging Platform (EMP), which focuses on recognising spam from its pattern and signature. Although the solution is available off the shelf, some of its characteristics may affect performance and reliability. Figure 1.2 shows the architecture of the EMP solution.

Figure 1.2: Architecture of EMP

From the given architecture, some factors may lead to decreased performance:
i. Sender reputation is based on a blacklist database whose information is collected and shared from various sources all over the world, so requesting the information in real time requires a reliable connection. This is a very resource-intensive process, especially when the amount of traffic is very large.
ii. The EMP acts as SMTP software, which puts more load on the server because the server must run two instances of different SMTP software.
iii. The spam signatures must be downloaded from a central repository for the pattern-based filter to function properly. It would be better if the machine were self-sufficient, developing its own signature database from observation and analysis of incoming traffic.

Although a commercial product is available to provide pattern-based filtering, this study looks more deeply into ensuring that such a product maintains its performance over a longer period even as the amount of traffic increases.
Besides, this is an academic study that acknowledges the contributions of earlier researchers to improving antispam solutions. It becomes more valuable and credible when each of the techniques proposed in past research is put to the test.

1.4 Definition of Patterns
The term ‘pattern’ refers to the following definitions, depending on the protection level at which it is used:
i. At network level
At this level, the pattern refers to the behaviour of an IP address in requesting SMTP connections. Some of the patterns are:
• A large amount of traffic in a short period of time.
• A number of invalid recipients from an IP address in a short time.
• An invalid IP address, or an IP address without a specific DNS record.
ii. At e-mail content level
At this level, certain characteristics, especially those listed below, are categorised as patterns:
• Invalid timestamp.
• Invalid header information.
• Keywords or a mix of keywords matching a database of spam keywords.
iii. At e-mail attachment level
The following item is the pattern to look for:
• Image containing spam keywords. The pattern can be found by recognising the properties of the image and comparing them with templates of image spam.

1.5 Objective
This study focuses on three main objectives:
i. To study other solutions to improve the catch rate and to reduce processing load.
This study looks for other antispam solutions that avoid Bayesian filtering in their implementation. Bayesian filtering is known for its probabilistic approach, which requires a large amount of sample data and intensive calculation to reach higher accuracy. Large data sets and heavy calculation require more processing power, which reduces the overall performance of an antispam product. A simpler approach, known as pattern-based filtering, should perform better because it does not need to look thoroughly into the content of an e-mail.
ii. To analyse combinations of existing solutions to form the best pattern-based filtering technique.
Researchers have tried to improve antispam filtering without using the Bayesian filter technique in their implementations. To prove that they had improved on existing solutions, however, they had to compare their results with those of antispam products using Bayesian filters. Moreover, each of these techniques focused only on a specific function and parameter. This study therefore helps to determine the right combination of techniques that could improve an antispam solution without the need to include a Bayesian filter.
iii. To develop one antispam system at the end of this study.
The system is formed from a combination of selected techniques and is developed without Bayesian filtering. For comparison purposes, the result from an existing product that uses Bayesian filtering is compared with the result of the newly developed system to show that an improvement has been made.

1.6 Scope
This study introduces techniques to improve antispam, especially to reduce the errors and the performance cost generated by the Bayesian filter. Although there are a number of techniques focusing on other areas of improvement, this report does not study the foundations of those techniques. Instead, it studies the suitability and compatibility of each of the mentioned techniques in order to form the pattern-based filtering technique to be used in an antispam solution or product. Given the right tools and a controlled environment, the antispam system is developed using pattern-based filtering and tested in a comparison study against another antispam solution that uses Bayesian filtering.

The proposed antispam system shall have three levels of protection at which pattern-based filtering can be applied:
i. Network Level
ii. E-mail Content Level
iii. E-mail Attachments Level

However, as time is a major constraint, the scope of this study is limited to implementing protection at the network level. Although the focus is narrow, it should meet the objectives, as filtering at the network level reduces the spam reaching the server, thus improving the accuracy of spam blocking. It also eliminates extra load on the server, since fewer e-mails need to be processed.

For the test, a collection of samples gathered from actual spam is fed into the antispam system. The samples are limited to spam written in English characters, and image spam is not scanned with optical character recognition (OCR).

1.7 Hardware Specification for Simulation
An existing computer with the following specification is used to develop the antispam system:

Table 1.1: Specification of the Computer for the Simulation

No.   Type               Specifications
5.    CPU                AMD Phenom (Quad Core)
6.    RAM                2 GB
7.    Hard Disk          500 GB
8.    Operating System   OpenSolaris

The justification for choosing a Unix machine is given in Chapter 3.

1.8 Summary
The problems described in this chapter show that there is a need to improve the existing techniques and methods for better results, including better utilisation of system resources. As the problems narrow down to catch rate and performance issues, this chapter sets out three objectives for resolving them: first, to study other solutions; second, to form a pattern-based filtering technique from a combination of solutions; and third, to develop an antispam solution containing that technique and evaluate its performance in comparison with an antispam system using a Bayesian filter.

CHAPTER 2
LITERATURE REVIEW

In this chapter, some preliminary studies on antispam are reviewed, especially on spam filtering techniques that do not require Bayesian filtering to be in place.
This chapter discusses the approaches taken by other studies and their potential to become part of the new system.

2.1 Introduction
Several studies have been carried out to improve the efficiency of antispam products, but each focused only on a specific way to improve a specific task that Bayesian filtering had not resolved, for example image spam and network-level protection. Pattern-based filtering is a combination of techniques that use patterns, rather than the statistical method introduced by Bayesian filtering, to recognise spam in an e-mail. Pattern-based filtering alone is not enough to filter all spam, but it is enough to eliminate the problems caused by Bayesian filtering.

2.2 Antispam Solutions without Bayesian
This section describes types of spam and the measures developed in studies around the world to prevent them from attacking a network. These methods and techniques help this study to find a better solution.

2.2.1 Image Spam
As spam evolved, researchers looked for ways to protect networks from the threats it poses. Image spam is the term used when spam is no longer in plain text form but is an image that carries the text. Optical Character Recognition (OCR) was adopted to meet this need.

A possible way to detect image-based spam is to use a pipeline consisting of an OCR system that extracts and recognises embedded text, followed by a text classifier that separates advertising text from legitimate content. While this solution promises to detect spam with a certain level of accuracy, existing OCR algorithms are computationally expensive and thus cannot operate on heavily loaded e-mail servers (Nhung & Phuong, 2007).

The overall OCR process is therefore inefficient, requiring more processing load and time. A further deficiency is its inability to recognise distorted text in an image. Figure 2.1 is an example of distorted text.
Figure 2.1: Sample of Distorted Text in Image

Work by Mehta et al. (2008) showed a way to detect image spam based on a near-duplicate detection algorithm. It rests on the intuition that many images similar to an identified spam image can be recognised, since image spam is usually generated from a template. Given enough training data, the method should be able to detect large volumes of image spam while remaining open to further training. However, because the technique is based on trained data, it needs to store the characteristics of each spam image in order to recognise similar image spam in the future. Its disadvantage is that the overall process slows down as the database grows.

Nhung and Phuong (2007) used a slightly different technique that also does not extract text from an image. Instead, it uses an edge-based feature vector, which can be computed efficiently, to represent the major shape properties of the image. Since most spam images contain large proportions of text, as in Figure 2.2, they must have a shape representation similar to that of other text-intensive images.

(a) Image containing only text (b) Image with photographic elements
Figure 2.2: Examples of spam images

Nhung and Phuong (2007) described a new method for detecting spam e-mail with content embedded in images. Given an image, their method does the following:
i. It extracts an edge-based feature that summarises the global information of the image.
ii. It computes a vector of similarity scores between the image and a set of templates that contain only text.
iii. The similarity vectors then serve as input for Support Vector Machine (SVM) training and classification.

The use of an edge-based feature captures regularities in the shapes of text-intensive spam images without requiring costly computation.
The method is fast because it does not use computationally expensive image processing and text recognition steps. The authors claimed that it separates spam images from non-spam ones with an accuracy of 80% and higher. Their method combines SVM with a vector representation.

Table 2.1 compares the advantages and disadvantages of the three methods discussed above.

Table 2.1: Comparison of Methods to Detect Image Spam
Solution | Advantages | Disadvantages
OCR | Good at text recognition | Fails to detect the right text when it is distorted
Near-duplicate detection algorithm | Good at finding similarities, to stop the mass spread of image spam | Requires training data, more hard disk space and intensive calculation
SVM with vector representation | Good at differentiating image spam from non-image spam by measuring the properties of the image for comparison with templates | Accuracy is around 80%, which means a lot of image spam will slip through

2.2.2 Botnets and Network Level Protection

Another line of study moved the focus away from content filtering towards recognising spam from its network-level characteristics. Ramachandran and Feamster (2006) stressed that much attention has been devoted to studying the content of spam, but comparatively little to spam's network-level properties.

Conventional wisdom often asserts that most of today's spam comes from botnets. However, in their study of a botnet named Bobax over a 17-month observation period, they found that Bobax did not send a large amount of spam, regardless of how long the botnet was active. More than 65% of infected hosts sent spam only once, and nearly 75% of IP addresses sent spam for less than two minutes, although many of them sent several e-mails during their existence. In that sense, filtering spam on the basis of botnet characteristics is not worthwhile.
The resources collected from all over the world would be left unutilised, even though holding such resources is good practice for protecting the network from future botnet threats.

Nevertheless, some antispam products have already introduced the following techniques for network-level protection:

i. Sender reputation. The system collects information on each e-mail address and IP address and learns how those sources behave in sending spam, including their percentage of spam over time. Senders with a good reputation are given a smooth connection, while senders with a bad reputation are given limited access to send e-mail. There are two types of sender reputation technique:

a) Reliance on a remote database
Cloudmark Network Classifier (CNC), described by Prakash and O'Donnell (2005), was the first to be developed for this functionality. It was prototyped using the Vipul's Razor technique, which was released as an open source project in 1998, and the product was founded along with the major Razor2 update. CNC is a community-based filter training system. It does not rely upon any one semantic analysis scheme but upon a large set of orthogonal signature schemes that are trained by the community (Prakash and O'Donnell, 2005).

CNC consists of four components: agents, nomination servers, catalog servers, and a reputation system known as TeS (Trust Evaluation System). An agent is the application that reports feedback, either spam or not spam, to the nomination server. The feedback takes the form of a set of small fingerprints generated from the message, rather than the entire content. CNC requires submitted feedback to be corroborated by multiple trusted members of the community. Once a fingerprint is considered spam, it is added to the catalog servers. Any e-mail system can then query the catalog servers with a fingerprint to learn how likely the message is to be spam.
b) Locally developed database
CNC has the advantage of storing shared information gathered from around the world. However, heavy dependency on the catalog server adds the cost of maintaining the connection, and unresponsive queries slow down performance. Keeping sender reputation in a local database is therefore a more convenient approach. The first product to use this technique was TurnTide, which was later acquired by Symantec Corp. By using routing technology to stop spam at the network layer, TurnTide aims to make it technologically and economically infeasible for spammers to target corporate e-mail servers (Schultz, 2004).

TurnTide incorporates TCP traffic shaping, a basic engineering technique for inserting quality-of-service and error-handling mechanisms into a network. TCP traffic shaping is a way to limit access to bandwidth. Over time, TurnTide learns the behaviour of each incoming IP address. IP addresses with a bad reputation are given a small pipe of bandwidth, while good ones are given a bigger pipe. This frustrates spammers, who cannot spread spam as fast as they want. The advantage of this technique over CNC is that it does not rely on a remote database; it builds its own database from incoming connections.

ii. Sender Policy Framework (SPF) and DomainKeys Identified Mail (DKIM). These two approaches let the original e-mail provider tell the world which sources legitimately send its e-mail. With SPF, the provider registers the IP addresses of its SMTP servers in DNS, so a receiving e-mail server accepts mail for that domain only from a registered IP address. This eliminates connections that spammers make from their own customised SMTP servers. DKIM instead publishes a public key in DNS, against which a signature carried in the message is verified. For example, if the domain Hotmail.com is registered with an SPF record for IP address 10.10.10.10, spammers who forge e-mail as a Hotmail user but send it from an unknown IP address will be rejected.
The advantage of this method is that it blocks most spam sources, because posing as a genuine e-mail domain is the best technique for attracting a recipient's attention.

iii. Recipient validation. The e-mail connection is terminated whenever the recipient is found to be invalid, which removes the need to process the e-mail content any further. Most e-mail servers do this automatically. Some e-mail servers, however, disable this behaviour and accept all incoming e-mail so as not to reveal whether a recipient is valid. If exposed to spammers, such information would help them start spamming that user. The disadvantage of this protection is that the server must process and store every incoming e-mail, which does not save processing load.

iv. E-mail harvesting protection. In e-mail harvesting, an IP address opens a connection as if to send e-mail but closes it as soon as the server acknowledges whether the recipient exists in the system. The attacker normally uses a dictionary to launch the attack. The attack does not seem harmful to the system itself, but the list of addresses gathered can be exploited by spammers.

v. DNS Blacklist (DNSBL). A DNSBL contains a collection of IP addresses that have been reported to send spam. Any e-mail system can use this DNS record to filter IP addresses known to belong to spammers. However, the list is neither fully accurate nor real-time, so some IP addresses remain blacklisted even after the real spammer has moved to another IP address. One provider of such a list is Spamhaus, which was recently sued over alleged false blacklisting. Although an IP address may stay on the blacklist for only a few days, that is enough to cause losses to any company that relies on the Internet and e-mail as its main source of business.
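As a rough illustration of how a mail system consults a DNSBL, the sketch below builds the DNS name that is conventionally queried: the IPv4 octets are reversed and the zone of the list is appended, and an address answer (by convention within 127.0.0.0/8) means that the IP address is listed. The function names are hypothetical, the zone is only an example, and no real DNS lookup is made; the answer from the resolver is passed in as a plain argument.

```perl
use strict;
use warnings;

# Build the name to look up for an IPv4 address in a DNSBL zone.
# The zone passed in is only an example; any DNSBL zone could be used.
sub dnsbl_query_name {
    my ( $ip, $zone ) = @_;
    my @octets = split /\./, $ip;
    # Reject anything that is not four decimal octets in the range 0..255.
    return undef
        unless @octets == 4 && !grep { !/^\d{1,3}$/ || $_ > 255 } @octets;
    return join( '.', reverse(@octets), $zone );
}

# Interpret the A-record answer (hypothetical helper): no answer means
# "not listed"; an answer inside 127.0.0.0/8 means "listed".
sub dnsbl_is_listed {
    my ($answer) = @_;
    return 0 unless defined $answer;
    return $answer =~ /^127\.\d{1,3}\.\d{1,3}\.\d{1,3}$/ ? 1 : 0;
}

print dnsbl_query_name( '192.0.2.10', 'zen.spamhaus.org' ), "\n";
```

A real filter would hand the generated name to a resolver and cache the answer locally, which is the dependency on a remote resource noted for DNSBL above.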
Despite these pros and cons, Ramachandran and Feamster (2006) concluded that mitigation based on network-level properties, such as the IP address of the last remote mail relay, is a reasonable way to filter spam before it is processed for content filtering. This reduces the need for content processing power and eliminates the threat before it reaches the system. The research also highlighted two properties of network-level information that could lead to more robust filtering:

i. Network-level properties are less malleable than those based on an e-mail's contents.
ii. Network-level properties may be observable in the middle of the network, or closer to the source of the spam, which may allow spam to be quarantined or disposed of before it ever reaches a destination mail server.

However, the study by Ramachandran and Feamster (2006) focused on understanding the network-level behaviour of spammers and did not propose a solution based on its findings. The findings are nonetheless valuable in raising e-mail providers' awareness of the need for network-level protection against spam. Table 2.2 shows the pros and cons of the network-level protection techniques described above.

Table 2.2: Comparison of Network-Level Protection Techniques
Solution | Advantages | Disadvantages
CNC | Valid information, as it is contributed by trusted sources all over the world | Dependency on a remote server requires a sustained connection
TurnTide | Local cache and self-learning | The number of IP addresses that can be stored is limited to a few million. However, IP addresses with less activity are removed, so there is no need to store all IP addresses for a long time
SPF and DKIM | Able to block incoming connections from forged sources | Depends on remote DNS, but a genuine domain will normally provide a reliable DNS service
Recipient Validation | Able to prevent the extra load of processing the content | Resource intensive, as the database is queried each time a connection is established
E-mail Harvesting Protection | Able to stop connections that waste network resources | Requires intelligence to detect the activity
DNSBL | Good at preventing known spam sources from entering the network | Depends on a remote resource

2.2.3 Dynamic Contents

In the early days, antispam systems looked for spam by keyword matching. If an e-mail contains a lot of text, the chance of finding keywords or sentences that match those in the database is high. This pushed spammers to be more creative in composing spam: they use less text, and sometimes embed only a link in the e-mail.

Semantic enrichment, introduced for spam filtering by Kim et al. (2004), is another method to improve the effectiveness of an antispam filtering system. It is usually done by remodelling database schemas in a higher data model in order to express explicitly the semantics that are implicit in the data. Kim et al. (2004) used the technique to add semantically useful terms to an e-mail that contains only a few terms. The semantic enrichment process is divided into two parts, structural information and semantic concepts.

The method of Kim et al. (2004) consists of two phases, training and testing. The training phase creates classes of spam and non-spam, each containing sub-classes. These classes are kept in DAML+OIL form. DAML stands for DARPA Agent Markup Language, and DARPA for the Defense Advanced Research Projects Agency, the central research and development organisation of the US Department of Defense. The goal of DAML is to develop a language and tools to facilitate the concept of the Semantic Web.
OIL stands for Ontology Inference Layer, a proposal for a web-based representation and inference layer for ontologies that combines the widely used modelling primitives of frame-based languages with the formal semantics and reasoning services provided by description logics.

The test phase has two steps, structural information and semantic concepts. The method of Kim et al. (2004) converts each e-mail into RDF (DAML+OIL) format so that it can be mapped easily onto the ontology. One disadvantage is that every e-mail must be converted into RDF before the comparison can proceed. The method is efficient when e-mail contents are small, but not for e-mail with large contents or for an e-mail system that receives a large volume of e-mail.

Liang et al. (2006) introduced a Novel Approach for Spam Classification (NASC), which was shown experimentally to raise both the recall ratio and the precision ratio and to improve self-adaptation to the diversity of spam. NASC is an immune-based model for dynamic spam detection. The model is composed of three parts: the Mail Character Presenting Module (MCPM), the training module and the filtering module (Liang et al., 2006).

They emphasised that the MCPM uses the vector space model (VSM) and presents the received mail as discrete words. VSM has been widely used in the traditional information retrieval field, and most search engines use similar measures based on this model to rank web documents. Documents are mapped into a term vector space. A document is conceptually represented by a vector of keywords extracted from the document, with associated weights representing the importance of the keywords in the document and within the whole document collection. The form of the vector is listed below.
V = (t_1, w_1, ..., t_n, w_n)

where t_i is a keyword and w_i is the associated weight representing the importance of that keyword in the document.

The training module consists of four elements:

i. Gene Library
According to the structure of a document, the gene library is divided into two subsets: the mail list gene library and the content gene library. Each is a collection of keywords. The mail list gene library consists of suspicious e-mail addresses or domains issued by an authoritative institution; it can be updated from the Internet or by manual insertion. The content gene library consists of keywords gathered from contributors on the Internet.

ii. Immature Detector
An immature detector is a term vector organised in a pre-defined structure. Every element of the term vector comes from the gene library.

iii. Affinity
Affinity is the match strength between two documents.

iv. Mature Detector
If the match strength between an immature detector and one of the training mails is below the pre-defined threshold, the detector is considered a mature detector.

The filtering module consists of:

i. Memory Detector
A memory detector is considered a reliable detector. A mature detector can become a memory detector if its match count is over the active threshold.

ii. Co-stimulation
Co-stimulation is the signal that comes from the administrator to the system. It is used in the following situations:
a) When mature detectors detect a suspicious e-mail, the system needs this signal to judge whether the suspicious e-mail is spam or not.
b) When the count field of a mature detector is over the active threshold, the system needs the co-stimulation signal. If an affirmative co-stimulation signal is written back to the system by the network administrator within a fixed period, the mature detector becomes a memory detector. Otherwise, it is treated as an invalid detector and deleted from the mature detector set.
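The detector life cycle above can be illustrated with a small sketch. It assumes that term vectors are stored as keyword-to-weight hashes and that affinity is measured by cosine similarity. The summary of Liang et al. (2006) given here only defines affinity as a match strength, so the measure, the threshold value and the function names below are assumptions. The maturity test follows a negative-selection reading, in which a detector matures only if it stays below the threshold for every training mail.

```perl
use strict;
use warnings;

# Affinity between two term vectors (hash refs: keyword => weight),
# taken here to be cosine similarity (an assumption, see the text above).
sub affinity {
    my ( $u, $v ) = @_;
    my ( $dot, $nu, $nv ) = ( 0, 0, 0 );
    for my $term ( keys %$u ) {
        $nu  += $u->{$term}**2;
        $dot += $u->{$term} * $v->{$term} if exists $v->{$term};
    }
    $nv += $_**2 for values %$v;
    return 0 if $nu == 0 || $nv == 0;
    return $dot / sqrt( $nu * $nv );
}

# An immature detector matures if it matches none of the training mails
# strongly, i.e. every affinity stays below the threshold.
sub is_mature {
    my ( $detector, $training_mails, $threshold ) = @_;
    for my $mail (@$training_mails) {
        return 0 if affinity( $detector, $mail ) >= $threshold;
    }
    return 1;
}

# Made-up example data.
my $detector = { viagra => 0.9, cheap => 0.4 };
my @training = ( { meeting => 0.8, agenda => 0.6 }, { invoice => 1.0 } );
print is_mature( $detector, \@training, 0.5 ) ? "mature\n" : "discarded\n";
```

Promotion from mature detector to memory detector would add a match counter and the administrator's co-stimulation signal on top of this; both are left out of the sketch.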
Table 2.3 compares NASC and semantic enrichment.

Table 2.3: Comparison of Techniques to Cater For Dynamic Contents
Solution | Advantages | Disadvantages
Semantic Enrichment | Able to catch spam with minimal contents | Not efficient if the e-mail system receives large e-mails or a large volume of e-mail traffic
NASC | Able to utilise computer memory for a faster matching process and to terminate idle and unnecessary detectors | May require more RAM for the memory detectors if the volume of traffic is huge. It also requires a training set

2.3 Antispam with Bayesian

In normal operation, an antispam system requires a content filter and a Bayesian filter. The content filter checks the content of an e-mail against a built-in database of regular expressions. Every matched expression contributes a score to the message, and a message whose total score is over the preset threshold is classified as spam. Here are two examples of regular expressions used by antispam systems (Schwartz, 2004):

/CLICK\s.{0,30}(?:HERE|BELOW)/s
/click\s.{0,30}(?:here|below)/is

Both patterns match the word "click", followed by one whitespace character, then up to 30 arbitrary characters, then "here" or "below"; examples are "click here", "click below", "click … here" and "click … below". The first pattern matches only when the words are in capital letters, while the second, with its i modifier, matches regardless of case. Because whitespace must follow "click" immediately, "clicks here" is not matched. Such rules are pre-defined by the administrator, but they can also be collected from suggestions made by Internet users.

Bayesian filtering, meanwhile, works on each word independently, without regard to dependency or sequence. It relies on statistical functions and a set of trained data. According to Bekman (2006), the filter must be fed with good mail and with undesired e-mail; from that point on, a Bayesian filter tries to decide what is spam and what is not and sorts the e-mail into different folders, although it still needs to be fed with both kinds of mail continually.
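The rule-based scoring described at the start of this section can be sketched in a few lines of Perl. The two patterns are the ones quoted above; the rule names, scores and threshold are made-up values for illustration and are not taken from any real ruleset.

```perl
use strict;
use warnings;

# Each rule pairs a pattern with the score it adds when it matches.
# Names, scores and threshold are illustrative assumptions.
my @rules = (
    { name => 'CLICK_UPPER', re => qr/CLICK\s.{0,30}(?:HERE|BELOW)/s,  score => 1.5 },
    { name => 'CLICK_ANY',   re => qr/click\s.{0,30}(?:here|below)/is, score => 0.5 },
);
my $threshold = 1.0;

# Sum the scores of all rules whose pattern matches the message body.
sub spam_score {
    my ($body) = @_;
    my $total = 0;
    for my $rule (@rules) {
        $total += $rule->{score} if $body =~ $rule->{re};
    }
    return $total;
}

# A message is classified as spam when its total score is over the threshold.
sub is_spam { return spam_score( $_[0] ) > $threshold ? 1 : 0 }

for my $msg ( 'CLICK HERE NOW', 'Please click the link below', 'clicks here' ) {
    printf "%-30s score=%.1f spam=%d\n", $msg, spam_score($msg), is_spam($msg);
}
```

In this sketch the upper-case message triggers both rules and passes the threshold, the lower-case message triggers only the case-insensitive rule, and "clicks here" triggers neither because no whitespace follows "click".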
One drawback of the Bayesian filter is that it requires training, which gives a weight to each word and enables the filter to apply its statistical method to measure the message.

2.4 Summary

The techniques studied in this chapter provide enough input for the thesis to move on to the next stage. Table 2.4 lists the techniques with the most advantages in each problem area and shows the possible combination from which pattern-based filtering can be developed.

Table 2.4: Recommended Combination for Pattern-Based Filtering Technique
Solution | Advantages | Extra Requirements
SVM with Vector Representation for Image Spam | Solves issues with image spam | None
Sender Reputation, local database (TurnTide) | Shapes the traffic based on behaviour, thus limiting spammers' access to bandwidth | Local database to cache all IP addresses
Sender Policy Framework | Only allows a valid IP address representing a valid domain to send e-mails | None
Recipient Validation | Drops connections with invalid users, thus preventing the CPU from being used to scan the e-mail contents | None
DNSBL | Good at preventing known spammer sources from sending e-mails | Local DNS cache to eliminate dependency on the remote DNSBL server
NASC | Good for content-based checking with better performance | Requires more RAM

The next chapter introduces the methodology used to develop the suggested solution and to test it in a controlled environment in order to verify that the objective is met. The thesis focuses on the filtering mechanism at the network level. Network-level filtering only complements the content filter, but it also reduces the processor's load and so leaves more processing power for the content filter to work efficiently.
CHAPTER 3

METHODOLOGY

This chapter puts the proposal into practice by forming a methodology for developing an antispam system that uses the pattern-based filtering technique.

3.1 Introduction

This chapter discusses the proposed methodology for developing an antispam system using the pattern-based filtering technique. Each component selected in the previous chapter is mapped into the system architecture according to its functionality and requirements. The resulting AntiSpam with Pattern-Based Filter (ASPBF) is expected to open an opportunity for a better antispam solution.

3.2 Methodology

The method used in this study is to implement an antispam system with the pattern-based filtering technique, which protects the system at the network level, the e-mail header level and the content level. Figure 3.1 and Figure 3.2 show the process flow and architecture of the proposed system.

3.2.1 Traffic Shaping

As Figure 3.1 shows, each incoming connection is analysed for its historical behaviour and record of rejections in order to classify its source as trusted or untrusted. Trusted sources are allowed to connect, while untrusted sources are rejected for a set period of time. When another filter in the same system rejects an IP address, the address is recorded in the traffic-shaping database. Every rejection increases the counter of that IP address in the database; if the counter is above the preset threshold, the IP address is rejected. The counter is reset once the rejection period has elapsed.

3.2.1.1 Functionalities and Preset Conditions

i. maxSMTPipDuration is the time window in which connections from an IP address are counted.
ii. maxSMTPipConnects is the maximum number of SMTP connections an IP address can make within maxSMTPipDuration.
iii. maxSMTPipExpiration is the period for which an IP address rejected under maxSMTPipConnects stays rejected before it is allowed to create new connections.
iv. Record each connecting IP address.
v. IPNumTries is the counter for the occurrences of the IP address within maxSMTPipDuration.
vi. Reject the connection if IPNumTries exceeds maxSMTPipConnects within maxSMTPipDuration.
vii. Keep rejecting an IP address detected under item (vi) until maxSMTPipExpiration has elapsed.
viii. Reset IPNumTries if an IP address reconnects after maxSMTPipExpiration.
ix. Each IP address with a bad history in the subsequent filters is kept in the same cache and rejected until the expiration time.

3.2.2 Sender Reputation

If an IP address is not rejected by the traffic-shaping process, it is then validated against its reputation. The sender reputation database is maintained by SenderBase (www.senderbase.org). It is a collection of IP addresses gathered by sensors located all over the world. The sensors collect IP addresses that are found to send spam or to be zombie machines controlled by others through a botnet.

The reputation of an IP address is resolved through a DNS query to SenderBase's DNS servers. To make queries faster, each result is cached locally and removed once it has been kept longer than the preset threshold.

SenderBase records contain blacklist and whitelist entries with weights, so it is recommended to monitor the accuracy of the records before deciding whether to block or allow the traffic. Even where the result is not used to block the connection, adding a score to the final result also helps to filter spam.

The method works as follows:

i. Given the IP address of the sender, Address, a DNS record is queried from the sender reputation provider, Host. This is done by the function SenderBaseOK, which returns 0 or 1.
ii. The value returned by Host identifies the reputation level of Address, which is treated accordingly: the connection is either blocked or allowed to pass.
iii. The Address and its reputation value are kept in the cache until the expiry time is exceeded. The function that adds an entry to the cache is SBCacheAdd, and the one that removes expired entries is cleanCacheSB. This makes the next query faster and less dependent on external sources.

3.2.3 Sender Policy Framework

The next filter verifies the SPF record for the connecting IP address and the sender's domain name. Each domain owner can register in DNS the trusted sources that deliver e-mail for the domain. If the IP address matches its sender's domain, the connection is allowed to pass.

The method works as follows:

i. Two items are required to verify the validity of an incoming connection: the IP address and the domain name of the sender.
ii. If the SPF record of the domain confirms that the incoming IP address is a trusted source, the connection is granted. The verification function is SPFok, which returns 0 or 1.
iii. Every result of an SPF request is cached by the function SPFCacheAdd; the record is kept in the cache until its expiry time and then removed by the function cleanCacheSPF.

Figure 3.1: System Architecture (Part 1)

3.2.4 DNS BlackList (DNSBL)

Figure 3.2 shows the filters that follow those in Figure 3.1. A DNS BlackList (DNSBL), also known as a RealTime BlackList (RBL), is similar to SenderBase, but it keeps only blacklisted IP addresses, while SenderBase keeps both a blacklist and a whitelist of IP addresses with weights. The blacklists are maintained by various parties, and users choose which to use. Popular sources are spamhaus.org and spamcop.net, which most Internet users recommend for their reliability. These organisations are run by volunteers, so the reliability and accuracy of the data remain questionable, even though it is said to be more than 80% accurate.
The method works as follows:

i. Given the IP address of the incoming connection, it is checked against a DNS record provided by the RBL provider. Three providers are currently used by this application: (1) bl.spamcop.net, (2) list.dsbl.org, (3) zen.spamhaus.org.
ii. The checking function is RBLok, which returns 0 for a bad status and 1 for a good status.

Figure 3.2: System Architecture (Part 2)

3.2.5 Recipient Validation

The last filter shown in Figure 3.2 is recipient validation. Each e-mail system should have its own database of users, which the proposed antispam system can use to ensure that only e-mail with a valid recipient is processed, avoiding unnecessary load on the server.

The method works as follows:

i. Each incoming e-mail is checked against LDAP for the validity of the recipient. If the recipient cannot be found in LDAP, the e-mail is rejected, so that no further processing is needed and CPU load is saved.
ii. The function that performs the LDAP query is LDAPquery, which returns 0 on failure and 1 on success.
iii. Each result returned by the LDAP server is kept in a memory cache, in the array LDAPlist.

3.3 Improvement To Expect

From the figures above, the parts that improve on a conventional antispam system can be identified:

i. The number of e-mails to scan by content will fall as more e-mails are eliminated at the network level.
ii. Besides the total number of messages to scan, the CPU load is also reduced, because untrusted sources are prevented from sending messages at the network level.
iii. In normal practice, 100% of incoming messages are processed and scanned for spam; with ASPBF, the number of messages to be scanned is expected to fall.
iv. The ultimate expectation of this simulation is to show that the techniques embraced by ASPBF help to improve the overall process of filtering spam.
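The expected reduction comes from running the cheap network-level checks first and stopping at the first rejection, so that only surviving messages reach the content filter. The sketch below illustrates this ordering with a simplified connection-rate check and a stand-in recipient check. It is not the ASPBF code: the parameter names are borrowed from Section 3.2.1, while the values, the function names and the in-memory recipient table (standing in for LDAP) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Illustrative values; the names follow the parameters of Section 3.2.1.
my %cfg = (
    maxSMTPipDuration   => 60,     # window (s) in which connections are counted
    maxSMTPipConnects   => 3,      # connections allowed per window
    maxSMTPipExpiration => 300,    # how long (s) a rejected IP stays rejected
);
my %ip_state;                                   # per-IP counters
my %valid_rcpt = ( 'user@example.com' => 1 );   # stand-in for the LDAP lookup

# Returns 1 to accept the connection, 0 to reject it.
sub traffic_ok {
    my ($conn) = @_;
    my ( $ip, $now ) = @$conn{qw(ip time)};
    my $s = $ip_state{$ip} //= { start => $now, tries => 0, blocked_until => 0 };
    return 0 if $now < $s->{blocked_until};
    # Start a new window once a block has expired or the window has passed.
    if ( $s->{blocked_until} || $now - $s->{start} > $cfg{maxSMTPipDuration} ) {
        %$s = ( start => $now, tries => 0, blocked_until => 0 );
    }
    if ( ++$s->{tries} > $cfg{maxSMTPipConnects} ) {
        $s->{blocked_until} = $now + $cfg{maxSMTPipExpiration};
        return 0;
    }
    return 1;
}

# Returns 1 if the recipient is known, 0 otherwise.
sub recipient_ok {
    my ($conn) = @_;
    return exists $valid_rcpt{ $conn->{rcpt} } ? 1 : 0;
}

# Run the filters in order; return the name of the first one that rejects,
# or an empty string if the message should go on to content scanning.
sub first_rejection {
    my ($conn) = @_;
    my @chain = ( [ traffic => \&traffic_ok ], [ recipient => \&recipient_ok ] );
    for my $f (@chain) {
        return $f->[0] unless $f->[1]->($conn);
    }
    return '';
}

for my $t ( 0 .. 4 ) {
    my $r = first_rejection( { ip => '192.0.2.1', time => $t, rcpt => 'user@example.com' } );
    print "t=$t: ", ( $r eq '' ? 'pass to content filter' : "rejected by $r" ), "\n";
}
```

With these example values, the first three connections from the same address pass and the later ones are rejected by the traffic check, so they never reach the recipient check or the content filter.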
3.4 Sample Data

The data used in this test are categorised by the following characteristics:

i. Spam coming from IP addresses that are recognised as spam sources
This type of spam should be eliminated at the network level, preventing further processing load.

ii. Spam or e-mail with an invalid recipient
The connection is terminated as soon as the recipient in the e-mail header is found to be invalid.

iii. Spam in plain text
Plain-text spam is the basic form of spam. It can be used to measure the performance of content-based scanning.

iv. Spam in a mix of text and image
This requires both content scanning and image scanning, and Bayesian filtering is likely to give a negative result.

v. Spam in image
A Bayesian filter is expected to fail to capture this spam.

3.5 Requirements for the Simulation

The proposed antispam system will be developed in the Perl programming language, as it is the language SpamAssassin uses to integrate with its Bayesian filter. This makes it easier to set up the antispam system with Bayesian filtering for benchmarking against the new system. Furthermore, a system developed in Perl can easily be integrated with existing e-mail software.

A Unix machine is recommended both for developing the system and for running the benchmark, because:

i. Unnecessary applications running in the operating system can be closed so that they do not affect the performance benchmark.
ii. The antispam system will be developed in Perl, which is well suited to Unix-based systems. Perl was also used to develop the existing antispam system (SpamAssassin), so using the same platform provides a fair comparison.
3.5.1 Development Tools As to ensure the Antispam with Pattern-Based Filter (ASPBF) is comparable to existing SpamAssassin with Bayesian from the view of processing load and performance, Perl programming language will be used to reduce biasness when comparing it with SpamAssassin which is also developed using the same programming language. 37 Although applications developed in Perl are not performing better that those developed in C in term of speed of execution, but the objective of this development is to show results from both systems, one with ASPBF another one with Bayesian. Comparing those results will potentially prove that an improvement can be made to existing antispam solution. In this development, multi-threaded programming is used to ensure the application is able to handle more concurrent connections. 3.5.2 Prototyping The ASPBF was developed using Perl programming language with all the necessary modules were pre-loaded and included in the main program. The logics of each of the filter will be described below. i. Traffic Shaping The records of IP address and their history are kept in local cache, each record will have its expiry time. Exceeding the expiry time will cause the record to be removed from the cache. sub PBOK { my($fh,$myip,$lvl)=@_; my $this=$Con{$fh}; $myip = $this->{cip} if $this->{ispip} && $this->{cip}; d('PBOK'); $this->{prepend}=""; my $t=time; my $ip=ipNetwork($myip,($PenaltyUseNetblocks ? 24 : 32)); return 1 if (!exists $PBBlack{$ip}); my($ct,$ut,$level,$totalscore,$sip,$reason)=split(" ",$PBBlack{$ip}); Figure 3.3: Function of Traffic Shaping (Part 1) 38 $this->{mypbreason}="totalscore for $this->{ip} is $totalscore, last penalty was '$reason'"; return 1 if $totalscore<$PenaltyLimit; $this->{prepend}="[PenaltyBox]"; mlog( $fh, "[monitoring] $this->{mypbreason}" ) if $DoPenalty == 2 || $DoPenalty == 3; return 1 if $DoPenalty==2 || $DoPenalty==3; return 0; } Figure 3.4: Function of Traffic Shaping (Part 2) ii. 
Sender Reputation The record of each IP address will be tested with a SenderBase’s provider, in this study, the provider is test.senderbase.org. Provided with an IP address: • Check the cache for a record of that IP address • If not found in the cache, do DNS query to test.senderbase.org • Receive the value, if the value falls under whitelist, record the information in Penalty Box but as a whitelist IP address. Otherwise if a blacklist value is received, record it in the Penalty Box as blacklisted. #queries the SenderBase service sub SenderBaseOK { my ( $fh, $ip, $domain ) = @_; my $this = $Con{$fh}; $ip = $this->{cip} if $this->{ispip} && $this->{cip}; my $this = $Con{$fh}; d('SenderBaseOK'); my $results; my $query; my $cache; Figure 3.5: Function of Sender Reputation (Part 1) 39 my $skip; my $tlit; return 1 if !$CanUseSenderBase; return 1 if $this->{ip} =~ /$IPprivate/; #return 1 if $this->{contentonly}; my ( $ipcountry, $orgname, $domainname ) = split( "-", SBCacheFind($ip) ); my $blacklistscore; my $fortune1000; my $domainrating; my $bondedsender; my $domainname; my $resultip; if ( !SBCacheFind($ip) ) { eval { $query = Net::SenderBase::Query->new( Transport => 'dns', Address => $ip, Host => 'test.senderbase.org', Timeout => 10, ); $results = $query->results; }; if ($@) { &sigon(); return 1; } $bondedsender = $results->ip_in_bonded_sender; $blacklistscore = $results->ip_blacklist_score; $orgname = $results->org_name; $resultip = $results->ip; $fortune1000 = $results->org_fortune_1000; $domainname = $results->domain_name; $domainrating = $results->domain_rating; Figure 3.6: Function of Sender Reputation (Part 2) 40 $ipcountry = $results->ip_country; SBCacheAdd( $ip, 1, "$ipcountry-$orgname-$domainname" ); &sigon(); } else { $cache = 1; } if ($DoOrgBlocking) { $this->{prepend} = "[Organization]"; $tlit = "[monitoring]" if $DoOrgBlocking == 2; $tlit = "[scoring]" if $DoOrgBlocking == 3; if ($orgname =~ ( '(' . $whiteSenderBaseRE . ')' ) || $domainname =~ ( '(' . 
                    $whiteSenderBaseRE . ')' )) {
                mlogRe( $fh, $1, "WhiteOrg" );
                $this->{noprocessing} = 1;
                $this->{noprocessingreason} = "senderbase regex";
                pbWhiteAdd( $fh, $ip, "WhiteSenderBase:$1" );
                my $nsborgValencePB = 0 - $sborgValencePB if $sborgValencePB;
                pbAdd( $fh, $ip, $nsborgValencePB, "WhiteSenderBase:$1" ) if $DoOrgBlocking != 2;
                mlog( $fh, "$tlit White Organization in SenderBase:$1", 1 ) if $SenderBaseLog == 2;
                return 1;
            }

Figure 3.7: Function of Sender Reputation (Part 3)

            if ($orgname =~ ( '(' . $blackSenderBaseRE . ')' )
                || $domainname =~ ( '(' . $blackSenderBaseRE . ')' )) {
                mlogRe( $fh, $1, "BlackOrg" );
                pbWhiteDelete( $fh, $ip, "BlackSenderBase:$1" );
                pbAdd( $fh, $ip, $sborgValencePB, "BlackSenderBase:$1" ) if $DoOrgBlocking != 2;
                $this->{messagereason} = "Black Organization in SenderBase:$1";

Figure 3.8: Function of Sender Reputation (Part 4)

                return 0 if $DoOrgBlocking == 1;
                mlog( $fh, "$tlit $this->{messagereason}", 1 ) if $SenderBaseLog == 2;
            }
        }
        return 1 if !$DoCountryBlocking;
        return 1 if $ipcountry =~ $NoCountryCodeReRE;
        $this->{mycountry} = 0;
        if ( $ipcountry && $DoCountryBlocking
            && ($ipcountry =~ $CountryCodeBlockedReRE
                || ($CountryCodeBlockedReRE =~ /all|\*/i
                    && $ipcountry !~ $MyCountryCodeReRE
                    && $ipcountry !~ $CountryCodeReRE))) {
            $this->{messagereason} = "Blocked Country $ipcountry $orgname";
            $this->{prepend} = "[CountryCode]";
            $tlit = "[monitoring]" if $DoCountryBlocking == 2;
            $tlit = "[scoring]" if $DoCountryBlocking == 3;
            pbAdd( $fh, $ip, $bccValencePB, "BlockedCountry$ipcountry") if $DoCountryBlocking != 2;
            mlog( $fh, "$tlit $this->{messagereason}", 1 ) if $DoCountryBlocking == 2 || $DoCountryBlocking == 3;
            SBCacheAdd( $ip, 1, "$ipcountry-$orgname-$domainname" );
            return 0 if $DoCountryBlocking == 1 || $DoCountryBlocking == 4;
            return 1;
        }
        return 1 if !$DoSenderBase;
        return 1 if !$CountryCodeRe && !$MyCountryCodeReRE;
        $tlit = "[monitoring]" if $DoSenderBase == 2;
        $tlit = "[scoring]" if $DoSenderBase == 3;

Figure 3.9: Function of Sender Reputation (Part 5)

        $sbhccValencePB = 0 - $sbhccValencePB if $sbhccValencePB;
        if ( $sbhccValencePB < 0 && $ipcountry && $MyCountryCodeReRE && $ipcountry =~ $MyCountryCodeReRE ) {
            $this->{prepend} = "[CountryCode]";
            $this->{mycountry} = 1;
            $this->{messagereason} = "Home Country Bonus $ipcountry $orgname";
            mlog( $fh, "$tlit $this->{messagereason}", 1 ) if $SenderBaseLog == 2;
            pbAdd( $fh, $ip, $sbhccValencePB, "HomeCountry-$ipcountry" ) if $DoSenderBase != 2;
            return 1;
        }
        if ( $sbfccValencePB && $ipcountry && $MyCountryCodeReRE
            && $ipcountry !~ $MyCountryCodeReRE
            && !(
                $CountryCodeReRE && $ipcountry =~ $CountryCodeReRE ) ) {
            $this->{messagereason} = "Foreign Country $ipcountry $orgname";
            pbAdd( $fh, $ip, $sbfccValencePB, "CountryCode-$ipcountry", 1 ) if $DoSenderBase != 2;
            $this->{prepend} = "[CountryCode]";
            mlog( $fh, "$tlit $this->{messagereason}", 1 ) if $SenderBaseLog == 2;
            return 1;
        }

Figure 3.10: Function of Sender Reputation (Part 6)

        if ( $sbfccValencePB && $ipcountry && $MyCountryCodeReRE
            && $ipcountry !~ $MyCountryCodeReRE
            && ( $CountryCodeReRE && $ipcountry =~ $CountryCodeReRE ) ) {
            $this->{messagereason} = "Foreign & Suspicious Country $ipcountry - $orgname";
            pbAdd( $fh, $ip, $sbfccValencePB + $sbsccValencePB, "CountryCode-$ipcountry", 1 ) if $DoSenderBase != 2;
            $this->{prepend} = "[CountryCode]";
            mlog( $fh, "$tlit $this->{messagereason}", 1 ) if $SenderBaseLog;
            return 1;
        }
        if ( $sbsccValencePB && $ipcountry && $CountryCodeReRE && $ipcountry =~ $CountryCodeReRE ) {
            $this->{messagereason} = "Suspicious Country $ipcountry - $orgname";
            pbAdd( $fh, $ip, $sbsccValencePB, "CountryCode-$ipcountry", 1 ) if $DoSenderBase != 2;
            $this->{prepend} = "[CountryCode]";
            mlog( $fh, "$tlit $this->{messagereason}", 1 ) if $SenderBaseLog;
            return 1;
        }
        mlog( $fh, "SenderBase showing: $ipcountry - $orgname", 1 ) if $SenderBaseLog == 2 && $ipcountry;
        return 1;
    }

Figure 3.11: Function of Sender Reputation (Part 7)

iii. Sender Policy Framework

This filter requires version 2 of the SPF module so that SPF records can be extracted regardless of which version of SPF record is used. Given an IP address and the domain name of the sender:

• Check for a matching record in the local cache.
• If it is not found, send a DNS query to the DNS server known to keep the records of the domain.
• Receive the SPF record.
• Record the information in the cache for future use.
• If a fail value is received, record it in the Penalty Box cache for later use by Traffic Shaping.
• Return the value of the SPF record to the calling function.

    # uses Mail::SPF v2.005
    sub SPF2ok {
        my $fh = shift;
        my $this = $Con{$fh};
        my $ip = $this->{ip};
        my $strict;
        my $block;
        $ip = $this->{cip} if $this->{ispip} && $this->{cip};
        my $helo = $this->{helo};
        $helo = $this->{ciphelo} if $this->{ispip} && $this->{cip};
        undef $helo if $this->{ispip} && $this->{cip} && !$this->{ciphelo};
        &sigon();
        my $mValidateSPF = $ValidateSPF;
        $mValidateSPF = 3 if $switchTestToScoring && $DoPenaltyMessage && ( $spfTestMode || $allTestMode );
        my $tlit = "";
        $tlit = "[monitoring]" if $mValidateSPF == 2;
        $tlit = "[scoring]" if $mValidateSPF == 3;
        $this->{prepend} = "[SPF]";
        my ( $spf_result, $local_exp, $authority_exp, $spf_record, $spf_fail, $received_spf );
        my $mf = lc $this->{mailfrom};
        my $mfd = $1 if $mf =~ /\@(.*)/;
        my ( $cachetime, $cresult, $cdomain, $chelo ) = SPFCacheFind($ip);

Figure 3.12: Function of Sender Policy Framework (Part 1)

        if ($cachetime) {
            #check for valid cache hit
            # mlog( $fh, "SPF $tlit: (cache) $cresult $cdomain $chelo", 1, 1 ) if $SPFLog == 2;
            if ( ( $mfd eq $cdomain ) && ( lc $helo eq $chelo or $ip eq $this->{cip}) ) {
                $spf_result = $cresult;
            }
        }
        if ( !$spf_result ) {
            $cresult = "";
            my $query;
            eval {
                my ( $identity, $scope );
                if ($mfd) {
                    $identity = $this->{mailfrom};
                    $scope = 'mfrom';
                } else {
                    $identity = $helo;
                    $scope = 'helo';
                }
                my $res = Net::DNS::Resolver->new(
                    nameservers => \@nameservers,
                    tcp_timeout => $DNStimeout,
                    udp_timeout => $DNStimeout,
                    retrans     => $DNSretrans,
                    retry       => $DNSretry
                );
                my $spf_server = Mail::SPF::Server->new(
                    hostname     => $myName,
                    dns_resolver => $res
                );
                my $request = Mail::SPF::Request->new(
                    versions      => [ 1, 2 ],
                    scope         => $scope,
                    identity      => $identity,
                    ip_address    => $ip,
                    helo_identity => $helo
                );
                my $result = $spf_server->process($request);
                $spf_record = $request->record;
                $spf_result = $result->code;
                $local_exp = $result->local_explanation;
                $authority_exp = $result->authority_explanation if $result->is_code('fail');
                $received_spf = $result->received_spf_header;
            };

Figure 3.13: Function of Sender Policy Framework (Part 2)

            #exception check
            if ($@) {
                mlog( $fh, "SPF2OK: $@", 1, 1 ) if $ExceptionLogging;
                &sigon();
                return 1;
            }
            &sigon();
            if ( $SPFCacheInterval > 0 && ( $spf_result !~ /error/ ) ) {
                SPFCacheAdd( $ip, $spf_result, $mfd, $helo );
            }
        }
        if (   $spf_result eq 'fail'
            || $spf_result eq 'softfail' && $SPFsoftfail
            || $spf_result eq 'softfail' && $strict
            || $spf_result eq 'neutral' && $SPFneutral
            || $spf_result eq 'neutral' && $strict
            || $spf_result eq 'none' && $SPFnone
            || $spf_result =~ /unknown/ && $SPFunknown
            || $spf_result =~ /error/ && $SPFqueryerror )
        {
            $spf_fail = 1;
            pbWhiteDelete( $fh, $ip );
        } elsif ( $spf_result eq 'pass' ) {
            $spf_fail = 0;
            $this->{spfok} = 1;
        }
        $received_spf = "SPF: $spf_result";
        $received_spf .= " (cache)" if $cresult;
        $received_spf .= " ip=$ip";
        $received_spf .= " mailfrom=$this->{mailfrom}" if ( defined( $this->{mailfrom} ) );
        $received_spf .= " helo=$helo" if ( defined($helo) );
        mlog( $fh, "$tlit $received_spf", 0, 1 ) if $SPFLog;
        return 1 if $mValidateSPF == 2;
        $this->{messagereason} = "SPF $spf_result";

Figure 3.14: Function of Sender Policy Framework (Part 3)

        if ( $spf_result eq 'neutral' && $spfnValencePB ) {
            pbAdd( $fh, $ip, $spfnValencePB, "SPF$spf_result" );
        } elsif ( $spf_result eq 'softfail' && $spfsValencePB ) {
            pbAdd( $fh, $ip, $spfsValencePB, "SPF$spf_result" );
        } elsif ( $spf_result eq 'none' && $spfnonValencePB ) {
            pbAdd( $fh, $ip, $spfnonValencePB, "SPF$spf_result" );
        } elsif ( $spf_result =~ /unknown/ && $spfuValencePB ) {
            pbAdd( $fh, $ip, $spfuValencePB, "SPF$spf_result" );
        } elsif ( $spf_result =~ /error/ && $spfeValencePB ) {
            pbAdd( $fh, $ip, $spfeValencePB, "SPF$spf_result" );
        } elsif ( $spf_fail == 1 && $spfValencePB ) {
            pbAdd( $fh, $ip, $spfValencePB, "SPF$spf_result" );
        }
        $this->{messagereason}.$strict if $strict;
        return 1 if $mValidateSPF == 3 && !$block;
        if ( $spf_fail == 1 ) {
            # SPF fail (by our local rules)
            my $reply = $SPFError;
            $reply =~ s/SPFRESULT/$local_exp/g;
            $Stats{spffails}++ unless $slok;
            $this->{prepend} = "[SPF]";
            thisIsSpam( $fh, "SPF $spf_result", $SPFFailLog, $reply, $spfTestMode, $slok, 0 );
            return 0;
        }
        return 1;
    }

Figure 3.15: Function of Sender Policy Framework (Part 4)

iv. DNS Blacklist (DNSBL)

A DNSBL is also known as a Realtime Blacklist (RBL). Given an IP address:

• Check for the IP address in the cache.
• If it is not found, send DNS queries to a list of DNSBL providers.
• Receive the value. If any value is received, the IP address is considered blacklisted. If no value is received before the timeout, the IP address is considered clean.
• Keep a record of the information in the local cache.
• Keep the blacklisted IP address in the Penalty Box cache.
• Return the value to the calling function.

    sub RBLok {
        my ($fh) = @_;
        my $this = $Con{$fh};
        my $reason;
        my $rblweighttotal;
        my $ip = $this->{ip};
        $ip = $this->{cip} if $this->{ispip} && $this->{cip};
        return 1 if $this->{rbldone};
        $this->{rbldone} = 1;
        d('RBLok');
        return 1 if !$ValidateRBL;
        return 1 if !$CanUseRBL;
        ##return 1 if $this->{contentonly};
        return 1 if $this->{whitelisted} && !$RBLWL;
        my $slok = $this->{allLoveRBLSpam} == 1;
        my $mValidateRBL = $ValidateRBL;
        $this->{testmode} = $rblTestMode || $allTestMode;
        $mValidateRBL = 3 if $ValidateRBL==1 && $switchTestToScoring && $DoPenaltyMessage && $this->{testmode};
        my $tlit;
        $tlit = "" if $mValidateRBL == 1;
        $tlit = "[monitoring]" if $mValidateRBL == 2;
        $tlit = "[scoring]" if $mValidateRBL == 3;
        $this->{prepend} = "[DNSBL]";
        my $rbl = eval {
            RBL->new(
                lists       => [@rbllist],
                server      => $nameservers[0],
                max_hits    => $RBLmaxhits,
                max_replies => $RBLmaxreplies,
                query_txt   => 1,
                max_time    => $RBLmaxtime,
                timeout     => $RBLsocktime
            );
        };

Figure 3.16: Function of DNS Blacklist (Part 1)

        # add exception check
        if ($@) { &sigon(); return 1; }
        my ( $received_rbl, $rbl_result, $lookup_return );
        $lookup_return = $rbl->lookup( $ip, "RBL" );
        &sigon();
        mlog($fh,"error: RBL check failed : $lookup_return") if ($lookup_return ne 1);
        return 1 if ($lookup_return ne 1);
        my @listed_by = $rbl->listed_by();
        my $rbls_returned = $#listed_by + 1;
        if ( $rbls_returned > 0 ) {
            foreach (@listed_by) {
                $rblweighttotal += weight($rblweight{$_}) if $rblweight{$_};
            }
            my $rblweight = $rblValencePB;
            my $rblweightn = $rblnValencePB;
            $rblweight = $rblweightn = $rblweighttotal if $rblweighttotal;
            $reason = $this->{messagereason} = "";
            if ( $rbls_returned >= $RBLmaxhits && !$rblweighttotal || $rblweighttotal >= $RBLmaxweight) {
                pbWhiteDelete( $fh, $ip );
                $this->{messagereason} = "DNSBL: failed, $ip listed in @listed_by";
                pbAdd( $fh, $ip, $rblweight, "DNSBLfailed" ) if $mValidateRBL != 2;
                $received_rbl = "DNSBL: failed, $ip listed in (";
                RBLCacheAdd( $ip, "1", "@listed_by" ) if $RBLCacheExp > 0;

Figure 3.17: Function of DNS Blacklist (Part 2)

            } else {
                pbWhiteDelete( $fh, $ip );
                $this->{messagereason} = "DNSBL: neutral, $ip listed in @listed_by";
                $this->{prepend} = "[DNSBL]";
                mlog( $fh, "[scoring] DNSBL: neutral, $ip listed in @listed_by" ) if ( $RBLLog && $mValidateRBL == 1 );
                pbAdd( $fh, $ip, $rblweightn, "DNSBLneutral" ) if $mValidateRBL != 2;
                $this->{rblneutral} = 1;
                RBLCacheAdd( $ip, "1", "@listed_by" ) if $RBLCacheExp > 0;
                $received_rbl = "DNSBL: neutral, $ip listed in (";
            }
            foreach (@listed_by) {
                $received_rbl .= "$_<-" . $rbl->{results}->{$_} . "; ";
            }
            $received_rbl .= ")";
        } else {
            $received_rbl = "DNSBL: pass";
            RBLCacheAdd( $ip, "2") if $RBLCacheExp > 0;
            return 1;
        }
        mlog( $fh, "$tlit ($received_rbl)" ) if $received_rbl ne "DNSBL: pass" && ($RBLLog == 2 || $RBLLog && $mValidateRBL >= 2 );
        return 1 if $mValidateRBL == 2;
        # add to our header; merge later, when client sent own headers
        $this->{myheader} .= "X-Assp-$received_rbl\r\n" if $AddRBLHeader && $received_rbl ne "DNSBL: pass";
        if ( $rbls_returned >= $RBLmaxhits && !$rblweighttotal || $rblweighttotal >= $RBLmaxweight) {
            my $slok = $this->{allLoveRBLSpam} == 1;
            $Stats{rblfails}++;

Figure 3.18: Function of DNS Blacklist (Part 3)

            return 1 if $mValidateRBL == 3;
            my $reply = $RBLError;
            $reply =~ s/RBLLISTED/@listed_by/g;
            $this->{prepend} = "[DNSBL]";
            thisIsSpam( $fh, "DNSBL, $ip listed in @listed_by", $RBLFailLog, "$reply", $rblTestMode, $slok, ( $slok || $rblTestMode || $this->{spamlover} ) );
            return 0;
        }
        return 1;
    }

Figure 3.19: Function of DNS Blacklist (Part 4)

v. Recipient Validation

Since the information on users in the local domain is kept in an LDAP server, an LDAP query is performed. Given the e-mail address of the recipient:

• Check the local cache for the existence of the e-mail address.
• If it is not found, query the LDAP server.
• If a record is found, return the value 1. If no record is found, return the value 0 to the calling function.
• Keep the e-mail address in the cache.

    sub LDAPQuery {
        my ($ldapflt, $ldaproot, $current_email) = @_;
        my $retcode;
        my $retmsg;
        my @ldaplist;
        my $ldaplist;
        my $ldap;
        my $mesg;
        my $entry_count;
        $current_email = lc($current_email);
        if (exists $LDAPlist{$current_email}) {
            mlog(0,"LDAP - found $current_email in LDAPlist") if $LDAPLog;
            d("found $current_email in LDAP-cache");
            {
                lock($lockLDAPlist);
                my ($vt,$vl) = split(/ /,$LDAPlist{$current_email});

Figure 3.20: Function of Recipient Validation (Part 1)

                # refresh the timestamp of the cached entry
                if ($vl) {
                    $LDAPlist{$current_email}=time." $vl";
                } else {
                    $LDAPlist{$current_email}=time;
                }
            }
            return 1;
        }
        d("doing LDAP lookup with $ldapflt in $ldaproot");
        @ldaplist = split(/\|/,$LDAPHost);
        $ldaplist = \@ldaplist;
        my $scheme = $DoLDAPSSL ? 'ldaps' : 'ldap';
        $ldap = Net::LDAP->new( $ldaplist, timeout => $LDAPtimeout, scheme => $scheme );
        if(! $ldap) {
            mlog(0,"Couldn't contact LDAP server at $LDAPHost -- check ignored");
            # seterror($fh,"451 Could not check recipient, try later\r\n",1);
            &sigon();
            return !$LDAPFail;
        }
        # bind to a directory anonymous or with dn and password
        if ($LDAPLogin) {
            $mesg = $ldap->bind($LDAPLogin, password => $LDAPPassword, version => $LDAPVersion);
        } else {
            # mlog($fh,"LDAP anonymous bind");
            $mesg = $ldap->bind( version => $LDAPVersion );
        }
        $retcode = $mesg->code;
        if ($retcode) {
            mlog(0,"LDAP bind error: $retcode -- check ignored",1);
            $ldap->unbind;
            &sigon();
            return !$LDAPFail;
        }
        # perform a search
        $mesg = $ldap->search(base => $ldaproot, filter => $ldapflt, attrs => ['cn'], sizelimit => 1 );
        $retcode = $mesg->code;
        &sigon();
        # mlog($fh,"LDAP search: $retcode");
        if($retcode > 0 && $retcode != 4) {
            mlog( 0, "LDAP search error: $retcode -- '$ldapflt' check ignored", 1 ) ;
            #seterror($fh,"451 Could not check recipient, try later\r\n",1);
            $ldap->unbind;
            return !$LDAPFail;
        }

Figure 3.21: Function of Recipient Validation (Part 2)

        $entry_count = $mesg->count;
        $retmsg = $mesg->entry(1);
        mlog(0,"LDAP Results $ldapflt: $entry_count : $retmsg") if $LDAPLog;
        d("got $entry_count result(s) from LDAP lookup");
        $mesg = $ldap->unbind; # take down session
        if($entry_count && $ldaplistdb) {
            lock($lockLDAPlist);
            $LDAPlist{$current_email}=time;
            mlog(0,"LDAP added $current_email to LDAPlist") if $LDAPLog;
            d("added $current_email to LDAP-cache");
        }
        return $entry_count;
    }

Figure 3.22: Function of Recipient Validation (Part 3)

3.5.3 Implementation

ASPBF was installed in a company that provided e-mail services.
However, the simulation was carried out in a controlled environment in which a small percentage of traffic was diverted from the internet to the machine hosting ASPBF, and another percentage was redirected to a second machine hosting the antispam without ASPBF. This was a better simulation than a lab simulation because real traffic was involved and all filtering components worked as they were expected to, whereas a lab simulation may give a less accurate result.

3.5.4 Simulation Environment

The simulation was conducted in a corporate environment whose business is providing e-mail services to internet users. The company receives about one million e-mails a day, of which an estimated 70% are spam. It offered to host the simulation in its environment but asked to remain anonymous. Running the simulation in a real corporate environment gave the opportunity to obtain a result that was close to accurate, and reduced the effort that would otherwise be required to prepare a laboratory environment for the same simulation.

The simulation was run on a Solaris machine with a 4-core SPARC CPU. The advantage of this model was that it provided enough CPU power to host a multi-threaded application. Each core of the 4-core CPU contains 4 thread processors, so altogether 16 CPUs were available to the application.

Two machines were used during the simulation: the first for ASPBF and the second for the antispam with Bayesian. The first machine received 90,807 e-mail connections in a day, equivalent to about 10% of total traffic. The second machine received about 140,000 e-mail connections, equivalent to about 20% of total traffic. Because this test was conducted in a real enterprise environment, the share of traffic sent to ASPBF was limited to 10% in order to reduce the risk of being flooded with spam, since ASPBF does not scan message content and some spam would therefore pass through.
Meanwhile, the second machine received 20% because it could cope with the incoming spam: it processes the content for spam and eliminates it. Because of the heavy load, the application was set to run in multi-threaded mode with a total of 64 threads, which enabled it to receive 64 concurrent connections. The actual machines are capable of receiving about 120 concurrent connections; however, because of limitations in memory management, this test was limited to 64 concurrent connections.

3.6 Test Methodology

The test will be carried out by generating traffic with various spam characteristics, which include:

i. IP addresses that are rejected by Traffic Shaping
ii. IP addresses that are rejected by Sender Reputation
iii. IP addresses that generate a large amount of traffic in a short period of time
iv. IP addresses and domain names that are rejected due to an invalid SPF record
v. E-mails with invalid recipients

Figure 3.23: Placement of the test components

Figure 3.23 shows the placement of the test data, the ASPBF and the expected outcome. The total number of e-mails filtered by ASPBF and the total number that pass through the filter are crucial in identifying the effectiveness of the filter.

Although the test could be carried out in a lab environment, the best way to do it is with real traffic. This saves the time needed to develop a tester program able to change IP addresses, e-mail recipients and e-mail senders, populate a database with users, populate domains with SPF records, and perform many other tasks that would require more attention than developing the solution itself.

The total number of e-mails that pass through determines the amount of load required to process e-mails based on their contents. At the same time, the total number of connections dropped shows how many unnecessary connections were eliminated to save processor load.
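The two totals named above can be turned into percentages with a few lines of code. The following is a minimal sketch in Python (used here only for brevity; it is not part of the ASPBF Perl code), and the function name and the counts in the usage line are hypothetical values chosen purely for illustration.

```python
def filter_effectiveness(total_connections: int, dropped: int) -> dict:
    """Return dropped and passed connections as counts and percentages."""
    if total_connections <= 0:
        raise ValueError("total_connections must be positive")
    if not 0 <= dropped <= total_connections:
        raise ValueError("dropped must be between 0 and total_connections")
    passed = total_connections - dropped
    return {
        "dropped": dropped,
        "passed": passed,
        "dropped_pct": 100.0 * dropped / total_connections,
        "passed_pct": 100.0 * passed / total_connections,
    }


# Hypothetical counts, used only to show the calculation.
print(filter_effectiveness(total_connections=1000, dropped=750))
```

The passed share indicates how much traffic the content filters still have to process, while the dropped share indicates how much processing was avoided at network level.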
3.7 Summary

The method described in this chapter was designed to meet the objectives stated in Chapter 1. With the right structure, the method may improve overall performance because it introduces an elimination process at network level, requires less CPU load yet processes faster owing to less intensive calculation and in-memory scanning, and may improve the catch rate, since the Bayesian filter is known to have problems processing image spam. The combination of these improvements should lead to a better result, with less potential spam and a better spam catch rate. Elimination before content processing also improves performance because it uses less CPU.

CHAPTER 4

RESULTS AND ANALYSIS

The focus of this chapter is to discuss briefly the result of the simulation and to analyse that result in order to identify the necessity of the solution in the real-world fight against spam. The result would then be useful in studying further improvements in the future.

4.1. Introduction

The test was carried out in a real corporate environment as described in Chapter 3, and gave the expected result. The result consists of the outcomes from two different antispam systems: the first from ASPBF and the second from an antispam with Bayesian. Both results are analysed and compared in order to show that the ASPBF implementation has brought an improvement. That information provides evidence of an improvement over the existing antispam, which will be useful for improving antispam as a whole in the future.

4.2. Result

The test captured each of its activities in a log file. Every filter used in identifying spam was also logged, to make it easier to know which filter blocked the most.
The log of activity of each component of ASPBF is shown below:

i) DNSBL

    Mar-5-09 00:16:32 83392-05790 [Worker_2] [DNSBL] 24.232.0.248 <apiarg@fibertel.com.ar> [scoring] DNSBLcache: neutral, 24.232.0.248 listed in blackholes.five-ten-sg.com
    Mar-5-09 00:16:34 83394-02630 [Worker_11] [DNSBL] 116.12.52.167 <Muse00261d43001b61@email1.bdmg.museondemand.com> [scoring] DNSBLcache: neutral, 116.12.52.167 listed in ix.dnsbl.manitu.net
    Mar-5-09 00:16:38 83398-01115 [Worker_15] 209.173.98.220 <bsgrainy@chibardun.net> Message-Score: total for this message is 77, added 77 for DNSBLcache: failed, 209.173.98.220 listed in bl.spamcop.net, psbl.surriel.com
    Mar-5-09 00:16:50 83410-06242 [Worker_43] 209.133.56.154 <bounceOffers-0a.033n.4.vf08k4qas6.0@Tradepubs.nl00.net> Message-Score: total for this message is 52, added 52 for DNSBLcache: failed, 209.133.56.154 listed in dnsbl-2.uceprotect.net, dnsbl-3.uceprotect.net

Figure 4.1: Logs For DNSBL Filter

Figure 4.1 above shows which IP addresses were checked against DNSBL records. A "neutral" result shows that the IP address was moderately dangerous, while a "failed" status shows that the IP address was a confirmed source of spam. The log also shows which DNSBL sources were used. Some of the reliable sources are spamcop.net, surriel.com and uceprotect.net.
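The lookup behind these log entries follows the usual DNSBL convention: the octets of the IPv4 address are reversed and prepended to the blacklist zone, and the resulting name is queried in DNS. The sketch below, in Python for brevity, only builds the query name and classifies a result; it performs no DNS query. The classification by number of listings is a simplification of the logic in Figures 3.16 to 3.19 (it ignores per-list weights), and the threshold of two hits is an assumption that merely agrees with the entries in Figure 4.1.

```python
import ipaddress


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query for an IPv4 address in a DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def classify_listing(listed_by: list, max_hits: int = 2) -> str:
    """Classify an address by the number of blacklists that list it.

    max_hits is an assumed threshold, not a value taken from the study.
    """
    if not listed_by:
        return "pass"
    if len(listed_by) >= max_hits:
        return "failed"
    return "neutral"


print(dnsbl_query_name("209.173.98.220", "bl.spamcop.net"))
print(classify_listing(["bl.spamcop.net", "psbl.surriel.com"]))
print(classify_listing(["ix.dnsbl.manitu.net"]))
```

With these assumptions, an address listed by a single source is treated as "neutral" and only contributes a score, whereas an address listed by two or more sources is treated as "failed".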
ii) SPF

    Mar-5-09 00:16:32 83391-01219 [Worker_5] [SPF] 203.115.234.24 <newsletters@mystar.com.my> to: haiyent@LOCALDOMAIN SPF: none (cache) ip=203.115.234.24 mailfrom=newsletters@mystar.com.my helo=idcsmtp.arc.net.my
    Mar-5-09 00:16:39 83397-03276 [Worker_13] [SPF] 209.85.218.180 <botakchun@gmail.com> to: botak@LOCALDOMAIN SPF: pass (cache) ip=209.85.218.180 mailfrom=botakchun@gmail.com helo=mail-bw0f180.google.com
    Mar-5-09 00:16:42 83398-03745 [Worker_19] 209.85.219.158 <mail355@webairsa.info> to: effis@LOCALDOMAIN Message-Score: total for this message is 15, added 5 for SPF temperror
    Mar-5-09 00:17:05 83417-02279 [Worker_51] 88.231.11.116 <alor@LOCALDOMAIN> to: alor@LOCALDOMAIN Message-Score: total for this message is 37, added 10 for SPF fail
    Mar-5-09 00:17:06 83408-00173 [Worker_42] 209.62.76.34 <globalsteelweb@gmail.com> to: alumac@LOCALDOMAIN Message-Score: total for this message is 15, added 5 for SPF neutral
    Mar-5-09 00:17:06 83408-00173 [Worker_42] [SPF] 209.62.76.34 <globalsteelweb@gmail.com> to: alumac@LOCALDOMAIN [spam found] (SPF neutral) [News 04 March 09];

Figure 4.2: Logs For SPF Filter

In Figure 4.2 above, all local domain names are replaced with "LOCALDOMAIN" to keep the corporate environment anonymous, as required. In the above example, spammers attempted to pose as a local user in order to send spam intended for local users, but the local domain is protected by its SPF records, so those fake senders were rejected.

When "SPF fail" was issued, the connection was rejected immediately, because ASPBF could rely on the SPF records of the sender's domain and no other host can claim to send from that domain.
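The rejection of forged local senders can be illustrated with a much-reduced SPF check. The sketch below is written in Python for brevity and is not the Mail::SPF implementation used by ASPBF: it understands only the `ip4` and `all` mechanisms, whereas a full SPF evaluator also handles mechanisms such as `a`, `mx` and `include`. The record and the addresses are hypothetical documentation values.

```python
import ipaddress

QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def evaluate_spf(record: str, client_ip: str) -> str:
    """Evaluate a simplified SPF record (ip4 and all only) for one client IP."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"
    ip = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        qualifier = "+"  # a mechanism without a qualifier means "pass"
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        mechanism = term.lower()
        if mechanism.startswith("ip4:"):
            network = ipaddress.ip_network(mechanism[4:], strict=False)
            if ip in network:
                return QUALIFIERS[qualifier]
        elif mechanism == "all":
            return QUALIFIERS[qualifier]
    return "neutral"  # no mechanism matched


# Hypothetical record: only hosts in 192.0.2.0/24 may send for the domain.
record = "v=spf1 ip4:192.0.2.0/24 -all"
print(evaluate_spf(record, "192.0.2.10"))
print(evaluate_spf(record, "198.51.100.7"))
```

A sender that connects from outside the published range while claiming the protected domain receives "fail", which corresponds to the "SPF fail" case in which the connection is rejected.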
iii) Recipient Validation

    Mar-5-09 00:16:43 [Worker_20] LDAP Results (&(mail=yatie37@LOCALDOMAIN)(mailUserStatus=active)): 0 :
    Mar-5-09 00:16:43 [Worker_30] LDAP Results (&(mail=geecha@LOCALDOMAIN)(mailUserStatus=active)): 0 :
    Mar-5-09 00:16:43 [Worker_6] LDAP Results (&(mail=kamikaze125@LOCALDOMAIN)(mailUserStatus=active)): 1 :
    Mar-5-09 00:16:43 [Worker_27] LDAP Results (&(mail=matnor@LOCALDOMAIN)(mailUserStatus=active)): 1 :
    Mar-5-09 00:16:43 [Worker_30] LDAP Results (&(mail=geesee@LOCALDOMAIN)(mailUserStatus=active)): 0 :
    Mar-5-09 00:16:43 [Worker_11] LDAP Results (&(mail=chilux@LOCALDOMAIN)(mailUserStatus=active)): 1 :
    Mar-5-09 00:16:45 [Worker_37] LDAP Results (&(mail=keemin@LOCALDOMAIN)(mailUserStatus=active)): 0 :
    Mar-5-09 00:16:46 [Worker_8] LDAP Results (&(mail=paranth@LOCALDOMAIN)(mailUserStatus=active)): 0 :
    Mar-5-09 00:16:46 [Worker_16] LDAP Results (&(mail=greenhub@LOCALDOMAIN)(mailUserStatus=active)): 0 :

Figure 4.3: Logs For Recipient Validation Filter

Recipient Validation returns either "0" or "1", as shown in Figure 4.3 above. A value of 0 means an invalid recipient, while a value of 1 means a valid recipient.
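The filter strings in Figure 4.3 and the caching of valid recipients can be sketched as follows. This is an illustrative Python sketch, not the Net::LDAP code of Figures 3.20 to 3.22: the directory lookup is a caller-supplied function (a hypothetical stand-in for the LDAP search), and the escaping of special characters in the address is an addition that the logged filters do not show.

```python
def escape_ldap_value(value: str) -> str:
    """Escape characters that are special inside an LDAP search filter."""
    replacements = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}
    return "".join(replacements.get(ch, ch) for ch in value)


def recipient_filter(email: str) -> str:
    """Build a filter of the form seen in the recipient validation logs."""
    return "(&(mail=" + escape_ldap_value(email.lower()) + ")(mailUserStatus=active))"


class RecipientValidator:
    """Return 1 for a valid recipient and 0 otherwise, caching valid addresses."""

    def __init__(self, directory_lookup):
        # directory_lookup is a hypothetical callable: filter string -> entry count
        self._lookup = directory_lookup
        self._cache = set()

    def is_valid(self, email: str) -> int:
        email = email.lower()
        if email in self._cache:
            return 1
        if self._lookup(recipient_filter(email)):
            self._cache.add(email)
            return 1
        return 0


# Stand-in directory that knows a single hypothetical user.
validator = RecipientValidator(lambda flt: 1 if "user@example.com" in flt else 0)
print(validator.is_valid("User@example.com"))
print(validator.is_valid("nobody@example.com"))
```

Only valid addresses are cached in this sketch, so a repeated delivery to a known user does not trigger another directory query, while unknown addresses are always checked again.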
iv) Sender Reputation

    Mar-5-09 00:40:36 84833-00657 [Worker_23] 201.230.23.225 <thereseq@armkb.com> to: arbayah@tm.net.my [scoring] Blocked Country PE - Telefonica del Peru
    Mar-5-09 00:40:37 84835-09317 [Worker_20] 202.221.246.203 to: breeranna@bluehyppo.com [scoring] Blocked Country JP - Sumitomo Mitsui Banking Corporation
    Mar-5-09 00:40:39 84837-08322 [Worker_13] 86.54.102.131 <mail.ckxkigkhzrep@interiorsuae.msgfocus.com> to: benel_m@tm.net.my [scoring] Blocked Country GB - Adestra
    Mar-5-09 00:40:39 84836-04453 [Worker_9] 201.77.114.1 <marie79@msn.com> to: rezzack@bluehyppo.com [scoring] Blocked Country BR - Desktop Online Informática Ltda
    Mar-5-09 00:40:40 84837-01719 [Worker_41] 77.253.151.110 <tinaw@dropscout.com> to: ara@streamyx.com [scoring] Blocked Country PL - Netia SA
    Mar-5-09 00:40:40 84838-04337 [Worker_3] 189.41.78.29 <texi@malwaredomains.com> to: arcc@tm.net.my [scoring] Blocked Country BR
    Mar-5-09 00:40:43 84841-01628 [Worker_43] 81.222.243.196 <zapolar@km.ru> to: ykcheng@streamyx.com [scoring] Blocked Country RU - JSC FP Network
    Mar-5-09 00:40:43 84840-04668 [Worker_2] 92.124.181.82 <berthaJacob64@msn.com> to: pitt@bluehyppo.com [scoring] Blocked Country RU - OJSC Sibirtelecom
    Mar-5-09 00:40:44 84842-06900 [Worker_1] 203.216.226.119 <annahrichard005@yahoo.co.jp> to: ngadan@tm.net.my [scoring] Blocked Country JP - Yahoo
    Mar-5-09 00:40:46 84835-03436 [Worker_7] 220.231.207.235 <cgfjgjgjvcgdfb@126.com> to: ja3son8t@tm.net.my [scoring] Blocked Country CN - China Motion Network Communication

Figure 4.4: Logs For Sender Reputation Filter

Figure 4.4 above shows several "Blocked Country" entries; however, no connection was blocked, and the filter only contributed a score to the message. The score contributed by this filter is summed with the scores from the other filters in the final process of classifying the message as spam or not.
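The summation of scores can be sketched in a few lines. The Python sketch below is illustrative only: the per-filter scores are hypothetical, the limit of 50 is an assumed value, and treating a total equal to the limit as spam is also an assumption.

```python
def classify_message(scores: dict, limit: int = 50):
    """Sum the scores contributed by each filter and compare with the limit.

    Returns (total, is_spam). Treating total >= limit as spam is an assumption.
    """
    total = sum(scores.values())
    return total, total >= limit


# Hypothetical contributions from individual filters.
print(classify_message({"Blocked Country": 10, "SPF neutral": 5}))
print(classify_message({"Blocked Country": 10, "DNSBL failed": 52}))
```

A filter that runs in scoring mode therefore never rejects a message by itself; it only moves the total closer to the limit.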
v) Traffic Shaping

    Mar-5-09 02:25:47 91129-05313 [Worker_23] [PenaltyBox] 203.211.131.90 to: yeongwl@tm.net.my [monitoring] totalscore for 203.211.131.90 is 77, last penalty was 'DNSBLfailed'
    Mar-5-09 01:21:24 87277-02230 [Worker_48] [PenaltyBox] 207.58.183.173 <machines@myliberta.info> to: wellgrow@tm.net.my [monitoring] totalscore for 207.58.183.173 is 575, last penalty was 'IPfrequency'
    Mar-5-09 06:02:12 04114-07888 [Worker_13] [PenaltyBox] 66.114.120.16 <ibfx4@interbankfx.com> to: znsgroup@streamyx.com [monitoring] totalscore for 66.114.120.16 is 140, last penalty was 'SPFfail'
    Mar-5-09 00:16:37 83396-01564 [Worker_2] [MessageLimit] 24.232.0.248 <apiarg@fibertel.com.ar> to: lianhh@tm.net.my [spam found] (MessageScore 52, limit 50) [BATCH NUMBER 56T DTH78 SAR79];
    Mar-5-09 00:16:40 83399-06116 [Worker_10] [MessageLimit] 122.217.103.9 <performerinfo@angel-live.com> to: saka0805@streamyx.com [spam found] (MessageScore 57, limit 50) [B j J h A c s BT C F 9 B];
    Mar-5-09 00:16:41 83398-09333 [Worker_17] [MessageLimit] 89.6.178.75 <elemental@beahr.com> to: zahir@bluehyppo.com [spam found] (MessageScore 53, limit 50) [More orggasms];

Figure 4.5: Logs For Traffic Shaping Filter

Figure 4.5 shows IP addresses that were filtered because of their previous bad records in sending e-mails to the local server. In addition, each IP address was allowed to send only a limited number of e-mails within a certain period of time. Spammers typically try to send as many e-mails as possible within a short period. This pattern can therefore be recognised easily, and limiting the number of e-mails accepted from one source stops that type of spam.
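The two mechanisms described above, a penalty record with an expiry time and a limit on messages per source within a time window, can be sketched together. The Python class below is an illustration under stated assumptions and not the ASPBF implementation: the class name, all parameter names and all numeric values are hypothetical, and timestamps are passed in explicitly so that the behaviour is easy to follow.

```python
from collections import defaultdict, deque


class TrafficShaper:
    """Per-IP rate limit plus a penalty box whose records expire."""

    def __init__(self, max_messages, window_seconds, penalty_limit,
                 penalty_ttl, frequency_penalty):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self.penalty_limit = penalty_limit
        self.penalty_ttl = penalty_ttl
        self.frequency_penalty = frequency_penalty
        self._recent = defaultdict(deque)  # ip -> timestamps of accepted messages
        self._penalties = {}               # ip -> (total score, expiry time)

    def add_penalty(self, ip, score, now):
        total, expiry = self._penalties.get(ip, (0, 0))
        if expiry <= now:                  # an expired record starts again from zero
            total = 0
        self._penalties[ip] = (total + score, now + self.penalty_ttl)

    def allow(self, ip, now):
        total, expiry = self._penalties.get(ip, (0, 0))
        if expiry > now and total >= self.penalty_limit:
            return False                   # still in the penalty box
        window = self._recent[ip]
        while window and window[0] <= now - self.window_seconds:
            window.popleft()               # forget messages outside the window
        if len(window) >= self.max_messages:
            self.add_penalty(ip, self.frequency_penalty, now)
            return False                   # too many messages in a short period
        window.append(now)
        return True


shaper = TrafficShaper(max_messages=3, window_seconds=60, penalty_limit=100,
                       penalty_ttl=3600, frequency_penalty=50)
print([shaper.allow("192.0.2.1", t) for t in (0, 1, 2, 3)])
```

In this sketch the fourth message within the window is refused and the source receives a frequency penalty; once the accumulated penalties reach the limit, every connection from that source is refused until the record expires.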
Table 4.1 below shows the detailed outcome of the test using ASPBF as the filter:

Table 4.1: Result of ASPBF

    Filter                  Total
    Traffic Shaping         8,182
    Sender Reputation       9,142 (scoring only)
    SPF                     540
    DNSBL                   17,772
    Recipient Validation    41,898
    Total                   68,392 (without Sender Reputation)

In the simulation, however, Sender Reputation was not used to block any connection; it only contributed a score to the message, which was later used in identifying spam. Out of a total of 90,807 connections, 68,392 were dropped or blocked. That means about 75% of connections were dropped and not processed for their contents. Leaving Sender Reputation out of consideration, Figure 4.6 shows the effectiveness of the blocking:

Figure 4.6: Result of Filter with ASPBF

Figure 4.6 shows that ASPBF allows only 25% of total connections to go through with the full body content of the messages. This also shows that whatever filters are placed behind ASPBF will require much less processing power than filters without ASPBF.

The test with Bayesian without ASPBF is described next. The test was conducted on a different machine. The filters scanned only messages smaller than 128 KB; larger e-mails passed through without any processing.

Table 4.2: Result of Antispam without ASPBF

    Filter                  Total
    Pass                    55,979
    No Processing           2,250
    Recipient Validation    26,112
    Spam (Bayesian)         42,238
    Spam (Content)          18,775
    Total                   145,354

Figure 4.7: Result of Filter Without ASPBF

As Figure 4.7 shows, although the filter scanned only e-mails smaller than 128 KB, the e-mails larger than that made up only about 1.5% of total connections. About 40% of e-mails passed the filter, and 29% were filtered by Bayesian. Although only 29% was filtered by Bayesian, the filter was actually scanning all incoming messages.
Therefore, the filter using the Bayesian technique actually worked on 83% of total connections.

4.3. Analysis

Figure 4.1 also shows that ASPBF was able to identify spam connections that could be dropped or rejected by detecting the pattern of the incoming traffic, leaving only legitimate connections to proceed with sending messages. Although ASPBF filtered about 75% of connections, recipient validation was found to be an important component, as it contributed about 61% of the blocked connections (41,898 of 68,392). Traffic shaping also received input from this component in order to filter IP addresses that were carrying out a directory harvesting attack, an attack that aims to obtain a list of the users available on the local e-mail server.

Sender Policy Framework (SPF) is a new method and is not yet implemented by most e-mail providers, which is why Figure 4.6 shows that only about 1% of the traffic could be analysed for SPF records. Although SPF could be left out of the filter, the result also shows that some e-mail providers do use it to protect their e-mail servers and to prevent forged e-mails.

The DNS Blacklist (DNSBL), on the other hand, clearly showed that it was able to filter spam sources. Although the accuracy of the data may be questioned, the data is collected from various sources such as individuals who receive spam, spam honeypots and others. In other words, some of the data is accurate, while some sources are blacklisted because they have previously sent spam. Limiting connections from those blacklisted IP addresses is therefore a preventive measure in blocking spam.

Meanwhile, Figure 4.7 shows that recipient validation plays an important role even though it blocked only about 17% of total connections. The rest of the connections were scanned by the Content and Bayesian filters, with only about 29% detected by Bayesian.

4.4.
Conclusions The test has given a result that shows the importance of each component of ASPBF and their contribution towards the success of ASPBF. Nevertheless, Bayesian filter is expected to require less processing load as the unnecessary traffics are dropped or rejected before they are fed to Bayesian filter for scanning. CHAPTER 5 DISCUSSION The development of ASPBF and its simulation have proven the theory to improve antispam filter. However, there are a lot more angle can be taken into consideration to further its effectiveness and robustness 5.1. Introduction This chapter is discussing the potential of ASPBF as a compliment component to existing antispam solution for a better of e-mail communication. 5.2. Discussion From the result, we can see there was only 25% of the traffic would be accepted and processed by Bayesian filter. This had shown by implementing ASPBF, it would help the other filters to work more efficiently and save processing load from the servers. However, if the antispam is only working at content level, the server might suffer the processing load due to each scanning and analysis will be required for each e-mail content. In addition, the server will be congested with concurrent connections due to its limitation in handling concurrent connections. This will end up with rejection to other incoming connections and might affect connections without spam. 69 It was proven that ASPBF was improving overall antispam process and functionalities, its existence compliments the other antispam filters. It also can run on its own and yet able to reduce spam from the network, eventhough it is not good to end users who might still receive spam due to no content filters. Figure 5.1 below show the potential outcome of the combination of ASPBF and Bayesian, taking into consideration that all connections that pass through ASPBF still contain about 40% spam. 
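The arithmetic behind this expected outcome can be sketched as follows. This is only an illustrative calculation, not part of the ASPBF implementation: the connection counts are taken from Table 4.1, the 40% figure is read here as the share of spam among the connections that ASPBF lets through, and the function name `expected_outcome` is a hypothetical helper introduced for this example.

```python
# Connection counts reported in Table 4.1 (Sender Reputation is excluded
# because it only contributed a score and did not block connections).
TOTAL_CONNECTIONS = 90807
BLOCKED_BY_ASPBF = {
    "Traffic Shaping": 8182,
    "SPF": 540,
    "DNSBL": 17772,
    "Recipient Validation": 41898,
}

# Assumption taken from the discussion: about 40% of the connections that
# pass ASPBF still carry spam and are left for the content/Bayesian filters.
SPAM_SHARE_AFTER_ASPBF = 0.40


def expected_outcome(total, blocked, spam_share):
    """Return each outcome as a fraction of all incoming connections."""
    blocked_total = sum(blocked.values())
    passed = total - blocked_total
    spam_left = passed * spam_share
    legitimate = passed - spam_left
    return {
        "blocked_by_aspbf": blocked_total / total,
        "spam_for_content_filters": spam_left / total,
        "legitimate": legitimate / total,
    }


if __name__ == "__main__":
    outcome = expected_outcome(
        TOTAL_CONNECTIONS, BLOCKED_BY_ASPBF, SPAM_SHARE_AFTER_ASPBF
    )
    for label, share in outcome.items():
        print(f"{label}: {share:.1%}")
```

Under these assumptions, roughly three quarters of connections are rejected at network level, about a tenth would still have to be caught by the content and Bayesian filters, and the remainder is legitimate traffic.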
Figure 5.1: Expected outcome from the combination of ASPBF and Bayesian

Figure 5.1 shows that if ASPBF is combined with other antispam filters, such as content and Bayesian filters, it will help to eliminate more spam and let through only a small number of e-mails, which helps to ensure that the existing e-mail storage is used for legitimate e-mails. As spam e-mails were also terminated at network level, congestion in the network hosting the e-mail servers was reduced, so the e-mail servers could accept more legitimate e-mails and avoid delaying them. Receiving fewer e-mails also saves the processing load needed to scan the content of each e-mail. All of these show that the overall process of scanning and eliminating spam has improved.

5.3. Conclusion

The result shown in this chapter has proven that ASPBF has met its objectives. Although combining ASPBF and the Bayesian filter might give a better result, the performance and robustness of the application must be taken into consideration. The way the solution is developed should be studied further to arrive at a more robust and reliable solution.

CHAPTER 6

CONCLUSION

Antispam has the potential to be improved by adding more components to its main system, but choosing the right components is crucial to the success of the solution. ASPBF has shown the choices that can be made when focusing only on network-level protection.

6.1. Introduction

This chapter concludes the study by showing the right direction for antispam in the future and the alternatives currently available for improving it.

6.2. Concluding Remarks

Spam is still evolving, and so are antispam solutions, but the two have different agendas: one is to reach users, the other is to prevent users from receiving spam. This competition will never end and will never satisfy everyone.
Spammers will also study the evolution of antispam solutions in order to understand them and to avoid detection and blocking. Antispam developers, on the other hand, will work on existing spam techniques in order to improve the filter to cater for the latest methods. The development of antispam is therefore largely a reactive measure.

However, it is hard to anticipate the next generation of spamming techniques, as those are created based on the weaknesses of antispam. Continuous effort in improving antispam is therefore the only way to fight spam, even though it can only reduce spam rather than eliminate it totally.

6.3. Future Work

ASPBF has the potential to be improved further, especially by adding more techniques and filters to scan an e-mail or a message at header level. The objective of ASPBF is to filter at network level, but filtering messages at their header will also help in identifying untrusted sources. This is another complementary component to ASPBF as well as to the Bayesian filter, but it requires further study to ensure that it meets its objective without jeopardizing performance. Here are several examples of techniques that can be used to filter more connections by scanning the header:

i. Validating the HELO command

The HELO or EHLO command is a common handshake mechanism performed between the sending and receiving servers. A genuine e-mail server provides its genuine hostname in the command, while a spammer's server will typically not send its genuine hostname, in order to avoid being recognized. Taking advantage of this command would therefore lead to a better antispam.

ii. Validating the Message-ID

Each message is tagged with a Message-ID by the sending server for recording and tracking purposes. A spammer, however, will normally use modified e-mail software to send spam directly to the receiving servers, bypassing the sending servers.
Thus, a legitimate e-mail will have a Message-ID in a common format, while spam will either lack one or carry one in the wrong format.

iii. Validating the forged HELO

Another way to provide a hostname in the HELO command is to use the receiving server's domain name as the hostname. This can be considered forging the domain, so detecting the forged HELO can be considered another solution.

iv. DKIM verification

DomainKeys Identified Mail (DKIM) is another way to sign an e-mail and verify that it has come from a trusted source. This method is new but is already being implemented by Yahoo! and Google. Further study is required to make use of this information to verify the genuineness of the message and its source.

These techniques work at the header level of a message, after it has passed all filters at network level but before the full content is accepted. If the header fails the test, the connection is dropped and no data is received, so there is no need for processing by the Content and Bayesian filters.

Although adding more components to the filter will improve the overall process, one must bear in mind that fighting spam is a continuous effort. An obstacle should not stop the effort but should be considered a motivation for a better antispam in the future.
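Three of the header-level checks proposed in Section 6.3 (HELO validation, forged-HELO detection and Message-ID validation) could be sketched as follows. This is a minimal illustration under stated assumptions, not the ASPBF implementation: the regular expressions are deliberately simplified, the function names and the `local_domains` parameter are hypothetical, and DKIM verification is left out because it requires DNS lookups and signature checking.

```python
import re

# Simplified pattern for a fully qualified hostname (an assumption for this
# sketch). It rejects unqualified names such as "friend" and also rejects
# address literals such as "[192.0.2.1]", which SMTP itself permits.
HOSTNAME_RE = re.compile(
    r"^([A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}$"
)

# Simplified pattern for the common "<local-part@domain>" Message-ID shape.
MESSAGE_ID_RE = re.compile(r"^<[^<>@\s]+@[^<>@\s]+>$")


def helo_looks_valid(helo_name):
    """Check i: the HELO/EHLO argument looks like a real hostname."""
    return bool(HOSTNAME_RE.match(helo_name.strip()))


def helo_is_forged(helo_name, local_domains):
    """Check iii: the sender presents one of the receiver's own domains."""
    name = helo_name.strip().lower().rstrip(".")
    return name in {domain.lower() for domain in local_domains}


def message_id_looks_valid(message_id):
    """Check ii: a Message-ID header is present and in the common format."""
    if message_id is None:
        return False
    return bool(MESSAGE_ID_RE.match(message_id.strip()))


def header_verdict(helo_name, message_id, local_domains):
    """Return the list of reasons to drop the connection (empty if none)."""
    reasons = []
    if not helo_looks_valid(helo_name):
        reasons.append("invalid HELO hostname")
    if helo_is_forged(helo_name, local_domains):
        reasons.append("HELO uses a local domain")
    if not message_id_looks_valid(message_id):
        reasons.append("missing or malformed Message-ID")
    return reasons


if __name__ == "__main__":
    local = {"example.com"}
    print(header_verdict("mail.example.org", "<abc123@mail.example.org>", local))
    print(header_verdict("example.com", None, local))
```

Returning the list of failed checks, rather than a single boolean, leaves the decision to the caller: the reasons could be used to drop the connection outright or only to add to a message score, in the same way that Sender Reputation was used for scoring in the simulation.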