Auto-learning of SMTP TCP Transport-Layer Features for Spam and Abusive Message Detection

Georgios Kakavelakis, Naval Postgraduate School
Robert Beverly, Naval Postgraduate School
Joel Young, Naval Postgraduate School

Abstract

Botnets are a significant source of abusive messaging (spam, phishing, etc.) and other types of malicious traffic. A promising approach to help mitigate botnet-generated traffic is signal analysis of transport-layer (i.e. TCP/IP) characteristics, e.g. timing, packet reordering, congestion, and flow control. Prior work [4] shows that machine learning analysis of such traffic features on an SMTP MTA can accurately differentiate between botnet and legitimate sources. We make two contributions toward the real-world deployment of such techniques: i) an architecture for real-time, on-line operation; and ii) auto-learning of the unsupervised model across different environments without human labeling (i.e. training). We present a "SpamFlow" SpamAssassin plugin and the requisite auxiliary daemons to integrate transport-layer signal analysis with a popular open-source spam filter. Using our system, we detail results from a production deployment where our auto-learning technique achieves better than 95 percent accuracy, precision, and recall after reception of ≈1,000 emails.

1 Introduction

"Botnets" are distributed collections of compromised networked machines under common control [7]. Automated methods scan, infect, or socially engineer vulnerable hosts in order to incorporate them into the botnet. Botnets provide a formidable computing and communication platform by harnessing the power of thousands, or even millions, of nodes for a common collective purpose [21]. Unfortunately, that purpose is often malicious and economically or politically motivated. As one common use scenario, botnets account for more than 85 percent of all abusive electronic mail (including spam, phishing, malware, etc.) by one estimate [14].

Botnet-based spamming campaigns are large and long-lived [20], with more than 340,000 botnet hosts involved in nearly 8,000 campaigns in one study [27]. The Messaging Anti-Abuse Working Group (MAAWG) coalition of service providers reported that across 500M monitored mailboxes in one quarter of 2007, 75 percent of all messages (almost 400 billion) were spam [18]. A subsequent 2010 MAAWG study reports the situation has worsened: abusive messages accounted for 89 percent of all electronic mail in a representative sample across many providers.

Abusive message traffic abounds on the Internet. This deluge of unwanted traffic is more than a mere nuisance: a broad survey of large service providers finds that abusive messages account for the largest fraction of expended operational resources [1]. Despite extensive research and operational deployments, attackers and attacks have evolved at a rate faster than the Internet's ability to defend. There remains ample room for improvement of in-production botnet attribution and mitigation.

One promising approach for mitigating botnet-generated abusive messaging is statistical traffic analysis. Prior work [4] shows that by using transport-layer traffic features, e.g. TCP retransmits, out-of-order packets, delay, jitter, etc., one can reliably infer whether the source of an email SMTP [16] flow is legitimate or originating from a member of a botnet. Botnets must send large volumes of abusive messages to remain financially viable. Because bots are frequently attached via asymmetric (low upload bandwidth) residential connections, they necessarily congest their local uplink – an effect that is remotely detectable. Perhaps most importantly, transport-layer classifiers are content (e.g. the words of the message itself) and IP reputation (e.g. blocklist) agnostic, facilitating privacy-preserving deployment even within the network core.
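The congestion signals above are measurable from packet headers alone. As a toy illustration (not the SpamFlow implementation; the function name and record layout are our own), two such features, retransmission count and RTT samples, can be derived from a simplified record of one TCP flow:

```python
# Toy sketch: deriving two transport-layer features -- retransmission count
# and RTT samples -- from a simplified record of one side of a TCP flow.
# Illustration only; a production measurer would additionally apply Karn's
# algorithm to discard RTT samples for retransmitted segments.

def flow_features(data_pkts, ack_times):
    """data_pkts: [(send_time, seq)] observed from the sender.
    ack_times: {seq: time_first_acked} for the same flow.
    Returns (retransmit_count, rtt_samples)."""
    first_tx = {}
    retransmits = 0
    rtts = []
    for t, seq in data_pkts:
        if seq in first_tx:
            retransmits += 1            # same sequence number sent again
        else:
            first_tx[seq] = t
            if seq in ack_times:        # RTT sample from first transmission
                rtts.append(ack_times[seq] - t)
    return retransmits, rtts

# A congested, bot-like flow exhibits retransmissions and inflated RTTs.
r, rtts = flow_features(
    data_pkts=[(0.00, 1), (0.10, 2), (0.40, 2), (0.55, 3)],
    ack_times={1: 0.15, 2: 0.52, 3: 0.71})
```

A classifier never sees payload bytes here: only timing and sequencing metadata, which is what makes the approach content and reputation agnostic.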
Deployed on individual Mail Transport Agents (MTAs), such techniques can permit early rejection of messages before application delivery, significantly reducing system load. Thus far, research in transport-layer classification has been offline, where experimental data is examined a posteriori. In this paper, we present the system engineering efforts required to integrate TCP transport features into the classification decisions of the popular open-source SpamAssassin [17] spam filter. A crucial obstacle to realizing such techniques is the ability to adequately train and build a model of normal and abusive traffic across a variety of operational environments. Rather than requiring human labeling or overly general models to build ground truth, we exploit the auto-learning functionality of SpamAssassin. Our primary contributions include:

1. On-line and real-time transport-layer classification of live email messages on a production MTA.

2. Auto-learning of transport features to automatically learn the unsupervised model across different operating environments without human training.

The remainder of this paper describes related work (§2). From this foundation, we describe our system architecture and testing methodology (§3). We present production deployment results in §4 and discuss their implications (§5). We conclude by outlining future work.

2 Related Work

Recent research efforts have shown great promise in understanding the character and behavior of botnets. While these proposed solutions are currently effective, they frequently rely on brittle heuristics and unreliable indicators. For instance, Xie et al. provide a system [27] to identify and characterize botnets using an automatic technique based on discerning spam URLs in email. Other research relies on IP addresses as indicators [29]. However, malicious botnet IP addresses are highly dynamic as new hosts are compromised, existing hosts receive new DHCP leases, or sources are spoofed [3]. Indeed, "fresh" IP addresses, i.e. those not in real-time blocklists, are a valuable commodity. Similarly, DNS is a poor identifier of malicious hosts given the prevalence of botnets employing DNS fast-flux [5] techniques to distribute load among redirectors, survive node failures, and obfuscate back-end hosting infrastructure.

A large body of work examines network-layer (IP) properties of botnets. Ramachandran et al. [22] characterize spamming behavior by correlating data collected from three sources: a sinkhole, a large e-mail provider, and the command and control of a Bobax botnet. By focusing on network-level properties including: i) the IP address space from which spam originates; ii) the autonomous system (AS) that sent spam messages to their sinkhole; and iii) BGP route announcements, they show that spam and legitimate e-mail originate from the same portion of the IP address space. Thus, IP addresses are not a reliable indicator of malicious or abusive nodes.

Subsequent work from Hao et al. [11] demonstrates that the AS alone as a feature may cause a large rate of false positives. They achieve better results by extracting lightweight features from network-level properties such as geodesic distance between sender and receiver, sender IP neighborhood density, probability ratio of spam to ham at the time of day the message arrives, the AS number of the sender, and the status of open ports on the sender machine. Further studies [15, 28] have shown that a spammer can evade such techniques by advertising routes from a forged AS number [11].

Schatzmann et al. [24] similarly focus on network-level characteristics of spammers, but from the perspective of an AS or service provider. Their idea is to passively collect the aggregate decisions of a large number of e-mail servers that perform some level of prefiltering (e.g. blocklisting). Using passive flow collection to gather byte, packet, and packet size counts, this aggregated knowledge can enhance spam mitigation.

Commercial vendors expend considerable effort dividing the Internet IP address space into regions, with particular attention given to identifying residential broadband addresses. By discriminating against residential hosts, the hope is to block traffic from nodes that should not be sourcing email in the first place. This approach is brittle and raises architectural misgivings by arbitrarily discriminating against classes of users without prior provocation. Such residential blocking may have implications on notions of network neutrality as neutrality legislation catches up with technology.

In contrast to these spam detection and mitigation techniques, Beverly and Sollins [4] present a content and IP reputation agnostic scheme based on statistical signal analysis of the transport (TCP) traffic stream. The premise is that spammers must send large volumes of email to be effective, causing constituent network links to experience contention and congestion. Such congestion effects are particularly prominent for many botnet hosts which reside on residential broadband connections where there are large gateway buffers [12] and asymmetric bandwidth. Transport-layer properties such as the number of lost segments and round-trip time (RTT) therefore exhibit different distributions, permitting discrimination between spam and legitimate behavior. Among many TCP features, their analysis found that RTT and minimum congestion window are the most discriminatory. This transport-only classifier exhibits more than 90 percent accuracy and precision on their data.

Follow-on work to [4] explores similar ideas, including the use of lightweight single-TCP/SYN passive operating system signatures at the router level [10]. Ouyang et al. [19] conduct a large-scale empirical analysis of transport-layer characteristics on over 600,000 messages. Among tested features, their analysis similarly finds the three-way-handshake latency, time-to-live (TTL), and inter-packet idle time and variance most discriminating for ham versus spam. These features remain stable over time, yielding 85-92 percent classification accuracy.

Based on the encouraging results of this body of prior work, we endeavor to take a step toward the real-world deployment of transport-classifier-based botnet detection and abusive traffic mitigation techniques.

3 System Architecture

Figure 1: SpamFlow system architecture: transport-layer features are aggregated on a per-flow basis. The SpamFlow SpamAssassin plugin uses XML-RPC to obtain each message's feature vector which is then sent to the classifier. Predictions are relayed to the plugin and integrated into the final SpamAssassin message score.

The TCP/IP network stack logically divides functionality between layers. As a result, applications do not normally have access to lower-layer features. For example, TCP (implemented in the kernel or lower) provides an abstraction of a reliable and in-order data stream to the application via a socket interface. Applications are removed from the details of packet arrival timing, ordering, TTL, etc. Thus, our design must collect, on a per-message basis, transport-layer traffic characteristics and expose them up the stack to the SpamFlow (SF) plugin. This section describes our system architecture and the interaction between various components: SpamAssassin, SpamFlow, and the SpamFlow plugin.
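The collection step can be pictured as a flow table keyed by the remote (IP, port) pair, the same identifier used later to correlate flows with messages. A minimal sketch follows; the field names and the particular counters are ours for illustration, not SpamFlow's actual feature set:

```python
# Minimal sketch of per-flow aggregation: captured packets are grouped by
# the remote (IP, port) tuple and summarized into a feature dictionary.
# Field names and counters are illustrative, not SpamFlow's feature set.
from collections import defaultdict

def aggregate(packets):
    """packets: iterable of dicts with src_ip, src_port, ts, seq.
    Returns {(src_ip, src_port): feature_dict}."""
    flows = defaultdict(list)
    for p in packets:
        flows[(p["src_ip"], p["src_port"])].append(p)
    features = {}
    for key, pkts in flows.items():
        times = [p["ts"] for p in pkts]
        seqs = [p["seq"] for p in pkts]
        features[key] = {
            "pkt_count": len(pkts),
            "duration": max(times) - min(times),
            # repeated sequence numbers serve as a retransmission proxy
            "rexmits": len(seqs) - len(set(seqs)),
        }
    return features

feats = aggregate([
    {"src_ip": "198.51.100.7", "src_port": 40001, "ts": 0.0, "seq": 1},
    {"src_ip": "198.51.100.7", "src_port": 40001, "ts": 0.2, "seq": 1},
    {"src_ip": "198.51.100.7", "src_port": 40001, "ts": 0.5, "seq": 2},
])
```

Keying on the remote tuple is what lets the plugin later ask "give me the features for the flow that delivered this message" without any payload reassembly.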
3.1 Overview

We start with an overview of our SpamFlow system architecture, shown in Figure 1. For clarity of exposition, we describe all functionality as being co-located with the Mail Transport Agent (MTA); however, the components can easily be distributed across different machines. The system is comprised of four main components: SpamAssassin, the SpamFlow traffic feature extraction engine, the SpamFlow plugin, and the classification software – referred to as SpamAssassin, SpamFlow, SF plugin, and classifier respectively. Every message received by the MTA is processed by SpamAssassin and then piped to the plugin. Simultaneously, SpamFlow continuously and promiscuously listens on the network interface, capturing SMTP packets via the pcap API [13], aggregating packets into flows, and computing the relevant traffic statistics (e.g. TCP retransmits, out-of-order packets, delay, jitter, etc.). The plugin queries SpamFlow with the message's identifier in order to retrieve the flow-level transport features corresponding to that message. Next, the plugin sends the message's transport feature vector to the classifier. In response, the classifier returns a binary or probabilistic prediction (depending on the classifier employed) that then influences the final score of the message, and hence the final disposition. We describe each component in more detail in the following subsections.

3.2 SpamAssassin

SpamAssassin [17] is an open-source, rule- and content-learning-based spam filter. Each rule is assigned a weight by a perceptron algorithm [25] and then the weighted scores are summed to produce an overall score for each message. The classification process involves comparing the overall score with a user-defined threshold t (which defaults to a value that maximized performance on a broadly representative training sample). If the score is above t, then the message is classified as spam; otherwise, as legitimate. SpamAssassin is modular and extensible for adding other filtering techniques. Popular plugins include real-time block lists (RBLs), domain-keys, permit lists, collaborative filtering, learning-based techniques (e.g. naïve Bayes), and others.

Furthermore, SpamAssassin features a threshold-based mode in which new exemplar emails trigger an automatic retraining process. While SpamAssassin refers to this retraining as "auto-learning," this is typically known as "online" or "iterative" learning in machine learning. The primary difference is that advanced iterative learning approaches modify the classification model to account for new emails, whereas in auto-learning the entire model is rebuilt each time. In SpamAssassin auto-learning, a previously unseen message is used to retrain the model if it receives a score greater than τ+ (assumed spam) or less than τ− (assumed non-spam). For example, when a message exceeds these threshold values, SpamAssassin rebuilds the model of the built-in naïve Bayes classifier, and classifies subsequent messages with the newly updated model.

3.3 SpamFlow

SpamFlow [4] is our network analyzer. Using libpcap [13], SpamFlow promiscuously listens on the network interface and builds source host/port flows (the destination MTA address is constant and known and thus not part of the flow tuple). As SMTP flows complete, either via an explicit TCP termination handshake or via timeout, SpamFlow extracts transport-layer features for each as detailed extensively in [4]. SpamFlow listens for XML queries for a particular flow's IP and port, responding in kind with the features for that flow.

We explored two options for uniquely identifying messages to correlate between messages and their constituent flow data. First, every message contains a unique message string ("Message-ID" in the header) [23] to facilitate replies, threading, etc. Using deep packet inspection, SpamFlow could reassemble email messages from the packet payloads to uniquely identify each flow by Message-ID. The immediate downside to using the message identification field is that doing so removes the benefit of only examining packet header statistics: namely privacy and efficiency. Instead, we opt to follow a simpler approach and use the remote host IP address and ephemeral port number as the message identifier. These fields are readily available without any transport reassembly and are, in general, unique. Naturally, IP address and port tuples are reused (there is a maximum of only 2^16 unique TCP client-side ephemeral ports). For a tuple collision to occur in SpamFlow, two identical flows must arrive within less time than the messages can be delivered to the MTA and processed by SpamAssassin, i.e. on the order of a few seconds. Not only is this in violation of the TCP time-wait procedure, we do not observe any duplicate flows within such short time periods in our empirical data.

The final detail is how to expose the message identifier to the plugin so it can query SpamFlow. We modify our MTA server to add the (IP address, TCP port) identification tuple of the remote MTA to the header of each incoming e-mail. The actual MTA code modifications are small and straightforward. For reference we provide the code changes for the popular Postfix and qmail MTAs in Figures 2 and 3.

--- src/smtpd/smtpd.c.orig
+++ src/smtpd/smtpd.c
@@ -2807,9 +2807,9 @@
      */
     if (!proxy || state->xforward.flags == 0) {
         out_fprintf(out_stream, REC_TYPE_NORM,
-                    "Received: from %s (%s [%s])",
+                    "Received: from %s (%s [%s:%s])",
                     state->helo_name ? state->helo_name : state->name,
-                    state->name, state->rfc_addr);
+                    state->name, state->rfc_addr, state->port);

Figure 2: Postfix modification to support traffic identifiers

--- received.c.orig
+++ received.c
@@ -44,2 +44,3 @@
 char *remoteip;
+char *remoteport;
 char *remotehost;
@@ -63,2 +64,5 @@
 safeput(qqt,remoteip);
+remoteport = getenv("TCPREMOTEPORT");
+qmail_puts(qqt,":");
+safeput(qqt,remoteport);
 qmail_puts(qqt,")\n by ");

Figure 3: qmail modification to support traffic identifiers

3.4 SpamFlow Plugin

SpamFlow does not operate as a standalone MTA or spam classifier. Therefore, we integrate it with an existing one. We select SpamAssassin [17] because it is open source and widely used; for instance, the commercial Barracuda [2] network appliance is based on SpamAssassin. Importantly, SpamAssassin employs a modular architecture that allows developers to extend its functionality through plugins. As SpamAssassin is written in Perl, we develop a small, lightweight SpamAssassin Perl plugin tying the various components of Figure 1 together. In real time, as e-mail messages are routed through the SpamFlow plugin, it scores them using a previously learned model of transport features. This score, in combination with the scores from other rules, provides a final message disposition.

The plugin acts as the controller of the system and binds the traffic analysis engine and the classifier together. First, the plugin provides SpamFlow with the 2-tuple identifier of the message under inspection and receives in return the corresponding message's transport-layer features. After obtaining the features, the plugin passes them to a logically distinct machine learning classifier and retrieves the corresponding prediction. Figure 4 shows an example where the MTA added the message identifier (here, 77.239.18.226:37689) and the plugin attached SpamFlow's transport feature vector to the message's headers.

Between components, we use XML-RPC [26] to communicate. XML-RPC is a simple protocol that allows communication between procedures running in different applications or machines. Specifically, the client uses an HTTP POST request to pass data to the server; the server in return sends an HTTP response.
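The exchange can be sketched end-to-end with the Python standard library's XML-RPC modules. The classify procedure name and the comma-separated feature string follow the design described here; the threshold-sum decision rule is a placeholder of ours, not the real classifier:

```python
# Sketch of the plugin <-> classifier XML-RPC exchange. The "classify"
# procedure and CSV feature string mirror the design in the text; the
# sum-threshold decision rule is a stand-in, not the actual classifier.
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

def classify(csv_features):
    """Parse the comma-separated feature vector; return a prediction."""
    features = [float(f) for f in csv_features.split(",")]
    return "spam" if sum(features) > 1.0 else "ham"  # placeholder rule

# Classifier daemon side: register the procedure and serve one request.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(classify, "classify")
port = server.server_address[1]
threading.Thread(target=server.handle_request, daemon=True).start()

# Plugin side: an HTTP POST carrying the feature string.
proxy = xmlrpc.client.ServerProxy("http://127.0.0.1:%d" % port)
prediction = proxy.classify("0.9,0.4,0.1")
server.server_close()
```

Because the transport is plain HTTP, the classifier daemon can live on another machine and serve many MTAs, which is exactly the distribution property the architecture exploits.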
In our implementation, we register the classifier with a classify procedure that takes as input the features. Thus, the plugin sends the HTTP POST request with the name of the classify procedure along with the features, as comma-separated values forming a string (the CSV string is used for expediency; in the future, we plan to use individual XML identifiers for each feature), and receives via HTTP response the classification prediction from the classifier. Not only is XML-RPC simple and standardized, it allows the classifier to potentially operate on a different machine from SpamFlow, which in the future could allow the XML-RPC classifier to serve many SpamFlow instances in a multi-threaded fashion and distribute load. Further, all popular programming languages provide XML-RPC APIs, notably allowing us to use our language of choice for the various tasks. In our specific implementation, we develop SpamFlow in C++ while the classifier is a Python daemon.

From Josephine@rsi.com Tue Feb 01 23:21:58 2011
Return-Path: <Josephine@rsi.com>
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on ralph.rbeverly.net
X-Spam-Level: ******
X-Spam-Status: Yes, score=6.9 required=5.0 tests=BAYES_50,RCVD_IN_XBL,HTML_MESSAGE,SPAMFLOW,UNPARSEABLE_RELAY autolearn=no version=3.3.1
X-Spam-Spamflow-Tag: 3792891725:37689,12,10,0,0,0,0,1,1,0,53248,34.464852,0.162818,120.441156,148.297699,51.891697,5840,48,1,64
Received: (qmail 30920 invoked from network); 1 Feb 2011 23:21:57 -0000
Received: from cm-static-18-226.telekabel.ba (77.239.18.226:37689)
Received: from vdhvjcvivjvbwyhscvfwq (192.168.1.185) by bluebellgroup.com (77.239.18.226) with Microsoft SMTP
Message-ID: <4D489025.504060@etisbew.com>
Date: Wed, 2 Feb 2011 00:20:48 +0100
From: Essie <Essie@hermes.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12)

Figure 4: Example email message headers with transport features added by the SpamFlow system

3.5 Classification Engine

The final component of the system architecture is the traffic classification engine, which we implement using the open-source Orange [9] machine learning and data mining Python package. While the details of the machine learning algorithms are out of scope for this paper, we note that Orange includes a variety of algorithms and statistical modules for performance evaluation. Our classifier implementation experiments with three machine-learning algorithms: naïve Bayes, decision trees (specifically, the C4.5 algorithm), and support vector machines (SVM). These three algorithms are broadly representative of different classes of learning strategies and allow us to evaluate system classification performance, generality, and speed.

4 Results

This section first describes results from load testing the SpamFlow system in a controlled laboratory environment in order to understand its practical feasibility. We then detail performance results using auto-learning of transport features in a live production environment.

4.1 Load Testing

To understand the system-level performance of our SpamFlow design as outlined in §3, we create the controlled testing environment depicted in Figure 5. One host runs the SpamFlow system and is physically connected to a second traffic-sourcing host. The traffic-sourcing host implements our custom e-mail "replayer" application and a modified Dummynet [6] network emulator.

Figure 5: Laboratory testing environment: enabling tightly controlled and easily configurable repeatable experiments. The replayer application replays an email corpus while Dummynet emulates different network behavior to mimic botnet and legitimate traffic. Using the testbed, we load test and debug SpamFlow.

The replayer reads from the TREC public email corpus [8] of 92,187 messages, of which 52,788 are spam and 39,399 are legitimate. For each message, the replayer: 1) extracts the headers and adds as recipient a valid user of our virtual-network domain; 2) establishes an SMTP session with the MTA (Postfix) of the SpamFlow system under test; 3) sets the differentiated services code point (DSCP) in the IP header of each message according to the ground-truth label (spam or ham); 4) uses the standard SMTP protocol to transmit the message.

We set the DSCP differently for spam and non-spam messages in order to influence the emulated network behavior. Our goal is to coarsely simulate the characteristics that botnet-generated spam traffic exhibits, such as TCP timeouts, retransmissions, resets, and highly variable RTT estimates. For our evaluation, we select Dummynet [6], a publicly available tool that enables introduction of delay, loss, bandwidth and queuing constraints, etc. for packets passing through virtual network links. In our testing setup, Dummynet applies different queuing, scheduling, bandwidth, delay, loss, etc. depending on the DSCP bits, which correspond to email type. Dummynet emulates only a fixed propagation delay. We therefore modify it to generate random delays drawn from a normal distribution with a mean delay of µ = 150ms and σ = 50ms standard deviation for spam traffic that originates from the replayer, and a µ = 40ms delay with σ = 25ms for legitimate traffic in both directions. We introduce delay in legitimate traffic in order to avoid overfitting our learned statistical model. These delays need not be precise as they are intended to merely mimic a congested environment. To emulate timeouts, retransmissions, and resets, we apply a random-packet-drop policy on the Dummynet pipe. Note that we disable all SpamAssassin rules requiring network access, e.g. real-time blocklists, as such rules are dynamic and thus sensitive to dates and time.

While we recognize that our modifications to Dummynet only partially emulate a congested network (for example, loss events are independent – an assumption that does not hold true in a real queue), our goal in the emulation environment is to enable reproducible testing. Thus, we use the environment to emulate high-rate traffic and evaluate performance, throughput, system load, etc. on representative traffic. Section 4.3 goes on to detail real-world performance on live production traffic.

Table 1: SpamFlow training time (sec) as a function of classifier type and sample size

                        Training Samples
  Classifier       10     100    1000    10,000
  Naive Bayes    0.88   15.02  105.45    104.84
  C4.5           0.15    0.96   16.02     29.80
  SVM            0.72   12.69  224.25    260.02

Table 1 shows the performance of the three classifiers with respect to training time. C4.5 has the smallest training time. SVM, on the other hand, has the largest training time, due to the more complex decision model. We then examine throughput: the rate at which the system is able to classify and process emails from the replayer. Naïve Bayes, C4.5, and SVM achieve 1,300, 1,000, and 700 messages per second throughput respectively in our environment. Naïve Bayes provides the highest throughput, likely due to its simple decision rule. Many factors impact throughput; our intent is to understand the relative performance of each classifier and to establish real-world feasibility. The takeaway from these measurements is that, given the relative independence of our system from the classification method, we can select the classification model that fits our needs. For example, the low training time of C4.5 makes it a good candidate when we need to retrain often.

4.2 Production Environment

Live testing is important because it reveals how the system interacts with possibly unknown features of the external environment. We deployed our system in a live environment at our university for a small domain from January 25, 2011 to March 2, 2011 and collected a trace of 5,926 e-mail messages. Ground truth was first established using an unmodified SpamAssassin version 3.3.1 instance without transport-layer traffic features, i.e. with only the default built-in rules and content analysis. We then manually examined all the legitimate ham messages and relabeled those that were false negatives. We manually sampled the spam messages to eliminate false positives and establish reasonable ground truth. While the volume of traffic captured is small, our intent in this experiment is to establish the ability to auto-learn the transport-layer features in a production environment and ascertain the resulting classification performance. We envision larger-scale, higher-volume live testing in the future.

Auto-learning is the incremental process of building the classification model based on exemplar e-mail messages whose scores exceed certain threshold values. In our case, we use features of e-mail messages otherwise classified via orthogonal methods as having very high or very low scores (for instance, those emails whose content triggers many of SpamAssassin's rule-based indicators). Specifically, we explicitly retrain the classifier's model each time a new message obtains an especially high or low score from the other SpamAssassin methods (rule- and Bayesian-word-based); i.e. a score above or below set thresholds. After retraining is complete, we evaluate performance iteratively on subsequent messages until a new message arrives with a score above or below the threshold, triggering retraining again.

Our threshold selection is based on empirical spam and ham SpamAssassin score distributions. Spam message scores follow a normal distribution with µ = 16.3 and σ = 7.7, whereas scores of legitimate messages have a mean of µ = 1.3, but are skewed left. We therefore first experiment with thresholds τ+ = 16 and τ− = 1, which allows the classifiers to be trained on an approximately even fraction of training and test examples: a total of 2,683/5,590 (48.0%) spam and
296/436 (67.9%) ham messages. We canonically call spam a "positive" and ham a "negative" to indicate disposition. Correct predictions result in either a true positive (tp) or true negative (tn). A spam message that is mispredicted as ham produces a false negative (fn), while a ham message misclassified as spam produces a false positive (fp). Note that false positives in email filtering are particularly expensive for users, as there is a high cost to missing or discarding legitimate messages.

As performance metrics, we consider accuracy, precision, recall, specificity, and F-score:

    accuracy    = (tp + tn) / (tp + fp + tn + fn)          (1)
    precision   = tp / (tp + fp)                           (2)
    recall      = tp / (tp + fn)                           (3)
    specificity = tn / (fp + tn)                           (4)
    F-score     = 2 * (precision * recall) / (precision + recall)   (5)

All of these metrics are important to consider to properly understand system performance. For instance, accuracy is misleading if the underlying class prior is heavily skewed: if 95% of the messages are in fact spam, then a deterministic classifier that always predicts "spam" will achieve a seemingly high 95% accuracy without any learning. Precision therefore measures, among messages predicted to be spam, the fraction that are truly spam. Recall measures the influence of misclassified spam messages, i.e. it is a metric of the classifier's ability to detect spam. Specificity, or true negative rate, measures how well the classifier differentiates between false positives and true negatives. Finally, because there is a natural tension between achieving high precision and high recall, a common summary metric is the F-score, which is simply the harmonic mean of precision and recall.

4.3 Production Testing

Figure 6 shows the classification performance metrics of the three classifiers we implement in SpamFlow as a function of cumulative training samples received. Figure 6 therefore depicts the classifiers' auto-learning over time as new exemplar training messages are received. Figure 6(a) displays cumulative accuracy for each classifier over time and includes the spam prior. The spam prior is simply the fraction of all training emails that are spam; a naïve classifier could simply predict the prior, so values above the prior indicate true learning. We observe both decision trees and SVMs providing greater than 95 percent accuracy. Figure 6(c) similarly shows decision trees and SVMs providing high F-scores, indicative of very good performance using only transport-layer features. Of note is that this level of performance is achieved after receiving only 100-200 messages. The weakness in SpamFlow using only traffic characteristics appears in the specificity, Figure 6(e), where false positives drive our best specificity down to approximately 75 percent.

To better understand the sensitivity of our auto-learning results to the imposed thresholds τ, we experiment with a spam threshold two deviations above the mean: τ+ = 30. By increasing the spam threshold, the SpamFlow auto-learning uses fewer spam-training examples; however, we expect higher confidence in their true disposition of spam with the higher threshold. Important to our evaluation, τ+ = 30 has the effect of balancing the training complexion so that there is not a strong class prior: 227 exemplar spam messages and 296 exemplar ham messages.

With the spam score threshold raised to τ+ = 30, Figure 6(b) shows that the spam prior is now close to 50 percent, removing any training class bias. SVM and naïve Bayes still achieve greater than 90 percent accuracy. Again, the auto-learning behavior is clearly working, with performance steadily increasing over time and greatly outperforming the spam prior. As with the lower threshold, Figure 6(d) demonstrates very high F-scores for all of the classifiers.

Figure 6(f) highlights the challenge in false positives. However, the most specific classifier, the decision tree algorithm, is also highly accurate and precise. With machine learning there is an inherent trade-off between achieving very high true positive rates and keeping false positive rates low. Our results demonstrate the best compromise with the higher auto-learning threshold and the use of decision trees.

Finally, we perform an initial investigation into whether the combined votes of SpamAssassin and SpamFlow lead to overall improved performance. We experiment with adding 0.2 (experiment 1) and with adding 1.0 (experiment 2) to the final score if SpamFlow predicts a spam message on the basis of transport traffic characteristics; otherwise, we subtract 1.0 from the final score. This crude weighting does not leverage SpamFlow's confidence in the prediction, and does not properly weight the vote in accordance with SpamAssassin's other rules. We leave complete integration of SpamFlow's predictions with SpamAssassin's voting as future work.

Table 2 shows the confusion data for SpamAssassin alone, SpamFlow alone, and the combination. In the first combined vote, we achieve better performance with the same number of false positives. In the second combined vote, we achieve even better performance, but at the cost of additional false positives. In all cases, the combination increases the overall F-score.
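To make the preceding definitions concrete, the following minimal sketch (ours, not part of the SpamFlow distribution; all names and default values are illustrative) computes the five metrics of Equations (1)-(5) from confusion counts, applies the auto-learning exemplar thresholds τ+ and τ−, and mirrors the crude score-adjustment vote described above:

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion counts: spam is the 'positive' class, ham the 'negative'."""
    tp: int  # spam correctly classified as spam
    fp: int  # ham misclassified as spam (the expensive error)
    tn: int  # ham correctly classified as ham
    fn: int  # spam misclassified as ham

    def accuracy(self) -> float:           # Eq. (1)
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)

    def precision(self) -> float:          # Eq. (2)
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:             # Eq. (3)
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:        # Eq. (4): true negative rate
        return self.tn / (self.fp + self.tn)

    def f_score(self) -> float:            # Eq. (5): harmonic mean of P and R
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r)


def exemplar_label(sa_score: float, tau_plus: float = 30.0,
                   tau_minus: float = 1.0):
    """Auto-learning exemplar selection: only messages SpamAssassin scores
    confidently become transport-feature training examples."""
    if sa_score >= tau_plus:
        return "spam"
    if sa_score <= tau_minus:
        return "ham"
    return None  # ambiguous score: not used for training


def combined_score(sa_score: float, spamflow_predicts_spam: bool,
                   bonus: float = 1.0, penalty: float = 1.0) -> float:
    """Crude combined vote: nudge the final SpamAssassin score by a fixed
    amount based on SpamFlow's transport-layer prediction."""
    return sa_score + bonus if spamflow_predicts_spam else sa_score - penalty
```

For example, `Confusion(tp=8, fp=2, tn=6, fn=4)` yields accuracy 0.7, precision 0.8, specificity 0.75, and F-score 8/11 ≈ 0.727; experiment 1 above corresponds to `bonus=0.2` and experiment 2 to `bonus=1.0`.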
[Figure 6 appears here: six panels plotting classification performance against incoming email number (log scale) for the Naive Bayes, Decision Tree, and SVM classifiers: (a) Accuracy (τ+ = 16, τ− = 1), (b) Accuracy (τ+ = 30, τ− = 1), (c) F-Score (τ+ = 16, τ− = 1), (d) F-Score (τ+ = 30, τ− = 1), (e) Specificity (TNR) (τ+ = 16, τ− = 1), (f) Specificity (TNR) (τ+ = 30, τ− = 1); panels (a) and (b) also plot the spam prior.]

Figure 6: Auto-learning classification results for three SpamFlow classifiers on live production traffic as a function of cumulative exemplar training messages received.

Table 2: Confusion data comparing SpamAssassin performance with and without SpamFlow auto-learning

                     tp    fp   tn    fn   F-Score
  SpamAssassin      5288    3   137   87    0.991
  SpamFlow          5224   65    75  151    0.980
  SA+SpamFlow(1)    5299    3   137   76    0.992
  SA+SpamFlow(2)    5335   19   121   40    0.995

5 Discussion

Can spammers adapt and avoid a transport-based classification scheme? By exploiting one of the fundamental weaknesses of spammers, their need to send large volumes of spam on bandwidth-constrained links, we believe SpamFlow is difficult for spammers to evade. A spammer might send spam at a lower rate or upgrade their infrastructure in order to remove congestion effects from their flows. However, either strategy is likely to impose monetary and time costs on the spammer.

Of note is that our techniques work equally well in IPv6, as the TCP transport-layer characteristics SpamFlow relies on in IPv4 are the same in IPv6. The fact that SpamFlow is IP address agnostic suggests that it may be an even more important technique in an IPv6 world, where the large address space is difficult to reliably map.

One possible limitation of SpamFlow is that it may be unable to distinguish between a botnet host sending large volumes of spam and traffic from a host that is simply busy, or on a congested subnetwork. However, other transport-layer features are decoupled from congestion; for instance, a CPU-bound bot host will perform TCP flow control and advertise a small receiver window, an effect that SpamFlow uses as part of its decision process. Further, SpamFlow detects hosts that send volumes of email that exceed the local uplink and processing capacity. Personal, home, or small business servers do not have the same volume requirement as spammers and thus are unlikely to induce the same TCP congestion effects we observe.

In reality, there is a value judgment that makes SpamFlow practical and reasonable. Specifically, users who wish to ensure that their emails are delivered typically invest in suitable infrastructure, contract with an outside provider, or use their service provider's email systems. Companies are not sourcing large amounts of crucial email from hosts attached by consumer-grade connections, and the vast majority of home users utilize their provider's email infrastructure or employ popular web-based services. Thus, SpamFlow only discriminates against sources that are both poorly connected and injecting large volumes of mail. However, in future work, we plan to experiment with the sensitivity of SpamFlow to false positives originating from congestion induced by other nodes and other applications. We believe there will remain adequate discriminatory signal to discern botnet hosts. Even when SpamFlow does mispredict, our results show that combining SpamFlow with other classifiers leads to improved performance and can overcome instances of false positives by individual classifiers.

6 Conclusions and Future Work

This research implemented the necessary infrastructure to perform real-time, on-line transport-layer classification of email messages. We detail the system architecture to integrate network transport features with SpamAssassin, an MTA, and a classification engine. Our testing reveals that the system can handle realistic traffic loads. Of note, we tackle the bootstrapping problem of obtaining representative network traffic on a per-network basis by leveraging auto-learning to automatically train on exemplar messages. Using our techniques, we achieve accuracy, precision, and recall performance greater than 95 percent after receiving only ≈ 210 messages during live, real-world production testing. We plan to distribute our system as part of the third-party SpamAssassin plugin library in order to facilitate widespread deployment, make an impact on abusive messaging traffic, and refine the system.

We emphasize that these results come from observing only network traffic features; in actual deployment, the SpamFlow plugin will, as with other parts of the SpamAssassin system, place a weighted vote. Overall performance will likely improve using traditional features in addition to network traffic features. We note, however, that our live-testing corpus is small. Our intent in this work was to demonstrate the practical feasibility of using transport network traffic features. In future work, we plan to investigate SpamFlow's performance and scalability in large production systems against much larger volumes of traffic. Our hope is to enable the practical deployment of transport-layer based abusive traffic detection and mitigation techniques by system administrators.

Finally, we observe that the distributed computing platform offered by botnets enables a wide variety of attacks and scams beyond abusive email: botnets are employed in phishing attacks, scam infrastructure hosting, distributed denial-of-service (DDoS) attacks, and more. For example, some botnets effectively provide a Content Distribution Network (CDN) for hosting scam infrastructure. Botnet CDNs are used to host web sites (e.g. landing sites for ordering prescription pharmaceuticals or redirection servers), distribute malicious code, and serve a variety of other nefarious purposes. Still other botnets are employed to perform dictionary attacks against servers, or to brute force or otherwise solve CAPTCHAs [30] in order to create accounts on social network sites and further spread abusive traffic via multiple distribution channels. We believe transport-layer techniques generalize to any botnet-generated traffic, including phishing attacks, scam infrastructure hosting, DDoS, dictionary attacks, CAPTCHA solvers, etc. In future research, we wish to investigate using transport-level traffic analysis to identify a variety of botnet attacks and the bots themselves.

Acknowledgments

The authors would like to thank Geoffrey Xie, Le Nolan, Ryan Craven, and the anonymous reviewers. Special thanks to our shepherd Avleen Vig for invaluable feedback. This research was partially supported by a Cisco University Research Grant and by the NSF under grant OCI-1127506. Views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of NSF or the U.S. government.

References

[1] Arbor Networks. Worldwide infrastructure security report, 2010. http://www.arbornetworks.com/report.
[2] Barracuda Networks. Barracuda spam and virus firewall, 2011. http://www.barracudanetworks.com/.
[3] Beverly, R., Berger, A., Hyun, Y., and Claffy, K. Understanding the efficacy of deployed Internet source address validation filtering. In IMC '09: Proceedings of the 9th ACM SIGCOMM Internet Measurement Conference (2009).
[4] Beverly, R., and Sollins, K. Exploiting transport-level characteristics of spam. In Proceedings of the Fifth Conference on Email and Anti-Spam (CEAS) (Aug. 2008).
[5] Caglayan, A., Toothaker, M., Drapaeau, D., Burke, D., and Eaton, G. Behavioral analysis of fast flux service networks. In CSIIRW '09: Proceedings of the 5th Annual Workshop on Cyber Security and Information Intelligence Research (2009).
[6] Carbone, M., and Rizzo, L. Dummynet revisited. SIGCOMM Comput. Commun. Rev. 40 (Apr. 2010), 12-20.
[7] Cooke, E., Jahanian, F., and McPherson, D. The zombie roundup: Understanding, detecting, and disrupting botnets. In Proceedings of the USENIX Steps to Reducing Unwanted Traffic on the Internet (SRUTI) Workshop (July 2005).
[8] Cormack, G., and Lynam, T. TREC public email corpus, 2007. http://trec.nist.gov/data/spam.html.
[9] Demsar, J., Zupan, B., Leban, G., and Curk, T. Orange: From experimental machine learning to interactive data mining. In Principles of Data Mining and Knowledge Discovery (2004).
[10] Esquivel, H., Mori, T., and Akella, A. Router-level spam filtering using TCP fingerprints: Architecture and measurement-based evaluation. In Proceedings of the Sixth Conference on Email and Anti-Spam (CEAS) (2009).
[11] Hao, S., Syed, N. A., Feamster, N., Gray, A. G., and Krasser, S. Detecting spammers with SNARE: Spatio-temporal network-level automatic reputation engine. In Proceedings of the 18th USENIX Security Symposium (2009).
[12] Hätönen, S., Nyrhinen, A., Eggert, L., Strowes, S., Sarolahti, P., and Kojo, M. An experimental study of home gateway characteristics. In Proceedings of the 10th Annual Conference on Internet Measurement, pp. 260-266.
[13] Jacobson, V., Leres, C., and McCanne, S. tcpdump, 1989. ftp://ftp.ee.lbl.gov.
[14] John, J. P., Moshchuk, A., Gribble, S. D., and Krishnamurthy, A. Studying spamming botnets using Botlab. In Proceedings of USENIX NSDI (Apr. 2009).
[15] Karlin, J., Forrest, S., and Rexford, J. Autonomous security for autonomous systems. Computer Networks 52, 15 (2008).
[16] Klensin, J. Simple Mail Transfer Protocol. RFC 5321 (Draft Standard), Oct. 2008.
[17] Mason, J. Filtering spam with SpamAssassin. In Proceedings of SAGE-IE (Oct. 2002).
[18] Messaging Anti-Abuse Working Group. Email metrics report, 2011. http://www.maawg.org/about/EMR.
[19] Ouyang, T., Ray, S., Rabinovich, M., and Allman, M. Can network characteristics detect spam effectively in a stand-alone enterprise? In Passive and Active Measurement (2011).
[20] Pathak, A., Qian, F., Hu, Y. C., Mao, Z. M., and Ranjan, S. Botnet spam campaigns can be long lasting: Evidence, implications, and analysis. In SIGMETRICS '09: Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems (2009), ACM, pp. 13-24.
[21] Rajab, M. A., Zarfoss, J., Monrose, F., and Terzis, A. My botnet is bigger than yours (maybe, better than yours): Why size estimates remain challenging. In HotBots '07: Proceedings of the First Workshop on Hot Topics in Understanding Botnets (2007), USENIX Association.
[22] Ramachandran, A., and Feamster, N. Understanding the network-level behavior of spammers. In Proceedings of ACM SIGCOMM (Sept. 2006).
[23] Resnick, P. Internet Message Format. RFC 2822 (Proposed Standard), Apr. 2001.
[24] Schatzmann, D., Burkhart, M., and Spyropoulos, T. Inferring spammers in the network core. In Proceedings of the 10th International Conference on Passive and Active Network Measurement (2009), pp. 229-238.
[25] Stern, H. Fast SpamAssassin score learning tool, Jan. 2004. http://svn.apache.org/repos/asf/spamassassin/trunk/masses/README.perceptron.
[26] Winer, D. XML-RPC specification, Apr. 1998. http://www.xmlrpc.com/spec.
[27] Xie, Y., Yu, F., Achan, K., Panigrahy, R., Hulten, G., and Osipkov, I. Spamming botnets: Signatures and characteristics. SIGCOMM Comput. Commun. Rev. 38, 4 (2008), 171-182.
[28] Zhao, X., Pei, D., Wang, L., Massey, D., Mankin, A., Wu, S. F., and Zhang, L. An analysis of BGP multiple origin AS (MOAS) conflicts. In Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement (2001), pp. 31-35.
[29] Zhao, Y., Xie, Y., Yu, F., Ke, Q., Yu, Y., Chen, Y., and Gillum, E. BotGraph: Large scale spamming botnet detection. In NSDI '09: Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation (2009), pp. 321-334.
[30] Zhu, B. B., Yan, J., Li, Q., Yang, C., Liu, J., Xu, N., Yi, M., and Cai, K. Attacks and design of image recognition CAPTCHAs. In CCS '10: Proceedings of the 17th ACM Conference on Computer and Communications Security (2010).