MEF: Malicious Email Filter
A UNIX Mail Filter that Detects Malicious Windows Executables
Matthew G. Schultz, Eleazar Eskin, and Salvatore J. Stolfo
Computer Science Department, Columbia University
{mgs,eeskin,sal}@cs.columbia.edu
Abstract

We present Malicious Email Filter, MEF, a freely distributed malicious binary filter incorporated into Procmail that can detect malicious Windows attachments by integrating with a UNIX mail server. The system has three capabilities: detection of known and unknown malicious attachments, automatic propagation of detection models, and the ability to monitor the spread of malicious attachments.

The system filters malicious attachments from emails by using detection models obtained from data mining over known malicious attachments. It leverages research in data mining applied to malicious executables which allows the detection of previously unseen malicious attachments. These new malicious attachments are programs that are most likely undetectable by current virus scanners because detection signatures for them have not yet been generated. The system also allows for the automatic propagation of detection models from a central server. Finally, the system allows for monitoring and measurement of the spread of malicious attachments. The system will be released under the GPL.

1 Introduction

A serious security risk today is the propagation of malicious executables through email attachments. A malicious executable is defined to be a program that performs a malicious function, such as compromising a system's security, damaging a system, or obtaining sensitive information without the user's permission. Recently there have been some high-profile incidents with malicious email attachments, such as the ILOVEYOU virus and its clones. These malicious attachments caused an incredible amount of damage in a very short time. The Malicious Email Filter (MEF) project provides a tool for the protection of systems against malicious email attachments.

A freely available, open source email filter that operates from UNIX to detect malicious Windows binaries has many advantages. Operating from a UNIX host, the email filter could automatically filter the email each host receives. The UNIX server could either wrap the malicious email with a warning addressed to the user, or it could block the email. All of this could be done without the server's users having to scan attachments themselves or having to download updates for their virus scanners. This way the system administrator can be responsible for updating the filter instead of relying on end users.

The goal of this project is to design a freely distributed, data mining based filter which integrates with Procmail's pre-existent security filter [1]. This filter's purpose is to detect both the set of known malicious binaries and a set of previously unseen, but similar, malicious binaries. The system also provides a mechanism for automatically updating the detection models from a central server. Finally, the filter reports statistics on the malicious binaries that it processes, which allows for the monitoring and measuring of the propagation of malicious attachments across email hosts. The system will be released under the GPL.

The standard approach to protecting against malicious emails is to use a virus scanner. Commercial virus scanners can effectively detect known malicious executables, but unfortunately they cannot detect unknown malicious executables. The reason for this is that most of these virus scanners are signature based. For each known malicious binary, the scanner contains a byte sequence that identifies the malicious binary. However, an unknown malicious binary, one without a pre-existent signature, will most likely go undetected.

We built upon research at Columbia University on data mining methods to detect malicious binaries [5]. The idea is that by using data mining, knowledge of known malicious executables can be generalized to detect unknown malicious executables. Data mining methods are ideal for this purpose because they detect patterns in large amounts of data, such as byte code, and use these patterns to detect future instances in similar data along with detecting known instances. Our framework used classifiers to detect malicious executables. A classifier is a rule set, or detection model, generated by the data mining algorithm that was trained over a given set of training data.
Since the methods we describe are probabilistic, we
could also tell if a binary is borderline. A borderline binary
is a program that has similar probabilities for both classes
(i.e. could be a malicious executable or a benign program).
If it is a borderline case then there is an option in the network filter to send a copy of the malicious executable to a
central repository such as CERT. There, it can be examined
by human experts. When combined with the automatic distribution of detection models, this can reduce the time a
host is vulnerable to a malicious attachment.
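The borderline test described above can be sketched as a simple rule over the two class probabilities. This is a minimal illustrative sketch; the margin value and the function name are our own assumptions, not part of MEF:

```python
def triage(p_malicious: float, borderline_margin: float = 0.1) -> str:
    """Label a binary given the model's probability that it is malicious.

    A binary whose malicious probability is close to 0.5 (within the
    margin) is treated as borderline and forwarded for expert review.
    """
    p_benign = 1.0 - p_malicious
    if abs(p_malicious - p_benign) <= 2 * borderline_margin:
        return "borderline"  # send a copy to a central repository (e.g. CERT)
    return "malicious" if p_malicious > p_benign else "benign"

print(triage(0.52))  # near 50/50 -> "borderline"
print(triage(0.97))  # -> "malicious"
print(triage(0.03))  # -> "benign"
```

Raising the margin forwards more binaries for expert review; each host can tune it to trade detection rate against false positives, per its own policy.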
The detection model generation worked as follows.
The binaries were first statically analyzed to extract bytesequences, and then the classifiers were trained over a subset of the data. This generated a detection model that was
tested on a set of previously unseen data. We implemented a traditional, signature-based algorithm to compare its performance with that of the data mining algorithms. Using standard statistical cross-validation techniques, this framework had a detection rate of 97.76%, over double the detection rate of a signature-based scanner.
The organization of the paper is as follows. We first present how the system works and how it is integrated with Procmail. Second, we describe how the detection models are updated. Then we describe how the system can be used for monitoring the propagation of email attachments. Finally, we summarize the research in data mining for creating detection models that can generalize to new and unknown malicious binaries, and report the accuracy and performance of the system.
2 Incorporation into Procmail

MEF filters malicious attachments by replacing the standard virus filter found in Procmail with a data mining generated detection model. Procmail is invoked by the mail server to extract the attachment. Currently the mail server supported is sendmail.

This filter first decodes the binary and then examines it using a data mining based model. The filter evaluates the binary by comparing the byte strings found within it to the byte sequences contained in the detection model. A probability of the binary being malicious is calculated, and if it is greater than its likelihood of being benign then the executable is labeled malicious. Otherwise the binary is labeled benign.

Depending on the result of the filter, Procmail is used either to pass the mail untouched, if the attachment is determined to be not malicious, or to warn the recipient of the mail that the attachment is malicious.

2.1 Borderline Cases

Borderline binaries are binaries that have similar probabilities of being benign and malicious (e.g. a 50% chance it is malicious and a 50% chance it is benign). These binaries can then be analyzed by experts to determine whether they are malicious or not, and subsequently included in the future generation of detection models.

A simple metric to detect borderline cases and redirect them to an evaluating party would be to define a borderline case to be a case where the probability that it is malicious is above some threshold. There is a tradeoff between the detection rate of the filter and the false positive rate. The detection rate is the percentage of malicious email attachments detected, while the false positive rate is the percentage of normal attachments labeled as malicious. This threshold can be set based on the policies of the host.

Receiving borderline cases and updating the detection model is an important aspect of the data mining approach. The larger the data set used to generate models, the more accurate the detection models are, because borderline cases are executables that could potentially lower the detection and accuracy rates by being misclassified.

2.2 Update Algorithms

After a number of borderline cases have been received, it may be necessary to generate a new detection model. This would be accomplished by retraining the data mining algorithm over the data set containing the borderline cases along with their correct classification. This would generate a new detection model that, when combined with the old detection model, would detect a larger set of malicious binaries.

The system provides a mechanism for automatically combining detection models. Because of this we can send only the portions of the models that have changed as updates. This is important because the detection models can get very large. However, we want to update the model with each new malicious attachment discovered. In order to avoid constantly sending a large model to the filters, we can just send the information about the new attachment and have it integrated with the original model.

Combining detection models is possible because the underlying representation of the models is probabilistic. To generate a detection model, the algorithm counted the number of times that a byte string appeared in a malicious program versus the number of times that it appeared in a benign program. From these counts the algorithm computes the probability that an attachment is malicious, in a method described later in the paper. In order to combine the models, we simply need to sum the counts of the old model with the new information.

As shown in Figure 1, in Model A, the old detection model, a byte string occurred 99 times in the malicious class and 1 time in the benign class. In Model B, the update model, the same byte string was found 3 times in the malicious class and 4 times in the benign class. The combination of models A and B would state that the byte string occurred 102 times in the malicious class and 5 times in the benign class. The combination of A and B would be the new detection model after an update.

– Model A –
The byte string occurred in 99 malicious executables
The byte string occurred in 1 benign executable
– Model B –
The byte string occurred in 3 malicious executables
The byte string occurred in 4 benign executables
– Combined Model –
The byte string occurred in 102 malicious executables
The byte string occurred in 5 benign executables

Figure 1: Sample Combined Model resulting from applying the update model, B, to the old model, A.

The mechanism for propagating the detection models is through encrypted email. The Procmail filter can receive the detection model through a secure email sent to the mail server, and then automatically update the model. This email will be addressed and formatted in such a way that the Procmail filter will process it and update the detection models.

3 Monitoring the Propagation of Email Attachments

Tracking the propagation of email attachments would be beneficial in identifying the origin of malicious executables, and in estimating a malicious attachment's prevalence. The monitoring is done by having each host that is using the system log the malicious attachments and the borderline attachments that are sent to and from the host. This logging may or may not contain who the sender or receiver of the mail was, depending on the privacy policy of the host.

In order to log the attachments, we need a way to obtain an identifier for each attachment. We do this by computing a hash function over the byte sequences in the binary, obtaining a unique identifier.

The logs of malicious attachments are sent back to the central server. Since there is a unique identifier for each binary, we can measure the propagation of the malicious binaries across hosts by examining their logs. From these logs we can estimate how many copies of each malicious binary are circulating the Internet.

The current method for detailing the propagation of malicious executables is for an administrator to report an attack to an agency such as WildList [9]. The WildList is a list of the propagation of viruses in the wild and a list of the most prevalent viruses. This is not done automatically, but instead is based upon a report issued by an attacked host. Our method would reliably and automatically detail a malicious executable's spread over the Internet.

4 Methodology for Building Data Mining Detection Models

We gathered a large set of programs from public sources and separated the problem into two classes: malicious and benign executables. Each example in the data set is a Windows or MS-DOS format executable, although the framework we present is applicable to other formats. To standardize our data set, we used McAfee's [4] virus scanner and labeled our programs as either malicious or benign executables.

We split the data set into two subsets: the training set and the test set. The data mining algorithms used the training set while generating the rule sets, and after training we used the test set to test the accuracy of the classifiers over unseen examples.

4.1 Data Set

The data set consisted of a total of 4,301 programs split into 3,301 malicious binaries and 1,000 clean programs. The malicious binaries consisted of viruses, Trojans, and cracker/network tools. There were no duplicate programs in the data set, and every example in the set is labeled either malicious or benign by the commercial virus scanner. All labels are assumed to be correct.

All programs were gathered either from FTP sites or from personal computers in the Data Mining Lab at Columbia University. The entire data set is available from our public FTP site, ftp://ftp.cs.columbia.edu/pub/mgs28.

4.2 Detection Algorithms

We statically extracted byte sequence features from each binary example for the algorithms to use to generate their detection models. Features in a data mining framework are properties extracted from each example in the data set, such as byte sequences, that a classifier uses to generate detection models. These features were then used by the algorithms to generate detection models. We used hexdump [7], an open source tool that transforms binary files into hexadecimal files. After we generated the hexdumps we had features in the form of Figure 2, where each line represents a short sequence of machine code instructions.

646e 776f 2e73 0a0d 0024 0000 0000 0000
454e 3c05 026c 0009 0000 0000 0302 0004
0400 2800 3924 0001 0000 0004 0004 0006
000c 0040 0060 021e 0238 0244 02f5 0000
0001 0004 0000 0802 0032 1304 0000 030a

Figure 2: Example Set of Byte Sequence Features
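The feature extraction above can be approximated in a few lines. This is a hedged sketch under our own assumptions: we simply treat each aligned two-byte chunk of the file as one hexadecimal feature token (hexdump itself prints 16-bit words, so the byte order and exact tokenization MEF used may differ):

```python
def byte_word_features(data: bytes) -> list[str]:
    """Split a binary into aligned 2-byte hexadecimal words, roughly
    mimicking the hexdump-style feature lines shown in Figure 2."""
    return [data[i:i + 2].hex() for i in range(0, len(data), 2)]

# First bytes of a DOS/Windows executable header ("MZ" magic number)
features = byte_word_features(b"MZ\x90\x00\x03\x00")
print(features)  # ['4d5a', '9000', '0300']
```

Each such token then becomes one feature that the classifiers count per class during training.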
4.3 Signature-Based Approach

To compare our results with traditional methods we implemented a signature-based method. First, we calculated the byte sequences that were only found in the malicious executable class. These byte sequences were then concatenated together to make a unique signature for each malicious executable example. Thus each malicious executable signature contained only byte sequences found in the malicious executable class. To make the signature unique, the byte sequences found in each example were concatenated together to form one signature. This was done because a byte sequence that was only found in one class during training could possibly be found in the other class during testing [2], and lead to false positives when deployed.

Since the virus scanner that we used to label the data set had been trained over every example in our data set, it was necessary to implement a similar signature-based method to compare with the data mining algorithms. There was no way to use an off-the-shelf virus scanner and simulate the detection of new malicious executables, because these commercial scanners contained signatures for all the malicious executables in our data set. In our tests the signature-based algorithm was only allowed to generate signatures over the set of training data in order to compare it to the data mining based techniques. This allowed our data mining framework to be fairly compared to traditional scanners over new data.

4.4 Data Mining Approach

The classifier we incorporated into Procmail was a Naive Bayes classifier [6]. A Naive Bayes classifier computes the likelihood that a program is malicious given the features that are contained in the program. We assumed that there were similar byte sequences in malicious executables that differentiated them from benign programs, and that the class of benign programs had similar sequences that differentiated them from the malicious executables.

We used the Naive Bayes method to compute the probability of an executable being benign or malicious. The Naive Bayes method, however, required more than 1 GB of RAM to generate its detection model. To make the algorithm more efficient we divided the problem into smaller pieces that would fit in memory and trained a Naive Bayes algorithm over each of the subproblems. Each subproblem was to classify based on every sixth line of machine code instead of every line in the binary. For this we trained six Naive Bayes classifiers so that every byte-sequence line in the training set had been trained over. We then used a voting algorithm that combined the outputs from the six Naive Bayes methods. The voting algorithm was then used to generate the detection model.

4.5 Preliminary Results

To quantitatively express the performance of our method we show tables with the counts for true positives, true negatives, false positives, and false negatives. A true positive, TP, is a malicious example that is correctly tagged as malicious, and a true negative, TN, is a benign example that is correctly classified. A false positive, FP, is a benign program that has been mislabeled by an algorithm as a malicious program, while a false negative, FN, is a malicious executable that has been misclassified as a benign program.

We estimated our results over new executables by using 5-fold cross validation [3]. Cross validation is the standard method to estimate likely predictions over unseen data in data mining. For each set of binary profiles we partitioned the data into 5 equal-size partitions. We used 4 of the partitions for training and then evaluated the rule set over the remaining partition. Then we repeated the process 5 times, leaving out a different partition for testing each time. This gave us a reliable measure of our method's accuracy over unseen data. We averaged the results of these five tests to obtain a good measure of how the algorithm performs over the entire set.

4.6 New Executables

To evaluate the algorithms over new executables, the algorithms generated their detection models over the set of training data and then tested their models over the set of test data. This was done five times in accordance with 5-fold cross validation.

As shown in Table 1, the data mining algorithm had the higher detection rate of the methods we tested, 97.76% compared with the signature-based method's detection rate of 33.96%. Along with the higher detection rate, the data mining method had a higher overall accuracy, 96.88% vs. 49.31%. Its false positive rate, at 6.01%, though, was higher than the signature-based method's, 3.80%.

For the algorithms we plotted the detection rate vs. false positive rate using ROC curves [10]. ROC (Receiver Operating Characteristic) curves are a way of visualizing the trade-offs between detection and false positive rates. In this instance, the ROC curves in Figure 3 show how the data mining method can be configured for different applications. For a false positive rate less than or equal to 1% the detection rate would be greater than 70%, and for a false positive rate greater than 8% the detection rate would be greater than 99%.
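The count-based scoring that the Naive Bayes approach of Section 4.4 relies on can be sketched as follows. The tiny counts, the add-alpha smoothing, and the function names here are illustrative assumptions for the sketch, not MEF's actual model:

```python
import math

# Per-feature counts: byte string -> (times seen in malicious, times seen in benign)
counts = {
    "646e": (99, 1),   # appears almost only in malicious examples
    "776f": (3, 40),   # appears mostly in benign examples
}
n_malicious, n_benign = 120, 60  # illustrative class totals

def log_likelihood_ratio(features, counts, alpha=1.0):
    """Sum log P(feature | malicious) - log P(feature | benign) plus the
    log prior ratio, with add-alpha smoothing for unseen byte strings."""
    score = math.log(n_malicious) - math.log(n_benign)  # class priors
    for f in features:
        mal, ben = counts.get(f, (0, 0))
        score += math.log((mal + alpha) / (n_malicious + 2 * alpha))
        score -= math.log((ben + alpha) / (n_benign + 2 * alpha))
    return score  # > 0 means "more likely malicious than benign"

print(log_likelihood_ratio(["646e"], counts) > 0)  # True
print(log_likelihood_ratio(["776f"], counts) > 0)  # False
```

Because the model is just these per-class counts, it can be merged with an update model by summing counts, which is what makes the incremental updates of Section 2.2 possible.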
4.7 Known Executables
To evaluate the performance of the algorithms over known executables, the algorithms generated detection models over the entire set of data and then their performance was evalu-
Profile Type       | TP   | TN   | FP | FN   | Detection Rate | False Positive Rate | Overall Accuracy
Signature Method   | 1121 | 1000 |  0 | 2180 | 33.96%         | 0%                  | 49.31%
Data Mining Method | 3191 |  940 | 61 |   74 | 97.76%         | 6.01%               | 96.88%

Table 1: These are the results of classifying new malicious programs, organized by algorithm and feature. Note that the Data Mining Method had a higher detection rate and accuracy, while the Signature based method had the lowest false positive rate.
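The percentages in Table 1 follow from the standard definitions of the rates. As a sketch (the helper name is our own), checking the signature method's row:

```python
def rates(tp, tn, fp, fn):
    """Detection rate (recall on malicious), false positive rate, and
    overall accuracy, each as a percentage rounded to two decimals."""
    detection = 100.0 * tp / (tp + fn)
    false_pos = 100.0 * fp / (fp + tn)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return round(detection, 2), round(false_pos, 2), round(accuracy, 2)

# Signature method row of Table 1: TP=1121, TN=1000, FP=0, FN=2180
print(rates(1121, 1000, 0, 2180))  # (33.96, 0.0, 49.31)
```

Applying the same formulas to the data mining row gives slightly different percentages than the table, presumably because the reported rates are averages over the five cross-validation runs rather than ratios of the summed counts.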
Profile Type       | TP   | TN   | FP | FN | Detection Rate | False Positive Rate | Overall Accuracy
Signature Method   | 3301 | 1000 |  0 |  0 | 100%           | 0%                  | 100%
Data Mining Method | 3301 | 1000 |  0 |  0 | 100%           | 0%                  | 100%

Table 2: Results of classifying known malicious programs, organized by algorithm and feature. The Signature based method and the data mining method had the same results.
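The 5-fold cross-validation procedure described in Section 4.5 can be sketched as generic scaffolding (the classifier itself is elided; the partitioning scheme shown is one reasonable choice, not necessarily the exact one used):

```python
def five_fold_splits(examples):
    """Yield (train, test) pairs: the data is cut into 5 near-equal
    partitions and each partition is held out for testing exactly once."""
    k = 5
    folds = [examples[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [x for i, fold in enumerate(folds) if i != held_out for x in fold]
        yield train, test

data = list(range(10))
for train, test in five_fold_splits(data):
    assert sorted(train + test) == data  # every example is used exactly once
    assert len(test) == 2                # near-equal partitions
```

The five per-fold results are then averaged, as in Section 4.5, to estimate performance over unseen data.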
5 Data Mining Performance

The system required different time and space complexities for model generation and deployment.

5.1 Training

In order for the data mining algorithm to quickly generate the models, it required all calculations to be done in memory. Running the algorithm took in excess of a gigabyte of RAM. By splitting the data into smaller pieces, the algorithm could be run in memory with a small loss (about 1.5%) in accuracy and detection rate. The calculations could then be done on a machine with less than 1 GB of RAM. This splitting allowed the model generation to be performed in less than two hours.

[Figure 3 plot: ROC curves of detection rate vs. false positive rate for the Data Mining and Signature Based methods]

Figure 3: Data Mining ROC. Note that the Data Mining method has a higher detection rate than the signature method at false positive rates greater than 0.5%.
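The memory-saving split described in Sections 4.4 and 5.1 — six classifiers, each trained on every sixth line of byte code, combined by voting — can be sketched as follows. The majority-vote rule is our assumption for illustration, since the paper does not spell out the voting algorithm:

```python
def split_lines(lines, k=6):
    """Partition hexdump lines into k subproblems: classifier i sees
    lines i, i+k, i+2k, ..., so every line is covered exactly once."""
    return [lines[i::k] for i in range(k)]

def vote(predictions):
    """Combine the classifiers' outputs by simple majority (an
    illustrative rule; MEF's actual voting algorithm may differ)."""
    malicious_votes = sum(p == "malicious" for p in predictions)
    return "malicious" if malicious_votes > len(predictions) / 2 else "benign"

lines = [f"line{i}" for i in range(12)]
subproblems = split_lines(lines)
assert len(subproblems) == 6 and all(len(s) == 2 for s in subproblems)
print(vote(["malicious"] * 4 + ["benign"] * 2))  # malicious
```

Each subproblem fits in memory on its own, which is what brings the training footprint under 1 GB at a small cost in accuracy.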
ated by testing over the same examples they had seen during training.

As shown in Table 2, over known executables the methods had the same performance. Each had an accuracy of 100% over the entire data set, correctly classifying all the malicious examples as malicious and all the benign examples as benign.

The data mining algorithm detected 100% of known executables because it was merged with the signature-based method. Without merging, the data mining algorithm detected 99.87% of the malicious examples and misclassified 2% of the benign binaries as malicious. However, we have the signatures for the binaries that the data mining algorithm misclassified, and the algorithm can include those signatures in the detection model without lowering accuracy over unknown binaries. After the signatures for the executables that were misclassified during training had been generated and included in the detection model, the data mining model had a 100% accuracy rate over known executables.

5.2 During Deployment

Online evaluation of executables can take place with much less memory. The model could be stored in smaller pieces on the file system, which would facilitate quicker analysis than having one large model that needed to be loaded into memory for the evaluation of each binary. By not loading the whole model into memory during evaluation, we also avoid problems on computers with small amounts of memory (under 128 MB) and avoid taking memory away from the other processes running on the server.

Since the algorithm analyzes each line of byte code contained in the binary, the time required for each evaluation varies. However, this can be done efficiently.
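The count-summing model update of Section 2.2 (Figure 1), which also underlies the deployment-time updates discussed here, is straightforward to express. This is a minimal sketch with in-memory dictionaries standing in for the on-disk model format, which the paper does not specify:

```python
def merge_models(old, update):
    """Combine two detection models by summing, per byte string, the
    (malicious, benign) occurrence counts, as in Figure 1."""
    merged = dict(old)
    for byte_string, (mal, ben) in update.items():
        old_mal, old_ben = merged.get(byte_string, (0, 0))
        merged[byte_string] = (old_mal + mal, old_ben + ben)
    return merged

model_a = {"646e": (99, 1)}  # old model: 99 malicious, 1 benign
model_b = {"646e": (3, 4)}   # update model from the central server
print(merge_models(model_a, model_b))  # {'646e': (102, 5)}
```

Because merging is just addition, an update email need only carry the counts for new or changed byte strings, never the full model.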
6 Conclusions

The first contribution that we presented in this paper was a freely distributed filter for Procmail that detects known malicious Windows executables and previously unknown malicious Windows binaries from UNIX. The detection rate over new executables was over twice that of the traditional signature-based method, 97.76% compared with 33.96%.

In addition, the system we presented has the capability of automatically receiving updated detection models and the ability to monitor the propagation of malicious attachments.

One problem with traditional, signature-based methods is that in order to detect a new malicious executable, the program needs to be examined, a signature extracted from it, and that signature included in the anti-virus database. The difficulty with this method is that during the time required for a malicious program to be identified, analyzed, and its signatures distributed, systems are at risk from that program. Our methods may provide a defense during that time. With a low false positive rate, the inconvenience to the end user would be minimal while providing ample defense during the time before an update of models is available.

Virus scanners are updated about every month, and 240-300 new malicious executables are created in that time (8-10 a day [8]). Our method would catch roughly 216-270 of those new malicious executables without the need for an update, whereas traditional methods would catch only 87-109. Our method more than doubles the detection rate of signature-based methods.
References

[1] John Hardin. Enhancing E-Mail Security With Procmail, online publication, 2000. http://www.impsec.org/email-tools/procmail-security.html

[2] Jeffrey O. Kephart and William C. Arnold. Automatic Extraction of Computer Virus Signatures, 4th Virus Bulletin International Conference, pages 178-184, 1994.

[3] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection, IJCAI, 1995.

[4] McAfee. Homepage - McAfee.com, online publication, 2000. http://www.mcafee.com

[5] MEF Group. Malicious Email Filter Group, online publication, 2000. http://www.cs.columbia.edu/ids/mef/

[6] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine learning of rules and trees. In Machine Learning, Neural and Statistical Classification, Ellis Horwood, New York, pages 50-83, 1994.

[7] Peter Miller. Hexdump, online publication, 2000. http://www.pcug.org.au/~millerp/hexdump.html

[8] Steve R. White, Morton Swimmer, Edward J. Pring, William C. Arnold, David M. Chess, and John F. Morar. Anatomy of a Commercial-Grade Immune System, IBM Research White Paper, 1999. http://www.av.ibm.com/ScientificPapers/White/Anatomy/anatomy.html

[9] WildList Organization. Virus description of viruses in the wild, online publication, 2000. http://www.f-secure.com/virus-info/wild.html

[10] K. H. Zou, W. J. Hall, and D. Shapiro. Smooth non-parametric ROC curves for continuous diagnostic tests, Statistics in Medicine, 1997.