2017 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS): MobiSec 2017: Security, Privacy, and Digital Forensics of Mobile
Systems and Networks
Statistical Network Behavior Based
Threat Detection
Jin Cao, Lawrence Drabeck and Ran He*
Nokia Bell Labs, Murray Hill
Email: {jin.cao, lawrence.drabeck, ran.he}@on.nokia-bell-labs.com
Abstract—Malware, short for malicious software, continues to
morph and change. Traditional anti-virus software may have
problems detecting malicious software that has not been seen
before. By employing machine learning techniques, one can learn
the general behavior patterns of different threat types and use
these to detect variants of unknown threats. We have developed
a malware detection system based on machine learning that uses
features derived from a user’s network flows to external hosts. A
novel aspect of our technique is to separate hosts into different
groups by how commonly they are visited by users and then
develop user features separately for each of these host groups.
The network data for the training of the detector is based on
malware samples that have been run in a sandbox and normal
users’ traffic that is collected from an LTE wireless network
provider. Specifically, we use the Adaboost algorithm as the
classification engine and obtain a good performance of 0.78%
false alarm rate and 96.5% accuracy for detecting users infected
with malwares. We also provide high and low confidence regions
for our system based on subclasses of threats.
I. INTRODUCTION
Internet security has become increasingly important as new
and varied threats seem to proliferate at a fast pace. Many
methods on threat detection rely on exact fingerprints of
known malwares. For example, anti-virus software looks for
file signatures and an Intrusion Detection System (IDS) uses the
network traffic signatures of threats such as specific character
strings in the contents of their HTTP GET/POST. These
methods enjoy a high precision rate, but are unable to detect
unknown threats. Therefore, there is intense interest in
developing methods that are effective at detecting unseen
threats during an outbreak, when the fingerprints are not known in
advance.
Towards this goal, and to avoid the costly inspection
of packet contents at high network speed, researchers have
focused on flow-based approaches where they analyze the
network flows instead of individual packets (see [1] for a
comprehensive survey.) Furthermore, due to recent developments of machine learning algorithms and big data analytics,
researchers have been increasingly relying on machine learning techniques in detecting malwares from the vast amount of
network data. However, these methods often target a specific
malware type such as Scans, Worms, Botnets and Denial of
Services attacks ([2], [3], [4]), and typically focus on threat
behavior patterns in a specific network application such as
HTTP or DNS. In summary, research on a general purpose
threat detection scheme is lacking.
* alphabetical author list
978-1-5386-2784-6/17/$31.00 ©2017 IEEE
The objective of our study is to develop an effective system
sitting on a network perimeter that can be used to detect
and alert the end users who are infected by a wide class
of known/unknown threats. Our detection is done from the
perspective of the individual end user, where the classification
is done by examining the network traffic patterns between this
end user and all the network hosts it communicates with. We
contrast this with network based approaches where traffic from
all end users is first aggregated and then analyzed (e.g. [5]).
As our threat detection system is designed to be general-purpose, we intentionally downplay the use of domain knowledge in our study and instead focus heavily on using machine learning to derive insights automatically. This focus is
evident in different stages from exploratory data analysis to
feature extraction to statistical model building and evaluation.
Specifically, we analyze the network traffic from an infected
user and compare it with the network traffic from a typical
normal user’s device. We learn the malicious behavior patterns
of a wide class of malwares and design a feature set that
can capture these patterns. Based on these features, predictive
models are built and used to detect an unknown malware.
A main difficulty in our task, especially in the feature design, is that the traffic from an infected user is in fact a mixture
of malicious and normal activities. Hence for the infected user,
even though its malware activity can be distinguished easily
from its normal activity, detecting malware activity from the
mixture is much harder, as the statistical features from the
malware may get lost in the mixture. To address this challenge,
we introduce the concept of host slicing, where traffic from an
individual user to external hosts is first separated (or sliced)
by these host slices and features are extracted for these slices
separately. If there is a good correlation between the malware
activities and the host slices, such separation will make the
detection much easier.
In particular, through exploratory analysis we have found
that the malware activities tend to engage unpopular external
hosts much more than the normal activities do. For example,
non-threat users are found to visit unpopular hosts 3% of the
time whereas the threat users visit unpopular hosts 40% of
the time. Based on this observation, we design two host slices.
The first slice, which we refer to as non-rare hosts, contains all
hosts that are visited somewhat frequently by a normal user.
The second slice, which we refer to as rare hosts, contains
the rest of the hosts. For an individual user, traffic between
the user and these host slices are first split apart, and then
features are computed separately for each host slice.
The following are our key contributions:
• We develop a general purpose threat detection system for
a wide class, not just specific types of threats, from the
perspective of individual end users.
• We establish malwares’ network behavior characteristics
by contrasting them with normal network activities from
ordinary users, and develop features to characterize these
behaviors. It is shown later that host and time pattern based
features are among the most important features in our
built system.
• We classify infected users under a mixed environment
where the network traffic from an infected user is a
mixture of threat and normal activities. We achieve this
by dividing traffic between each user and its (external)
hosts into two host slices: traffic to rare hosts and traffic
to non-rare hosts, as we have shown that threat activities
engage rare hosts much more so than normal activities.
• In the mixed environment, we can achieve 96.5% accuracy for the detection of malwares while maintaining a
0.78% false positive rate.
• For a sub-population of threats covering 70% of all
malware samples with predefined behavior traits, the
detection accuracy is 98.3%, better than the overall 96.5%
detection rate.
The overall design of our system consists of the following
key steps: 1) an offline collection of threat and non-threat data
as training sources; 2) an offline feature design and model
tuning using traditional supervised learning process; 3) a near
real time implementation of the model as the threat detector.
Finally, we note that since our approach utilizes hostname-specific
features, our method would not be affected by encrypted traffic
as long as we can still accurately retrieve the hostnames.
II. DATA
The data used for our analysis and modeling are packet capture files (PCAP) that record the network traffic between the
users and the network. The captured pcap files are processed
to obtain data for later feature calculations. We extract two
types of data from the pcap files. The first type, which we
refer to as Basic Flow data (basicflow), is the traditional IP
flow defined as a 5-tuple of source IP and port, destination IP
and port and protocol, and records the number of packets and
bytes on a per flow basis for each direction. For several key
applications, namely, DNS, HTTP, SMTP, FTP and ICMP, we
also extracted Application Flow data (appflow) which gives
details on the application level information. For example, for
DNS, we would have DNS request and response code, the
resolved DNS names, etc. For appflows, whenever possible,
we also resolve the hostname from the application header.
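As an illustration of this extraction step, the following is a minimal sketch that aggregates packets from a pcap into 5-tuple basicflow records with per-direction packet and byte counts. It uses the scapy library, which is not necessarily the tooling behind our pipeline, and the canonical flow orientation is an assumption made purely for illustration.

# Minimal sketch (not our actual tooling): aggregate a pcap into
# 5-tuple "basicflow" records with per-direction packet/byte counts.
from collections import defaultdict
from scapy.all import rdpcap, IP, TCP, UDP  # assumes scapy is installed

def extract_basicflows(pcap_path):
    flows = defaultdict(lambda: {"up_pkts": 0, "up_bytes": 0,
                                 "down_pkts": 0, "down_bytes": 0})
    for pkt in rdpcap(pcap_path):
        if not pkt.haslayer(IP):
            continue
        ip = pkt[IP]
        if pkt.haslayer(TCP):
            sport, dport = pkt[TCP].sport, pkt[TCP].dport
        elif pkt.haslayer(UDP):
            sport, dport = pkt[UDP].sport, pkt[UDP].dport
        else:
            sport = dport = 0  # e.g. ICMP: no ports
        # Canonical flow key: orient the 5-tuple so both directions map to
        # the same record; "up"/"down" here denote the canonical orientation,
        # not literally the device uplink.
        fwd = (ip.src, sport, ip.dst, dport, ip.proto)
        rev = (ip.dst, dport, ip.src, sport, ip.proto)
        key, direction = (fwd, "up") if fwd <= rev else (rev, "down")
        flows[key][direction + "_pkts"] += 1
        flows[key][direction + "_bytes"] += len(ip)
    return flows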
A. Threat Data
Threat data is collected from our Motive Security Lab
covering a wide spectrum of known threats. The pcap files
were obtained by running the malicious code in a sandbox
and recording the network traffic. Since the original aim of
the sandbox study was to verify the signature of the threat, the
threat was run only until the signature was verified. Therefore,
many of the threat pcap files are fairly short in duration
and small in the number of appflows. To have sufficient
information on the threat behavior patterns for later feature
calculations, we limit the threats we use in the study to have
a minimum number of appflows (≥20).
We use two different sets of threat data, both captured by 2015.
1) Original 1700 Threats: This data set is comprised of
1713 different threats with one pcap file per threat. After
limiting the threats to have a minimum of 20 appflows, this
set is reduced to 503 threats and is comprised of 333 Win32,
165 Android and 5 other threats. The main Win32 threats are
26% Trojans, 17% Downloaders, 15% Adware, 9% Backdoor
and some Virus, Worms, Bots, Spyware, and Password Stealer
threats. The Android threats are 41% Trojans, 25% Mobile
Spyware, 15% Backdoor, 7% Adware and some Bots, Downloader and Spyware threats.
2) 0.5TB Threats: This set is 0.5 TB in size and has 1002
different threats with 2 to 2000 different pcap captures per
threat (381,828 total pcap files). The different pcap files per
threat have different MD5 signatures but are in the same class
of threat. After filtering these threats for a minimum of 20
appflows and < 600 sec duration (this matches our non-threat
data length), we have 78,719 different pcap files from 742
different threats.
B. Non-threat Data
Non-threat data, which represents data from normal users,
was obtained from a network tap in a wireless network
provider's LTE network in February 2016. The tap measures
the S1u (eNB to SGW) and S11 (MME to SGW) network
links but we only use the S1u traffic for this study. The
market we capture data for contains around 200 eNodeBs
and around 600 sectors and the data captured encompasses
70 minutes of time (450 GB of pcap files). After cleaning the
data, matching users to their assigned IPv4 and IPv6 address
and filtering, we obtain 56K user records. We only use a
randomly sampled representative set of 8.5K users’ records
(2.6M appflow records) for the study.
C. Synthetic Generation of Infected Users
The data we have as described above are either malware
or normal traffic (assuming that our collected mobile user
data do not contain any threats). However, traffic from an
infected user will contain a mix of malware traffic and normal
traffic. As our end goal is to build a detector of infected
users, we need to have another collection of network traffic
from infected users. In the following, we discuss our approach
of generating synthetic traffic of the infected users using the
existing malware and normal user data.
Put simply, what we do is a simple mixing of the
two traffic sources. The mixing is accomplished by choosing
pairs of a user and a threat randomly from our collection and
then interleaving the data. The start time of the malware traffic
in the mix is random so as not to bias the learning algorithm
to always look for the threat at the beginning of the data.
Fig. 1. Fraction of appflows to popular hosts that are visited by threats and normal users. (Two panels, "Threat" and "Non-threat": percentage of users on the vertical axis versus the fraction of appflows to popular hosts on the horizontal axis.)
We create two sets of synthetic infected users using the two
different threat datasets, and study the generalization of using
one set for model building and using the other set for testing.
Note that the 70-minute duration of the normal traffic is longer
than any of the above threat captures, implying that the threat data
will be completely embedded in the normal traffic without any
overrun of time when we create our synthetic users infected
with malwares.
To mimic a real-time application scenario, where our system
continuously monitors user traffic over a time window to detect
potential malwares, we apply a fixed stop time
for mixed traffic when we perform the mixing. That is, we
for mixed traffic when we perform the mixing. That is, we
discard all traffic after the pre-specified stop time. In our
implementation, we set the stop time to be 10 minutes.
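To make the mixing procedure concrete, below is a minimal sketch of the interleaving step under the assumptions just described: flow records are represented as (timestamp, record) pairs, the malware trace is shifted by a random start offset, and everything after the fixed stop time (10 minutes) is discarded. The record structure and function name are illustrative, not our actual implementation.

import random

def mix_traces(normal_flows, threat_flows, stop_time=600.0, seed=None):
    """Interleave a threat trace into a normal user's trace.

    Both traces are lists of (timestamp_seconds, record) pairs with
    timestamps relative to the start of each capture.  The threat trace
    is shifted by a random offset so the learner cannot assume malware
    activity always starts at time zero; all records after `stop_time`
    (10 minutes = 600 s) are dropped.
    """
    rng = random.Random(seed)
    threat_duration = max(t for t, _ in threat_flows) if threat_flows else 0.0
    # Random start for the malware inside the observation window.
    offset = rng.uniform(0.0, max(stop_time - threat_duration, 0.0))
    shifted = [(t + offset, rec) for t, rec in threat_flows]
    mixed = [(t, rec) for t, rec in (normal_flows + shifted) if t <= stop_time]
    return sorted(mixed, key=lambda x: x[0])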
III. EXPLORATORY DATA ANALYSIS
A key step in our study is to understand how the malware
network traffic from an infected user is different from its
normal behavior. We investigate this from various angles
such as time dynamics, interactions between the network
applications and the communicating hosts.
We extract from the exploratory data analysis a few key observations
that have also been widely explored in the literature. These
observations inspire us on the feature designs as well as our
innovative design of host slices, which is one of the key
contributions of this paper.
Observation 1. Many malwares exhibit periodic traffic patterns, even though the manner in which the periodicity
manifests can differ significantly. Sometimes the periodicity
is to individual hosts and sometimes to a group of hosts
collectively.
Observation 2. Malwares are more likely to produce failure
events, such as a high percentage of DNS failures.
Observation 3. Malware traffic and normal traffic tend to
show different time dynamics, and interactions between hosts
and application protocols tend to be different.
Observation 4. Normal non-threat traffic from actual users
tends to go to popular websites more often than malware traffic does.
Figure 1 shows the percentage of appflows that go to popular
hosts for both normal users and malwares. The popular hosts
are defined as the top 10,000 sites in the Alexa top-1M internet
sites list. As can be seen, for more than 80% of the threats,
their fraction of appflows to popular sites is less than 10%.
On the contrary, normal users have a much larger fraction of
their appflows going to popular sites. This is intuitive, as many
human activities are common: for example, we visit search
engines for information, go to news websites for the latest news,
visit social network sites for social activities, or do online
shopping. In contrast, malwares do not mimic human
behaviors and are more likely to visit obscure websites.
Of course, this is not to say that malware does not visit popular
websites at all, as some threats even use popular sites (like
Facebook) for command and control, but the majority of their traffic
does not mimic traffic resulting from human actions.

Fig. 2. Overall strategy of splitting network traffic into two slices for rare and non-rare hosts.
IV. FEATURE ENGINEERING
Although the statistical behaviors of threat and non-threat traffic are quite distinct, as we have shown earlier, we
face significant challenges in differentiating between the traffic
from an infected user and that from a normal user. This is
because the former will be a mixture of threat and non-threat
activities. In this case, statistical features from the threat traffic
may get lost in the mixture. For example, if threat traffic is
periodic, when it mixes with the user’s normal traffic, the
mixture would not have periodic patterns. Therefore, features
computed on the entire traffic mixture may not be effective
for our purpose.
To address the above challenge in the mixed environment,
our approach is to create host slices that are designed to make
the distinction between mixed traffic and normal traffic much
clearer. For an individual user, specifically, traffic between the
user and these host slices are first split apart, and then features
are extracted separately for each host slice.
A. Two Host Slices: Rare Hosts and Non-Rare Hosts
There are many ways to define host slices, such as host
clusters that have similar characteristics via clustering analysis.
However, in this paper, we employ a simple way to create host
slices in order to illustrate the effectiveness of our design; the
approach can certainly be generalized to other host slices. Specifically,
we create two host slices: one slice, which we refer to as
non-rare hosts, contains all hosts that are visited somewhat
frequently for a normal user; the other slice, which we refer
to as rare hosts, contains the rest of the hosts. This is inspired by
the observation in the exploratory analysis that threat traffic
often engages unpopular hosts (Figure 1). Our general strategy
is illustrated in Figure 2.
TABLE I
AVERAGE NUMBER OF APPFLOWS PER USER TO RARE/NON-RARE HOSTS.

Type          Non-rare hosts   Rare hosts
Threat        192              129  (40% are rare)
Normal user   188              5.4  (2.7% are rare)

To be more specific, we define a rare host as an infrequently
visited host by normal users that comes from an unpopular
domain, where domain refers to the 2nd level domain (e.g.
google.com) and host refers to the complete hostname or URI
(e.g. googlehosted.l.googleusercontent.com). We require the
rare host to come from an unpopular domain to accommodate
personalized service advertisement from popular domains such
as googleusercontent.com. We define infrequently visited host
and unpopular domain using both the Alexa 1M popular
domain list and our non-threat LTE cellular data. Precisely,
a rare host is defined as having both of these properties:
1) infrequently visited, where the number of visited users
from our non-threat LTE data ≤ u1
2) from an unpopular domain, where an unpopular domain
is defined by i) the number of visited users ≤ u2 and ii)
not in the Alexa top 1M domains.
In our experiment, we use u1 = 1 and u2 = 5. Table I
contains summary statistics of the sets of non-rare and rare
hosts in our datasets. The difference in behavior between the
threat and non-threat data indicates the possibility of using this
host grouping to unmix the non-threat and threat traffic.
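The rare-host rule above can be expressed as a short predicate. The sketch below assumes precomputed visit counts (the number of distinct non-threat users observed visiting each host and each 2nd-level domain) and a set of Alexa top-1M domains; the helper names are illustrative, not part of our implementation.

def second_level_domain(hostname):
    """e.g. 'googlehosted.l.googleusercontent.com' -> 'googleusercontent.com'.
    (Naive: ignores multi-part public suffixes such as 'co.uk'.)"""
    parts = hostname.lower().rstrip(".").split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else hostname.lower()

def is_rare_host(hostname, host_user_counts, domain_user_counts,
                 alexa_top_domains, u1=1, u2=5):
    """Rare host = infrequently visited host AND from an unpopular domain."""
    domain = second_level_domain(hostname)
    infrequent = host_user_counts.get(hostname, 0) <= u1
    unpopular = (domain_user_counts.get(domain, 0) <= u2
                 and domain not in alexa_top_domains)
    return infrequent and unpopular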
B. Features Characterizing Traffic Behavior Patterns for each
Host Slice
As shown in Figure 2, for each user we split its flow data
into two buckets, one for traffic to non-rare hosts and the
other for traffic to rare hosts. For each bucket, we derive user
features to characterize the statistical behavior of the network
traffic from the individual user to the set of external hosts.
These features fall largely into 6 categories, with each category
designed to capture one aspect of the network behavior.
1) Basic statistics: Basic statistics of the user traffic, such
as the number of appflows and the percentage of uplink
appflows. These basic statistics capture the general information
of the network behavior from a user device.
2) DNS-based features: These features allow us to characterize different properties of DNS appflows and include, for
example, the percentage of DNS failures and the ratio of DNS
responses to requests. Some of the features are adopted from
[5].
3) HTTP/HTTPS-based features: These features are designed for characterizing HTTP/HTTPS appflows as well as
the strings contained in HTTP URIs. These features include,
for example, the percentage of HTTP failures and the median
length of the URIs. Some of these features were suggested by
[4]. In fact, signature based threat detection relies heavily on
HTTP/HTTPS signatures. We examined 2,515 SNORT rules
used to detect our 1,713 threats and found that 1,502 threats
use very specific signatures in the HTTP/HTTPS traffic.
4) Hostname-based features: We do not use the exact
names of hosts as features because they do not generalize
well to new unobserved hosts. Our hostname based features
are designed to extract general information from host names,
which can be used for prediction. Examples include the number of
characters in host names, the length of the longest non-meaningful
strings in host names, and the percentage of numeric characters.
Some of these metrics are inspired by [2].
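As an illustration, the sketch below computes a few hostname-derived statistics of the kind described above. The particular definition of a "non-meaningful" string (here, the longest run of characters without vowels) is our own simplification for the sketch, not necessarily the exact feature used in the system.

import re

def hostname_features(hostnames):
    """Simple hostname-derived statistics for a set of visited hosts."""
    feats = {}
    lengths = [len(h) for h in hostnames]
    feats["median_hostname_length"] = sorted(lengths)[len(lengths) // 2] if lengths else 0
    digits = sum(c.isdigit() for h in hostnames for c in h)
    chars = sum(lengths)
    feats["pct_numeric_chars"] = 100.0 * digits / chars if chars else 0.0
    # Longest vowel-free run as a crude proxy for "non-meaningful" strings.
    runs = [len(r) for h in hostnames for r in re.split(r"[aeiou.\-\d]", h.lower())]
    feats["longest_nonmeaningful_run"] = max(runs) if runs else 0
    return feats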
5) Time pattern-based features: Based on our earlier observations (e.g. Observation 1 and 3 in Section III), threats
and non-threats show very different time patterns when they
communicate to external hosts. Therefore features computed
based on time patterns are very important. Examples of
these features are the median of inter-arrival times between
consecutive appflows and the number of hosts for which the
user's visiting pattern shows periodicity. Other examples
include the number of hosts that have bursty or short-lived
visiting patterns, where the bursty patterns are common when
a malware is activated and generates a sudden increase in the
volume of appflows, while transient patterns are most often seen
in normal users' activities, such as browsing a website. As we
will show in later evaluations, these time pattern-based features
are in fact very effective in distinguishing threats from normal
users.
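A minimal sketch of two such time-pattern statistics is given below: the median inter-arrival time of a user's appflows and a crude per-host periodicity count based on the spread of inter-arrival times. The periodicity heuristic (coefficient of variation below a threshold) is a stand-in we introduce for illustration; it is not necessarily the exact detector used in the system.

import statistics

def median_interarrival(timestamps):
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return statistics.median(gaps) if gaps else 0.0

def count_periodic_hosts(flows_by_host, min_flows=5, max_cv=0.2):
    """Count hosts whose visit pattern looks periodic: enough appflows and
    inter-arrival times with a low coefficient of variation."""
    periodic = 0
    for host, timestamps in flows_by_host.items():
        ts = sorted(timestamps)
        gaps = [b - a for a, b in zip(ts, ts[1:])]
        if len(gaps) < min_flows:
            continue
        mean = statistics.mean(gaps)
        if mean > 0 and statistics.pstdev(gaps) / mean <= max_cv:
            periodic += 1
    return periodic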
6) Basicflow-based features: Appflows only target specific
applications, while basicflows apply to all applications/ports.
This set of features consists of statistics extracted directly from
basicflows, for example, the number of unidirectional flows
(flows that have only an uplink or a downlink direction, but not both).
In total we developed a list of 54 features that capture the
statistical network traffic behavior from an individual user to
a set of external hosts. Since we divide the external hosts
into two buckets, rare hosts and non-rare hosts, we have a
total of 54×2=108 features for both traffic buckets altogether.
Additionally, we also design a set of 28 features that captures
the overall traffic pattern, such as the percentage of overall
appflows to hosts in the rare bucket. In total, our threat
detection system contains a list of 136 features.
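Putting the pieces together, the sketch below shows how a single user's feature vector could be assembled under this design: the same per-slice feature function is applied to the rare and non-rare buckets (54 features each, prefixed accordingly) and combined with the overall-traffic features (28). The helper functions are stand-ins for the feature categories described above, with only a couple of features shown.

def slice_features(bucket):
    """Stand-in for the 54 per-slice features; only two are shown here."""
    n = len(bucket)
    uplink = sum(1 for f in bucket if f.get("direction") == "up")
    return {"num_appflows": n,
            "pct_uplink": 100.0 * uplink / n if n else 0.0}

def overall_features(all_flows, rare_bucket):
    """Stand-in for the 28 overall features; only one is shown here."""
    n = len(all_flows)
    return {"pct_appflows_to_rare_hosts": 100.0 * len(rare_bucket) / n if n else 0.0}

def user_feature_vector(user_flows, is_rare):
    """Assemble the per-slice (54 + 54) and overall (28) features for one user.
    `user_flows` is a list of dicts with at least 'host' and 'direction' keys;
    `is_rare(host)` implements the rare-host rule from Section IV-A."""
    rare_bucket = [f for f in user_flows if is_rare(f["host"])]
    nonrare_bucket = [f for f in user_flows if not is_rare(f["host"])]
    features = {}
    for prefix, bucket in (("rare_", rare_bucket), ("nonrare_", nonrare_bucket)):
        for name, value in slice_features(bucket).items():
            features[prefix + name] = value
    features.update(overall_features(user_flows, rare_bucket))
    return features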
V. MACHINE LEARNING ALGORITHMS
Our primary goal in an unknown threat scenario is to
learn from observed data via building a model and to predict
whether new data contains infected traffic or not. This is
considered a classification problem in the terminology of
machine learning, i.e., identifying to which of two categories
(infected by malwares or not) a new user belongs, on the
basis of a training set of data containing observations whose
category memberships are known.
Algorithms: We have evaluated four standard machine
learning techniques: Naive Bayes, Classification Trees, Random Forest and Boosting Trees. Two primary factors for
the choice of the learning algorithms are: 1) accuracy and
2) treatment of missing data. The latter is important because
there are many missing values due to
our feature design. For example, for a normal user, if we
only observe traffic from non-rare hosts then many features
related to the rare traffic are missing. We evaluate these
four algorithms by cross-validation and pick Boosting Trees,
or more specifically, AdaBoost Tree, as it can achieve high
accuracies and handle missing data using surrogate variables.
Moreover, AdaBoost tree enjoys the benefit of being less
susceptible to the overfitting problem.
Parameter Tuning: To get better performance from the
AdaBoost Tree classifier, parameter tuning is necessary. There
are two regularization parameters for the model. One is the shrinkage
parameter ν, or learning rate, which scales the contribution of
each tree by a factor of ν. The other is the
number of iterations M, which controls the complexity of the
final ensemble of trees. A small value of ν typically requires a large
value of M in order to achieve comparable performance.
In practice, a small value of ν combined with a large value of M
is used to obtain better prediction performance. We tuned
these two parameters with the help of grid search and K-fold
cross-validation. Grid search divides the parameter
space (2-dimensional, since we have two parameters to be
tuned) into a grid of points, each corresponding to a pair
of (ν, M) values, and tries all of these pairs to find the one
that optimizes the testing performance. The performance is
evaluated by averaging the prediction results, such as the testing
misclassification rate, over all K folds in K rotations.
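A minimal scikit-learn sketch of this tuning loop is shown below. It is only an approximation of our setup: scikit-learn's AdaBoost implementation does not use surrogate splits for missing values, so missing features are simply imputed here, and the data and grid values are illustrative rather than the ones actually used.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# X: (n_users, n_features) matrix with NaNs for missing values,
# y: 1 = infected (threat), 0 = normal.  Random data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 136))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=500)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # stand-in for surrogate splits
    ("ada", AdaBoostClassifier()),                 # default decision-stump base learner
])
param_grid = {                                     # grid over (learning rate nu, iterations M)
    "ada__learning_rate": [0.1, 0.4],
    "ada__n_estimators": [200, 800],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)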
Performance Metrics: Throughout the study, we define
Non-threat instances as “Negative” examples and Threat instances as “Positive” examples. Our evaluation is primarily
based on two metrics associated with misclassification rates:
the False Negative Rate (FNR) and the False Positive Rate (FPR).
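With this convention (threat = positive), the two rates follow directly from the confusion matrix, as in the small sketch below.

def fpr_fnr(y_true, y_pred):
    """False Positive Rate and False Negative Rate with threat = positive (1)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)   # non-threat users
    positives = sum(1 for t in y_true if t == 1)   # infected users
    return (fp / negatives if negatives else 0.0,
            fn / positives if positives else 0.0)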
VI. EVALUATION
As explained in Section II-C, we generate synthetic infected users by mixing the threat dataset with normal users’
activity. In this section, we evaluate the performance of our
threat detection system using models built from this data and
highlight some discoveries. Our main data for evaluation uses
the original 1700 threats (Section II-A1) in the mixing and it
contains 1,995 “infected” users and 8,517 normal users as the
control set. Later, in Section VI-C, we study the generalization
performance of this system on a different mixed data using
the 0.5TB threat data (Section II-A2).
A. Results and Training Weight
We have chosen the AdaBoost Tree classifier for our system
due to its advantages mentioned in Section V. Its parameters are tuned via grid search and cross-validation and are
optimized to be (ν, M) = (0.4, 800). Then the following
experiment is performed using mixed dataset combined with
control set:
1) Compute all 136 features as in Section IV for each user
in these two datasets. Add to the feature list a label of
threat (mixed) or non-threat.
2) Randomly select 2/3 of the threats and 2/3 of the non-threat users from the feature matrix as the training
dataset. The remaining 1/3 are used as testing data.
3) Fit an AdaBoost classifier to the training data and test
on the testing data. Note that the training weight, i.e., the
normal-to-mix ratio, can be varied in order to control the
tradeoff between FPR and FNR (one way to do so is sketched below).
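One simple way to realize a given normal-to-mix training weight is to down-sample the normal (non-threat) users in the training set before fitting, as in the sketch below. This is just one possible realization we show for illustration; re-weighting samples in the loss would be an alternative.

import random

def downsample_to_ratio(normal_users, mixed_users, ratio=2.2, seed=0):
    """Return a training set whose normal-to-mix ratio is approximately
    `ratio`, obtained by down-sampling the normal users."""
    rng = random.Random(seed)
    target_normals = min(len(normal_users), int(round(ratio * len(mixed_users))))
    sampled = rng.sample(list(normal_users), target_normals)
    X = sampled + list(mixed_users)
    y = [0] * len(sampled) + [1] * len(mixed_users)   # 1 = infected (mixed) user
    return X, y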
TABLE II
TESTING RESULT FOR OUR SYSTEM WITH DIFFERENT NORMAL-TO-MIX TRAINING WEIGHTS.

Normal-to-mix ratio of 4.3:1
                      Predicted Non-threat   Predicted Threat   Error Rate
Observed Non-threat   2834                   5                  0.176%
Observed Threat       25                     640                3.76%

Normal-to-mix ratio of 2.2:1
                      Predicted Non-threat   Predicted Threat   Error Rate
Observed Non-threat   2817                   22                 0.78%
Observed Threat       23                     642                3.46%

Results are summarized in Table II. With the original normal-to-mix ratio
of the data itself (8517/1995 = 4.3:1), we obtain a False Positive Rate of
0.17% (mistakenly classifying non-threat as threat) and a False Negative
Rate of 3.76% (mistakenly classifying threat as non-threat). But in practice, different use
cases may have different preferences on the tradeoff between
FPR and FNR, and adjusting the normal-to-mix ratio makes this
tradeoff controllable. For example, with a normal-to-mix ratio of
2.2:1, the FNR is reduced to 3.46%, though the FPR slightly
increases to 0.78%.
B. Feature Reduction
We also investigate the results of our threat detection system
with a reduced number of features to better understand the
feature importance and minimize the computational costs of
our system. Our first attempt is to pick 6 meaningful feature
sets, each one corresponding to one aspect of features. We
will also show that the feedback from the Adaboost model
can help select a more informative set of features. Our first 6
reduced sets of features are:
• Case 1 - DNS-based features (20 features)
• Case 2 - HTTP-based features (28 features)
• Case 3 - Time/host-based features (62 features): features derived from the network traffic time dynamics
(such as burstiness, periodicity, short-liveness) and from
hostnames (e.g. the number of characters), irrespective of
the application.
• Case 4 - Failure-based features (16 features): features
related to DNS failures or HTTP failures.
• Case 5 - Rare host features (58 features): features
derived from traffic to rare hosts, i.e., the features falling
into rare bucket, as explained in Figure 2.
• Case 6 - Non-rare host features (54 features): features
derived from traffic to non-rare hosts, i.e., the features in
non-rare bucket.
Furthermore, one more feature set can be derived based
on the variable importance scores from the learned AdaBoost
Tree. During the construction of the trees, the AdaBoost algorithm
assigns a weight to each stage (tree), which in turn provides
the importance of the features used in these trees. This is
a fully automatic, machine-learning-based feature selection
procedure, which can be regarded as another advantage of the
AdaBoost Tree. The procedure consists of two steps: 1) select
the top 75 features based on variable importance; 2) perform
a correlation analysis to remove features that are correlated
with each other. Now we are left with 57 features.
• Case 7 - 57 chosen features (57 features): features based
on importance scores from the AdaBoost tree.
We now train 8 different classifiers based on the 8 cases
discussed above and evaluate the results.

TABLE III
TESTING RESULTS OF OUR SYSTEM WITH DIFFERENT SUBSETS OF FEATURES. THE TRAINING WEIGHT (NORMAL-TO-MIX RATIO) IS 2.2:1.

Cases                 Case   # Features   False Positive Rate   False Negative Rate
All Features          0      136          0.78%                 3.46%
DNS Features          1      20           2.88%                 10.8%
HTTP Features         2      28           3.65%                 10.1%
Time/Host Features    3      62           1.63%                 4.36%
Failure Features      4      16           4.98%                 60.3%
Rare Features         5      58           1.76%                 6.32%
Non-Rare Features     6      54           3.70%                 26.2%
57 Chosen Features    7      57           1.63%                 4.06%
We use a non-threat to threat ratio of 2.2:1 for the training, and the results are shown
in Table III. As is to be expected, using all 136 features leads
to the best result, with a FPR of 0.78% and a 3.46% FNR.
The feature set chosen with the feedback from the AdaBoost Tree
(57 Chosen Features - Case 7) also shows good results, with a 1.63%
FPR and a 4.06% FNR, which are the best results among the
reduced feature sets. This makes sense, as AdaBoost is able to
identify the most relevant feature set. Very similar to the results
of the AdaBoost-chosen features are those for the Time/Host
Feature set (Case 3), which shows the same 1.63% FPR and
a slightly higher 4.36% FNR. This shows the importance of
the Time/Host Feature set. The other notable feature set is the
Rare Features set (Case 5), which is only slightly worse than the
above two sets, with a FPR of 1.76% but a larger 6.32% FNR.
Non-Rare, Failure, HTTP, or DNS features by themselves
do not lead to very good results.
C. Generalizations
To examine the generalization ability of our developed
system to the detection of users infected with new, unknown
malwares, we apply the same classifier derived in Section VI-A
to a testing set built from a new set of mixed data. More specifically,
the Adaboost classifier is trained with the mixed dataset
containing the original 1700 threats, while the testing set consists
of the same non-threat data but mixed with the 0.5TB threat data
set instead (Section II-A2). Here, we use the classifier trained
with the normal-to-mix training weight of 2.2:1. The testing
result (FNR) on this different dataset is 4.69%, only a little
worse than the original 3.46%. This implies that
the proposed system generalizes well and is robust to new
threat types, which also indicates its usefulness in the real-world scenario of detecting users infected by unseen threats.
D. Performance on Specific Groups of Threats
Based on earlier exploratory data analysis, we have observed
that some threats have similar network behaviors that are distinct
from those of normal users. To understand how our threat detection
system behaves for specific kinds of threats, we break the
threats into two groups according to some predefined suspicious behavior. Specifically, we define Type 1 Threats as
having any of the following properties:
• high percentage of DNS failures (non-existent domains),
• periodic traffic to rare hosts,
• large percentage of traffic to rare hosts.
In our original 1700 threat data, “Type 1 Threats” covers
71.76% of all different threats. All other threats are categorized
as “Type 2 Threats”, which are less suspicious in terms of
network traffic behavior and may be harder to detect. Results
from clustering the threats into these two categories are shown
in Table IV, implying that these two threat clusters lead to
two different confidence regions of detection.

TABLE IV
TESTING RESULT WITH HIGH AND LOW CONFIDENCE REGIONS.

                          Predicted Non-threat   Predicted Type 1 Threat   Predicted Type 2 Threat   Error Rate
Observed Non-threat       2817                   16                        6                         0.781% (FPR)
Observed Type 1 Threat    9                      516                       NA                        1.71% (FNR)
Observed Type 2 Threat    14                     NA                        126                       10.0% (FNR)

In particular, the FNR for “Type 1 Threats” is only 1.71%, much better than the
overall 3.46% FNR, indicating the high confidence region. On
the other hand, we manually check the 13 misclassified cases
(FP) in this region and find that these non-threats look very
suspicious. Recall that we assume all normal users’ data do not
contain threats, but this assumption may be too strong, in that
some user devices may actually be infected by malwares, and
these 13 instances may belong to those “mislabelled” cases,
though it is also possible that the classifier could do a better
job on these misclassified cases with further investigation.
VII. FUTURE WORK
For future work, we would like to improve our model
accuracy by incorporating a new two-stage modeling strategy,
which can potentially improve the detection accuracy without
much additional computational cost. The first stage is to
identify potential candidates for infected users with our current
methodology. Then for these potential candidates, we will
employ a second-level model that encompasses a richer feature
set to further fine tune the classification. We also need to
conduct simulation experiments in a more real time/streaming
scenario to quantify our system performance.
REFERENCES
[1] A. Sperotto, G. Schaffrath, R. Sadre, C. Morariu, A. Pras, and B. Stiller,
“An overview of IP flow-based intrusion detection,” IEEE Communications
Surveys & Tutorials, vol. 12, no. 3, pp. 343–356, 2010.
[2] P. Camelo, J. Moura, and L. Krippahl, “Condenser: A graph-based
approach for detecting botnets,” arXiv preprint arXiv:1410.8747, 2014.
[3] M. Antonakakis, R. Perdisci, Y. Nadji, N. Vasiloglou, S. Abu-Nimeh,
W. Lee, and D. Dagon, “From throw-away traffic to bots: detecting the
rise of dga-based malware,” in Presented as part of the 21st USENIX
Security Symposium (USENIX Security 12), pp. 491–506, 2012.
[4] R. Perdisci, W. Lee, and N. Feamster, “Behavioral clustering of HTTP-based malware and signature generation using malicious network traces,”
in NSDI, pp. 391–404, 2010.
[5] L. Bilge, E. Kirda, C. Kruegel, and M. Balduzzi, “Exposure: Finding
malicious domains using passive DNS analysis,” in NDSS, 2011.
[6] K. Rieck, P. Trinius, C. Willems, and T. Holz, “Automatic analysis of
malware behavior using machine learning,” Journal of Computer Security,
vol. 19, no. 4, pp. 639–668, 2011.
[7] M. Antonakakis, R. Perdisci, W. Lee, N. Vasiloglou II, and D. Dagon,
“Detecting malware domains at the upper DNS hierarchy,” in USENIX
Security Symposium, p. 16, 2011.
[8] B. Binde, R. McRee, and T. J. O'Connor, “Assessing outbound traffic to
uncover advanced persistent threat,” SANS Institute Whitepaper, 2011.