2017 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS): MobiSec 2017: Security, Privacy, and Digital Forensics of Mobile Systems and Networks

Statistical Network Behavior Based Threat Detection

Jin Cao, Lawrence Drabeck and Ran He*
Nokia Bell Labs, Murray Hill
Email: {jin.cao, lawrence.drabeck, ran.he}@on.nokia-bell-labs.com

Abstract—Malware, short for malicious software, continues to morph and change. Traditional anti-virus software may have problems detecting malicious software that has not been seen before. By employing machine learning techniques, one can learn the general behavior patterns of different threat types and use them to detect variants of unknown threats. We have developed a malware detection system based on machine learning that uses features derived from a user's network flows to external hosts. A novel aspect of our technique is to separate hosts into different groups by how commonly they are visited by users and then develop user features separately for each of these host groups. The network data for training the detector is based on malware samples that have been run in a sandbox and on normal users' traffic collected from an LTE wireless network provider. Specifically, we use the AdaBoost algorithm as the classification engine and obtain good performance: a 0.78% false alarm rate and 96.5% accuracy for detecting users infected with malware. We also provide high and low confidence regions for our system based on subclasses of threats.

I. INTRODUCTION

Internet security has become increasingly important as new and varied threats proliferate at a fast pace. Many threat detection methods rely on exact fingerprints of known malware. For example, anti-virus software looks for file signatures, and an Intrusion Detection System (IDS) uses the network traffic signatures of threats, such as specific character strings in the contents of their HTTP GET/POST requests. These methods enjoy a high precision rate but are unable to detect unknown threats. Therefore, there is intense interest in developing methods that can effectively detect unseen threats during an outbreak, when fingerprints are not yet known in advance. Toward this goal, and to avoid the costly inspection of packet contents at high network speed, researchers have focused on flow-based approaches that analyze network flows instead of individual packets (see [1] for a comprehensive survey). Furthermore, owing to recent developments in machine learning algorithms and big data analytics, researchers have been increasingly relying on machine learning techniques to detect malware in the vast amount of network data. However, these methods often target a specific malware type such as scans, worms, botnets and denial-of-service attacks ([2], [3], [4]), and typically focus on threat behavior patterns in a specific network application such as HTTP or DNS. In summary, research on a general-purpose threat detection scheme is lacking.

* alphabetical author list

The objective of our study is to develop an effective system, sitting on a network perimeter, that can be used to detect and alert end users who are infected by a wide class of known and unknown threats. Our detection is done from the perspective of the individual end user, where the classification is performed by examining the network traffic patterns between this end user and all the network hosts it communicates with.
We contrast this with network-based approaches where traffic from all end users is first aggregated and then analyzed (e.g., [5]). As our threat detection system is designed to be general-purpose, we intentionally downplay the use of domain knowledge in our study and instead focus heavily on using machine learning to derive insights automatically. This focus is evident at every stage, from exploratory data analysis to feature extraction to statistical model building and evaluation. Specifically, we analyze the network traffic from an infected user and compare it with the network traffic from a typical normal user's device. We learn the malicious behavior patterns of a wide class of malware and design a feature set that can capture these patterns. Based on these features, predictive models are built and used to detect unknown malware.

A main difficulty in our task, especially in the feature design, is that the traffic from an infected user is in fact a mixture of malicious and normal activities. Hence, even though an infected user's malware activity could easily be distinguished from its normal activity in isolation, detecting the malware activity within the mixture is much harder, as the statistical features of the malware may get lost in the mixture. To address this challenge, we introduce the concept of host slicing, where traffic from an individual user to external hosts is first separated (or sliced) into host groups and features are extracted for each slice separately. If there is a good correlation between the malware activities and the host slices, such separation makes detection much easier. In particular, through exploratory analysis we have found that malware activities tend to engage unpopular external hosts much more than normal activities do. For example, non-threat users are found to visit unpopular hosts 3% of the time, whereas threat users visit unpopular hosts 40% of the time. Based on this observation, we design two host slices. The first slice, which we refer to as non-rare hosts, contains all hosts that are visited somewhat frequently by a normal user. The second slice, which we refer to as rare hosts, contains the rest of the hosts. For an individual user, traffic between the user and these host slices is first split apart, and then features are computed separately for each host slice.

The following are our key contributions:
• We develop a general-purpose threat detection system for a wide class of threats, not just specific types, from the perspective of individual end users.
• We establish malware's network behavior characteristics by contrasting them with normal network activities from ordinary users and develop features to characterize these behaviors. It is shown later that host- and time-pattern-based features are among the most important features in our system.
• We classify infected users in a mixed environment where the network traffic from an infected user is a mixture of threat and normal activities. We achieve this by dividing the traffic between each user and its (external) hosts into two host slices, traffic to rare hosts and traffic to non-rare hosts, as we have shown that threat activities engage rare hosts much more than normal activities do.
• In the mixed environment, we achieve 96.5% accuracy for the detection of malware while maintaining a 0.78% false positive rate.
• For a sub-population of threats with predefined behavior traits, covering about 70% of all malware samples, the detection accuracy is 98.3%, better than the overall 96.5% detection rate.

The overall design of our system consists of the following key steps: 1) an offline collection of threat and non-threat data as training sources; 2) an offline feature design and model tuning using a traditional supervised learning process; and 3) a near-real-time implementation of the model as the threat detector. Finally, we note that because our approach utilizes hostname-specific features, encrypted traffic would not affect our method as long as we can still accurately retrieve the hostnames.

II. DATA

The data used for our analysis and modeling are packet capture (pcap) files that record the network traffic between the users and the network. The captured pcap files are processed to obtain data for later feature calculations. We extract two types of data from the pcap files. The first type, which we refer to as Basic Flow data (basicflow), is the traditional IP flow defined by a 5-tuple of source IP and port, destination IP and port, and protocol; it records the number of packets and bytes on a per-flow basis for each direction. For several key applications, namely DNS, HTTP, SMTP, FTP and ICMP, we also extract Application Flow data (appflow), which gives details at the application level. For example, for DNS we would have the DNS request and response code, the resolved DNS names, etc. For appflows, whenever possible, we also resolve the hostname from the application header.

A. Threat Data

Threat data is collected from our Motive Security Lab and covers a wide spectrum of known threats. The pcap files were obtained by running the malicious code in a sandbox and recording the network traffic. Since the original aim of the sandbox study was to verify the signature of the threat, the threat was run only until the signature was verified. Therefore, many of the threat pcap files are fairly short in duration and small in the number of appflows. To have sufficient information on the threat behavior patterns for later feature calculations, we limit the threats used in the study to those with a minimum number of appflows (≥ 20). We use two different threat data sets, both captured by 2015.

1) Original 1700 Threats: This data set comprises 1,713 different threats with one pcap file per threat. After limiting the threats to a minimum of 20 appflows, the set is reduced to 503 threats, consisting of 333 Win32, 165 Android and 5 other threats. The main Win32 threats are 26% Trojans, 17% Downloaders, 15% Adware, 9% Backdoor and some Virus, Worm, Bot, Spyware, and Password Stealer threats. The Android threats are 41% Trojans, 25% Mobile Spyware, 15% Backdoor, 7% Adware and some Bot, Downloader and Spyware threats.

2) 0.5TB Threats: This set is 0.5 TB in size and has 1,002 different threats with 2 to 2,000 different pcap captures per threat (381,828 total pcap files). The different pcap files per threat have different MD5 signatures but belong to the same threat class. After filtering these threats for a minimum of 20 appflows and a duration of less than 600 seconds (which matches our non-threat data length), we have 78,719 different pcap files from 742 different threats.
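As an illustration of this filtering step, the sketch below selects eligible captures from a table of parsed appflow records. The table layout (pcap_id and ts columns) is an assumption made for illustration, not the exact schema of our pipeline.

```python
import pandas as pd

# Hypothetical table of parsed appflow records, one row per appflow;
# the column names (pcap_id, ts) are illustrative, not the exact schema we use.
appflows = pd.read_csv("threat_appflows.csv")

def eligible_pcaps(appflows: pd.DataFrame, min_flows: int = 20,
                   max_duration_s: float = 600.0) -> pd.DataFrame:
    """Keep only captures with >= min_flows appflows and a capture duration
    below max_duration_s (the duration cut applied to the 0.5TB set)."""
    stats = appflows.groupby("pcap_id")["ts"].agg(
        n_flows="count", duration=lambda t: t.max() - t.min())
    keep = stats[(stats["n_flows"] >= min_flows) &
                 (stats["duration"] < max_duration_s)].index
    return appflows[appflows["pcap_id"].isin(keep)]

filtered = eligible_pcaps(appflows)
```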
B. Non-threat Data

Non-threat data, which represents traffic from normal users, was obtained from a network tap in a wireless network provider's LTE network in February 2016. The tap measures the S1u (eNB to SGW) and S11 (MME to SGW) network links, but we only use the S1u traffic for this study. The market for which we capture data contains around 200 eNodeBs and around 600 sectors, and the capture spans 70 minutes (450 GB of pcap files). After cleaning the data, matching users to their assigned IPv4 and IPv6 addresses, and filtering, we obtain 56K user records. We use a randomly sampled representative set of 8.5K users' records (2.6M appflow records) for the study.

C. Synthetic Generation of Infected Users

The data described above are either pure malware traffic or pure normal traffic (assuming that our collected mobile user data do not contain any threats). However, traffic from an infected user will contain a mix of malware and normal traffic. As our end goal is to build a detector of infected users, we need a collection of network traffic from infected users. In the following, we discuss our approach to generating synthetic traffic for infected users from the existing malware and normal user data. Put simply, we mix the two traffic sources. The mixing is accomplished by randomly choosing pairs of a user and a threat from our collection and then interleaving their data. The start time of the malware traffic in the mix is random, so as not to bias the learning algorithm to always look for the threat at the beginning of the data.

We create two sets of synthetic infected users using the two different threat datasets, and study generalization by using one set for model building and the other set for testing. Note that the 70-minute duration of the normal traffic is longer than all of the above threat datasets, implying that the threat data will be completely embedded in the normal traffic without any time overrun when we create our synthetic infected users. To mimic a real-time application scenario, in which our system continuously inspects user traffic within a time window to detect potential malware, we apply a fixed stop time to the mixed traffic when we perform the mixing. That is, we discard all traffic after the pre-specified stop time. In our implementation, we set the stop time to 10 minutes.
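The following sketch shows one way to realize this mixing for a single (user, threat) pair. It assumes each trace carries a relative timestamp column ts; the column names and the uniform choice of offset are illustrative assumptions rather than the exact implementation.

```python
import random
import pandas as pd

def mix_user_with_threat(user_flows: pd.DataFrame, threat_flows: pd.DataFrame,
                         stop_time_s: float = 600.0, rng=random) -> pd.DataFrame:
    """Interleave one normal user's appflows with one threat's appflows.
    Timestamps ('ts', seconds relative to the start of each capture) and the
    uniform offset are illustrative choices, not the exact implementation."""
    user, threat = user_flows.copy(), threat_flows.copy()
    threat_span = threat["ts"].max() - threat["ts"].min()
    # random start offset so the malicious traffic does not always begin at t = 0
    offset = rng.uniform(0, max(0.0, stop_time_s - threat_span))
    threat["ts"] = threat["ts"] - threat["ts"].min() + offset
    mixed = pd.concat([user.assign(source="normal"),
                       threat.assign(source="threat")], ignore_index=True)
    # fixed stop time: discard everything after the 10-minute window
    return mixed[mixed["ts"] <= stop_time_s].sort_values("ts").reset_index(drop=True)
```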
III. EXPLORATORY DATA ANALYSIS

A key step in our study is to understand how the malware network traffic from an infected user differs from its normal behavior. We investigate this from various angles, such as time dynamics and the interactions between network applications and the communicating hosts. We extract a few key observations from the exploratory data analysis, observations that have also been widely explored in the literature. These observations inspire our feature design as well as our design of host slices, which is one of the key contributions of this paper.

Observation 1. Many malware samples exhibit periodic traffic patterns, even though the manner in which the periodicity manifests can differ significantly. Sometimes the periodicity is toward individual hosts and sometimes toward a group of hosts collectively.

Observation 2. Malware is more likely to produce failure events, such as a high percentage of DNS failures.

Observation 3. Malware traffic and normal traffic tend to show different time dynamics, and the interactions between hosts and application protocols tend to be different.

Observation 4. Normal non-threat traffic from actual users tends to go to popular websites more often than malware traffic does.

Fig. 1. Fraction of appflows to popular hosts for threats and normal users.

Figure 1 shows the percentage of appflows that go to popular hosts for both normal users and malware. The popular hosts are defined as the top 10,000 sites in the Alexa 1M top internet sites list. As can be seen, for more than 80% of the threats, the fraction of appflows to popular sites is less than 10%. On the contrary, normal users have a much larger fraction of their appflows going to popular sites. This is intuitive, as many human activities are common: for example, we visit search engines for information, go to news websites for the latest news, visit social network sites for social activities, or do online shopping. Malware, on the contrary, does not mimic human behavior and is more likely to visit obscure websites. Of course, this is not to say that malware never visits popular websites, as some samples even use popular sites (like Facebook) for command and control, but the majority of their traffic does not mimic traffic resulting from human actions.
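The per-user quantity plotted in Figure 1 can be computed directly from the parsed appflow records. A minimal sketch follows, assuming a table with user_id and host columns and a set of Alexa top-10,000 names; the column names and the crude domain extraction are illustrative assumptions.

```python
import pandas as pd

def popular_fraction(appflows: pd.DataFrame, alexa_top10k: set) -> pd.Series:
    """Per-user fraction of appflows whose 2nd-level domain is in the Alexa
    top-10,000 list (the quantity summarized in Figure 1).  Column names
    ('user_id', 'host') and the naive 2nd-level-domain extraction are
    illustrative; a real system would use a public-suffix list."""
    sld = appflows["host"].str.split(".").str[-2:].str.join(".")
    return sld.isin(alexa_top10k).groupby(appflows["user_id"]).mean()
```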
IV. FEATURE ENGINEERING

Although the statistical behaviors of threat and non-threat traffic are quite distinct, as we have shown earlier, we face significant challenges in differentiating the traffic of an infected user from that of a normal user. This is because the former is a mixture of threat and non-threat activities, and in this case statistical features of the threat traffic may get lost in the mixture. For example, if threat traffic is periodic, then once it mixes with the user's normal traffic the mixture may no longer show periodic patterns. Therefore, features computed on the entire traffic mixture may not be effective for our purpose. To address this challenge in the mixed environment, our approach is to create host slices that are designed to make the distinction between mixed traffic and normal traffic much clearer. Specifically, for an individual user, traffic between the user and these host slices is first split apart, and then features are extracted separately for each host slice.

A. Two Host Slices: Rare Hosts and Non-Rare Hosts

There are many ways to define host slices, such as host clusters with similar characteristics obtained via clustering analysis. However, in this paper we employ a simple way of creating host slices in order to illustrate the effectiveness of our design; the approach can certainly be generalized to other host slices. Specifically, we create two host slices: one slice, which we refer to as non-rare hosts, contains all hosts that are visited somewhat frequently by a normal user; the other slice, which we refer to as rare hosts, contains the rest of the hosts. This is inspired by the observation from the exploratory analysis that threat traffic often engages unpopular hosts (Figure 1). Our general strategy is illustrated in Figure 2.

Fig. 2. Overall strategy of splitting network traffic into two slices for rare and non-rare hosts.

To be more specific, we define a rare host as a host that is infrequently visited by normal users and that comes from an unpopular domain, where domain refers to the 2nd-level domain (e.g., google.com) and host refers to the complete hostname or URI (e.g., googlehosted.l.googleusercontent.com). We require the rare host to come from an unpopular domain in order to accommodate personalized service advertisement from popular domains such as googleusercontent.com. We define "infrequently visited host" and "unpopular domain" using both the Alexa 1M popular domain list and our non-threat LTE cellular data. Precisely, a rare host is defined as having both of these properties: 1) infrequently visited, where the number of visiting users in our non-threat LTE data is ≤ u1; and 2) from an unpopular domain, where an unpopular domain is defined by i) the number of visiting users being ≤ u2 and ii) the domain not being in the Alexa top 1M domains. In our experiment, we use u1 = 1 and u2 = 5. Table I contains summary statistics of the sets of non-rare and rare hosts in our datasets. The difference in behavior between the threat and non-threat data indicates the possibility of using this host grouping to unmix the non-threat and threat traffic.

TABLE I
AVERAGE NUMBER OF APPFLOWS PER USER TO RARE/NON-RARE HOSTS.

  Type          Non-rare hosts   Rare hosts
  Threat        192              129  (40% are rare)
  Normal user   188              5.4  (2.7% are rare)
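A direct transcription of this rule is sketched below. The lookup tables of per-host and per-domain distinct-user counts are assumed to have been precomputed from the non-threat LTE data; the argument names are illustrative.

```python
def is_rare_host(host: str, domain: str,
                 users_per_host: dict, users_per_domain: dict,
                 alexa_1m_domains: set, u1: int = 1, u2: int = 5) -> bool:
    """Rare-host rule of Section IV-A.  users_per_host / users_per_domain map a
    hostname or 2nd-level domain to the number of distinct normal users that
    visited it in the non-threat LTE data; the argument names are illustrative."""
    infrequently_visited = users_per_host.get(host, 0) <= u1
    unpopular_domain = (users_per_domain.get(domain, 0) <= u2
                        and domain not in alexa_1m_domains)
    return infrequently_visited and unpopular_domain
```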
B. Features Characterizing Traffic Behavior Patterns for Each Host Slice

As shown in Figure 2, for each user we split its flow data into two buckets, one for traffic to non-rare hosts and the other for traffic to rare hosts. For each bucket, we derive user features that characterize the statistical behavior of the network traffic from the individual user to that set of external hosts. These features largely form 6 categories, with each category designed to capture one aspect of the network behavior.

1) Basic statistics: Basic statistics of the user traffic, such as the number of appflows and the percentage of uplink appflows. These basic statistics capture general information about the network behavior of a user device.

2) DNS-based features: These features characterize different properties of DNS appflows and include, for example, the percentage of DNS failures and the ratio of DNS responses to requests. Some of these features are adopted from [5].

3) HTTP/HTTPS-based features: These features are designed to characterize HTTP/HTTPS appflows as well as the strings contained in HTTP URIs. They include, for example, the percentage of HTTP failures and the median length of the URIs. Some of these features were suggested by [4]. In fact, signature-based threat detection relies heavily on HTTP/HTTPS signatures: we examined 2,515 SNORT rules used to detect our 1,713 threats and found that 1,502 threats use very specific signatures in the HTTP/HTTPS traffic.

4) Hostname-based features: We do not use the exact names of hosts as features because they do not generalize well to new, unobserved hosts. Our hostname-based features are instead designed to extract general information from host names that can be used for prediction, for example, the number of characters in host names, the length of the longest non-meaningful string in host names, and the percentage of numeric characters. Some of these metrics are inspired by [2].

5) Time pattern-based features: Based on our earlier observations (e.g., Observations 1 and 3 in Section III), threats and non-threats show very different time patterns when they communicate with external hosts. Therefore, features computed from time patterns are very important. Examples of these features are the median inter-arrival time between consecutive appflows and the number of hosts for which the user's visiting pattern shows periodicity. Other examples include the number of hosts with bursty or short-lived visiting patterns, where bursty patterns are common when a malware is activated and generates a sudden increase in the volume of appflows, while transient patterns are mostly seen in normal users' activities such as browsing a website. As we show in later evaluations, these time pattern-based features are in fact very effective in distinguishing threats from normal users.

6) Basicflow-based features: Appflows only target specific applications, while basicflows apply to all applications and ports. This set of features consists of statistics extracted directly from basicflows, for example, the number of unidirectional flows (flows that have only an uplink or a downlink, but not both).

In total we developed a list of 54 features that capture the statistical network traffic behavior from an individual user to a set of external hosts. Since we divide the external hosts into two buckets, rare hosts and non-rare hosts, we have a total of 54 × 2 = 108 features for the two traffic buckets altogether. Additionally, we design a set of 28 features that capture the overall traffic pattern, such as the percentage of overall appflows going to hosts in the rare bucket. In total, our threat detection system contains a list of 136 features.
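To make the per-slice feature computation concrete, the sketch below assembles a handful of the features named above for one user. The column names and the specific features shown are illustrative assumptions; the full system computes 54 features per slice plus 28 overall features.

```python
import numpy as np
import pandas as pd

def slice_features(flows: pd.DataFrame, prefix: str) -> dict:
    """A few illustrative per-slice features from Section IV-B (the real system
    computes 54 per slice); column names such as 'app', 'direction', 'dns_rcode',
    'ts' and 'host' are placeholders for the parsed appflow fields."""
    f = {}
    f[prefix + "n_appflows"] = len(flows)
    f[prefix + "pct_uplink"] = (flows["direction"] == "up").mean() if len(flows) else np.nan
    dns = flows[flows["app"] == "DNS"]
    f[prefix + "pct_dns_fail"] = (dns["dns_rcode"] != 0).mean() if len(dns) else np.nan
    gaps = flows["ts"].sort_values().diff().dropna()
    f[prefix + "median_iat"] = gaps.median() if len(gaps) else np.nan
    host_chars = flows["host"].dropna().str.cat()
    f[prefix + "pct_numeric_hostchars"] = (
        sum(c.isdigit() for c in host_chars) / len(host_chars) if host_chars else np.nan)
    return f

def user_features(flows: pd.DataFrame, rare_mask: pd.Series) -> dict:
    """Split one user's appflows into rare / non-rare buckets (Figure 2) and
    compute features for each, plus one overall mixing feature."""
    feats = {}
    feats.update(slice_features(flows[~rare_mask], "nonrare_"))
    feats.update(slice_features(flows[rare_mask], "rare_"))
    feats["pct_appflows_rare"] = rare_mask.mean() if len(flows) else np.nan
    return feats
```

Note that a user with no traffic in one bucket naturally produces missing values for that bucket's features, which is why the classifier's treatment of missing data matters in the next section.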
V. MACHINE LEARNING ALGORITHMS

Our primary goal in an unknown-threat scenario is to learn from observed data by building a model and to predict whether new data contains infected traffic or not. In machine learning terminology, this is a classification problem, i.e., identifying to which of two categories (infected by malware or not) a new user belongs, on the basis of a training set of observations whose category memberships are known.

Algorithms: We have evaluated four standard machine learning techniques: Naive Bayes, Classification Trees, Random Forest and Boosting Trees. Two primary factors in the choice of learning algorithm are 1) accuracy and 2) treatment of missing data. The latter is important because our feature design produces many missing values. For example, if for a normal user we only observe traffic to non-rare hosts, then many features related to the rare traffic are missing. We evaluate the four algorithms by cross-validation and pick Boosting Trees, or more specifically the AdaBoost Tree, as it achieves high accuracy and handles missing data using surrogate variables. Moreover, the AdaBoost Tree has the benefit of being less susceptible to overfitting.

Parameter Tuning: To get better performance from the AdaBoost Tree classifier, parameter tuning is necessary. There are two regularizing parameters for the model: a shrinkage parameter ν, or learning rate, which scales the contribution of each tree by a factor of ν, and the number of iterations M, which controls the complexity of the final ensemble of trees. A small value of ν typically requires a large value of M to achieve comparable performance. In practice, a small ν combined with a large M is used to obtain better prediction performance. We tuned these two parameters using grid search and K-fold cross-validation. Grid search divides the parameter space (two-dimensional, since we have two parameters to tune) into a grid of candidate (ν, M) pairs and tries all of them to find the pair that optimizes the testing performance. Performance is evaluated by averaging the prediction results, such as the testing misclassification rate, over the K folds in K rotations.

Performance Metrics: Throughout the study, we define non-threat instances as "negative" examples and threat instances as "positive" examples. Our evaluation is primarily based on two misclassification metrics, the False Negative Rate (FNR) and the False Positive Rate (FPR).

VI. EVALUATION

As explained in Section II-C, we generate synthetic infected users by mixing the threat dataset with normal users' activity. In this section, we evaluate the performance of our threat detection system using models built from this data and highlight some findings. Our main evaluation data uses the original 1700 threats (Section II-A1) in the mixing and contains 1,995 "infected" users and 8,517 normal users as the control set. Later, in Section VI-C, we study the generalization performance of this system on a different mixed dataset built from the 0.5TB threat data (Section II-A2).

A. Results and Training Weight

We have chosen the AdaBoost Tree classifier for our system due to the advantages mentioned in Section V. Its parameters are tuned via grid search and cross-validation and are optimized at (ν, M) = (0.4, 800). The following experiment is then performed on the mixed dataset combined with the control set:
1) Compute all 136 features of Section IV for each user in the two datasets, and add to the feature vector a label of threat (mixed) or non-threat.
2) Randomly select 2/3 of the threats and 2/3 of the non-threat users from the feature matrix as the training dataset; the remaining 1/3 is used as testing data.
3) Fit an AdaBoost classifier to the training data and evaluate it on the testing data.
Note that the training weight, i.e., the normal-to-mix ratio, can be varied in order to control the tradeoff between FPR and FNR. Results are summarized in Table II. With the original normal-to-mix ratio of the data itself (8517/1995 = 4.3:1), we obtain a False Positive Rate of 0.17% (non-threats mistakenly classified as threats) and a False Negative Rate of 3.76% (threats mistakenly classified as non-threats). In practice, however, different use cases may have different preferences for the tradeoff between FPR and FNR, and controlling the normal-to-mix ratio achieves this goal. For example, with a normal-to-mix ratio of 2.2:1, the FNR is reduced to 3.46%, though the FPR increases slightly to 0.78%.

TABLE II
TESTING RESULTS FOR OUR SYSTEM WITH DIFFERENT NORMAL-TO-MIX TRAINING WEIGHTS.

  Normal-to-mix ratio of 4.3:1   Predicted Non-threat   Predicted Threat   Error Rate
  Observed Non-threat            2834                   5                  0.176%
  Observed Threat                25                     640                3.76%

  Normal-to-mix ratio of 2.2:1   Predicted Non-threat   Predicted Threat   Error Rate
  Observed Non-threat            2817                   22                 0.78%
  Observed Threat                23                     642                3.46%
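The experiment above can be approximated with off-the-shelf tooling; a minimal stand-in is sketched below. It is not our exact implementation: the paper's AdaBoost Tree handles missing values via surrogate splits, whereas this sketch uses a histogram-based gradient-boosted tree that accepts NaNs natively, and the sample-weight scaling only emulates the normal-to-mix training weight.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def fit_and_score(X, y, nu=0.4, M=800, normal_to_mix=2.2, seed=0):
    """X: (n_users, 136) feature matrix with NaNs for missing slice features;
    y: 1 for mixed ("infected") users, 0 for normal users.  A stand-in for the
    paper's AdaBoost Tree, with sample weights approximating the training weight."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    # scale normal-class weights so the total weight ratio normal:mixed = normal_to_mix:1
    n_norm, n_mix = (y_tr == 0).sum(), (y_tr == 1).sum()
    w = np.where(y_tr == 0, normal_to_mix * n_mix / n_norm, 1.0)
    clf = HistGradientBoostingClassifier(learning_rate=nu, max_iter=M,
                                         random_state=seed)
    clf.fit(X_tr, y_tr, sample_weight=w)
    tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
    return {"FPR": fp / (fp + tn), "FNR": fn / (fn + tp)}
```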
B. Feature Reduction

We also investigate the results of our threat detection system with a reduced number of features, both to better understand feature importance and to minimize the computational costs of our system. Our first attempt is to pick 6 meaningful feature sets, each corresponding to one aspect of the features. We will also show that feedback from the AdaBoost model can help select a more informative set of features. The first 6 reduced feature sets are:
• Case 1 - DNS-based features (20 features)
• Case 2 - HTTP-based features (28 features)
• Case 3 - Time/host-based features (62 features): features derived from the time dynamics of the network traffic (such as burstiness, periodicity and short-livedness) and from hostnames (e.g., the number of characters), irrespective of the application.
• Case 4 - Failure-based features (16 features): features related to DNS failures or HTTP failures.
• Case 5 - Rare host features (58 features): features derived from traffic to rare hosts, i.e., the features falling into the rare bucket of Figure 2.
• Case 6 - Non-rare host features (54 features): features derived from traffic to non-rare hosts, i.e., the features in the non-rare bucket.
Furthermore, one more feature set can be derived from the variable importance scores of the learned AdaBoost Tree. During the construction of the trees, the AdaBoost algorithm assigns a weight to each stage (tree), which in turn provides the importance of the features used in these trees. This is a fully automatic, machine-learning-based feature selection procedure, which can be regarded as another advantage of the AdaBoost Tree. The procedure consists of two steps: 1) select the top 75 features based on variable importance; 2) perform a correlation analysis to remove features that are correlated with each other. This leaves 57 features.
• Case 7 - 57 chosen features (57 features): features selected based on importance scores from the AdaBoost Tree.
We now train 8 different classifiers, one for each of the 7 cases above plus the full feature set (Case 0), and evaluate the results. We use a non-threat to threat ratio of 2.2:1 for the training, and the results are shown in Table III.

TABLE III
TESTING RESULTS OF OUR SYSTEM WITH DIFFERENT SUBSETS OF FEATURES. THE TRAINING WEIGHT (NORMAL-TO-MIX RATIO) IS 2.2:1.

  Cases                Case   # Features   False Positive Rate   False Negative Rate
  All Features         0      136          0.78%                 3.46%
  DNS Features         1      20           2.88%                 10.8%
  HTTP Features        2      28           3.65%                 10.1%
  Time/Host Features   3      62           1.63%                 4.36%
  Failure Features     4      16           4.98%                 60.3%
  Rare Features        5      58           1.76%                 6.32%
  Non-Rare Features    6      54           3.70%                 26.2%
  57 Chosen Features   7      57           1.63%                 4.06%

As expected, using all 136 features leads to the best result, with a 0.78% FPR and a 3.46% FNR. The feature set chosen with feedback from the AdaBoost Tree (57 features, Case 7) also shows good results, with a 1.63% FPR and a 4.06% FNR, the best among the reduced feature sets. This makes sense, as AdaBoost is able to identify the most relevant features. Very similar to the AdaBoost-chosen features are the results for the Time/Host feature set (Case 3), which shows the same 1.63% FPR and a slightly higher 4.36% FNR; this highlights the importance of the time/host features. The other notable feature set is the rare-host features (Case 5), which is only slightly worse than the above two sets, with a 1.76% FPR but a larger 6.32% FNR. Non-rare, failure, HTTP or DNS features by themselves do not lead to very good results.
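A sketch of the two-step selection behind Case 7 is shown below. The correlation cutoff is not specified in the paper, so the threshold used here is an illustrative placeholder.

```python
import pandas as pd

def select_by_importance(importances: pd.Series, X: pd.DataFrame,
                         top_k: int = 75, corr_threshold: float = 0.9) -> list:
    """Two-step selection of Section VI-B: keep the top_k features by the
    boosted-tree importance scores, then drop any feature highly correlated
    with an already-kept, more important one.  The 0.9 correlation cutoff is an
    illustrative placeholder; the paper does not state the exact threshold."""
    ranked = importances.sort_values(ascending=False).index[:top_k]
    corr = X[ranked].corr().abs()
    kept = []
    for feat in ranked:  # best first
        if all(corr.loc[feat, k] < corr_threshold for k in kept):
            kept.append(feat)
    return kept
```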
C. Generalizations

To examine the ability of our system to generalize to the detection of users infected with new, unknown malware, we apply the same classifier derived in Section VI-A to a new set of mixed testing data. More specifically, the AdaBoost classifier is trained with the mixed dataset containing the original 1700 threats, while the testing set consists of the same non-threat data but mixed with the 0.5TB threat dataset instead (Section II-A2). Here, we use the classifier trained with a normal-to-mix training weight of 2.2:1. The testing result (FNR) on this different dataset is 4.69%, only slightly worse than the original 3.46%. This implies that the proposed system generalizes well and is robust to new threat types, which also indicates its usefulness for detecting users infected by unseen threats in real-world scenarios.

D. Performance on Specific Groups of Threats

From the earlier exploratory data analysis, we have observed that some threats share similar network behaviors that are distinct from those of normal users. To understand how our threat detection system behaves for specific kinds of threats, we break the threats into two groups according to some predefined suspicious behaviors. Specifically, we define Type 1 Threats as having any of the following properties:
• a high percentage of DNS failures (non-existent domains),
• periodic traffic to rare hosts,
• a large percentage of traffic to rare hosts.
In our original 1700 threat data, Type 1 Threats cover 71.76% of all different threats. All other threats are categorized as Type 2 Threats, which are less suspicious in terms of network traffic behavior and may be harder to detect. Results from splitting the threats into these two categories are shown in Table IV, and the two threat groups lead to two different confidence regions of detection. In particular, the FNR for Type 1 Threats is only 1.71%, much better than the overall 3.46% FNR, indicating a high-confidence region. On the other hand, we manually checked the 13 misclassified cases (false positives) in this region and found that these non-threats look very suspicious. Recall that we assume all normal users' data contain no threats, but this assumption may be too strong in that some user devices may actually be infected by malware, and these instances may belong to such "mislabelled" cases, though it is also possible that, with further investigation, the classifier could do a better job on these misclassified cases.

TABLE IV
TESTING RESULTS WITH HIGH AND LOW CONFIDENCE REGIONS.

  Observed \ Predicted     Non-threat   Type 1 Threat   Type 2 Threat   Error Rate
  Observed Non-threat      2817         16              6               0.781% (FPR)
  Observed Type 1 Threat   9            516             NA              1.71% (FNR)
  Observed Type 2 Threat   14           NA              126             10.0% (FNR)
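This grouping can be expressed as a simple rule over the per-user features. The sketch below is purely illustrative: the feature keys (two of which follow the earlier feature sketch) and the thresholds are hypothetical, since the paper does not publish the exact cutoffs used to label Type 1 Threats.

```python
def is_type1_threat(user_feats: dict, dns_fail_cut: float = 0.5,
                    rare_frac_cut: float = 0.4) -> bool:
    """Rule-of-thumb grouping from Section VI-D.  The feature keys and the two
    cutoffs are hypothetical placeholders, not the paper's exact values."""
    many_dns_failures = user_feats.get("rare_pct_dns_fail", 0) >= dns_fail_cut
    periodic_rare_traffic = user_feats.get("rare_n_periodic_hosts", 0) > 0
    mostly_rare_traffic = user_feats.get("pct_appflows_rare", 0) >= rare_frac_cut
    return many_dns_failures or periodic_rare_traffic or mostly_rare_traffic
```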
VII. FUTURE WORK

For future work, we would like to improve our model accuracy by incorporating a two-stage modeling strategy, which can potentially increase detection accuracy without much additional computational cost. The first stage is to identify potential candidates for infected users with our current methodology. Then, for these potential candidates, we would employ a second-level model that encompasses a richer feature set to further fine-tune the classification. We also plan to conduct simulation experiments in a more realistic real-time/streaming scenario to quantify our system's performance.

REFERENCES

[1] A. Sperotto, G. Schaffrath, R. Sadre, C. Morariu, A. Pras, and B. Stiller, "An overview of IP flow-based intrusion detection," IEEE Communications Surveys & Tutorials, vol. 12, no. 3, pp. 343–356, 2010.
[2] P. Camelo, J. Moura, and L. Krippahl, "CONDENSER: A graph-based approach for detecting botnets," arXiv preprint arXiv:1410.8747, 2014.
[3] M. Antonakakis, R. Perdisci, Y. Nadji, N. Vasiloglou, S. Abu-Nimeh, W. Lee, and D. Dagon, "From throw-away traffic to bots: Detecting the rise of DGA-based malware," in Proc. 21st USENIX Security Symposium (USENIX Security 12), pp. 491–506, 2012.
[4] R. Perdisci, W. Lee, and N. Feamster, "Behavioral clustering of HTTP-based malware and signature generation using malicious network traces," in Proc. NSDI, pp. 391–404, 2010.
[5] L. Bilge, E. Kirda, C. Kruegel, and M. Balduzzi, "EXPOSURE: Finding malicious domains using passive DNS analysis," in Proc. NDSS, 2011.
[6] K. Rieck, P. Trinius, C. Willems, and T. Holz, "Automatic analysis of malware behavior using machine learning," Journal of Computer Security, vol. 19, no. 4, pp. 639–668, 2011.
[7] M. Antonakakis, R. Perdisci, W. Lee, N. Vasiloglou II, and D. Dagon, "Detecting malware domains at the upper DNS hierarchy," in Proc. USENIX Security Symposium, p. 16, 2011.
[8] B. Binde, R. McRee, and T. J. O'Connor, "Assessing outbound traffic to uncover advanced persistent threat," SANS Institute whitepaper, 2011.