Dept. of Computer Science
College of Engineering, Cherthala
Alappuzha, India
Dept. of Computer Science
Dept. of Computer Science
College of Engineering, Cherthala
Alappuzha, India divya6725@gmail.com
I.
I NTRODUCTION manilaldl@gmail.com
Abstract
—
Recent botnets use DNS as command and control protocol to make their network traffic difficult to distinguish from legitimate web traffic. As current security measures are not considering DNS packets, it offers a covert C&C channel for botnets. This paper proposes a host based approach that extent user-intention based anomaly detection approach to detect bot triggered DNS packets. The method is very effective in detecting abnormal system events. Usually DNS queries are automatically issued by applications or the OS, and this method can be used to effectively detect malware triggered DNS traffic.
Keywords— Botnet detection; Command and control; DNS;
Network security
II.
College of Engineering, Cherthala
Alappuzha, India lominjoy@gmail.com
RELATED effectively detect malware triggered DNS traffic.
WORK
This paper proposes a host based approach that extent userintention based anomaly detection approach to detect bot triggered DNS packets. User-intention based anomaly detection is an effective method in detecting abnormal system events. Usually DNS queries are automatically issued by applications or the OS, and this method can be used to
Related work can be grouped in two categories. First, several botnet C&C detection techniques have been analyzed. These methods can be classified as protocol dependent and protocol independent methods. Some of these methods are network based and some are host based.
A botnet is a network of computers which are infected with malware that allows remote controllers (botmaster) to control them. Once a computer is taken over by a bot, botmasters use these machines to carry out many illegal tasks. Botnets are one
Research work in [1] presents a host based detection mechanism for bot C&C traffic by analyzing suspicious flows created after filtering out normal traffic from traffic generated of the most hazardous species of security threat on the Internet nowadays. They can be used even to wage cyberwars between different countries. Therefore development of techniques to detect botnet is very important. on a host. The filtering is based on a normal profile of the traffic generated by the user on a host. The method is designed for detecting HTTP based botnets and cannot be used for detecting DNS based C&C.
Command and control (C&C) is the protocol used by bot and botmaster to communicate each other. Bots receive new attack commands from botmaster and submit stolen data to the C&C
[2] presents two approaches for identifying botnet C&C servers based on anomalous DDNS traffic. The first approach consists in looking for domain names whose query rates are abnormally server through the C&C channel. They use existing common protocols in protocol-conforming manners to make the detection of botnet C&C a challenging problem. Advances in malware research have challenged botnet operators to explore high or temporally concentrated. High DDNS query rates may be expected because botmasters frequently move C&C servers.
The second approach consists in looking for abnormally recurring DDNS replies indicating that the query is for an new stealthy communication mechanisms to evade detection.
Some recent botnets use DNS tunneling to add stealth to
C&C[4]. Using DNS tunneling data can be transmitted by using DNS packets. Because nearly all traffic requires DNS to inexistent name (NXDOMAIN). Such queries may correspond to bots trying to locate C&C servers. First approach generates many false positives and second approach is more effective. resolve domain names to IP addresses and back, simple firewall rules cannot easily be created without affecting legitimate traffic. Even in environments with heavily restricted
Internet access, DNS is usually one of the few protocols, that is
[3] Proposes self-healing system architecture designed to increase resiliency against botnets with minimal disruption to network services. This method checks all DNS host files and the networks inbound ports for any signs of botnet presence. allowed to pass without further checking. At present, there are only a few methods to detect DNS based botnet communication.
The method depends on the list of inbound ports. Also it cannot be used for detecting communication via DNS C&C.
Research work in [4] attempted to detect Botnet based on
NetFlow analysis and states that using NetFlow data solely, for the purposes of botnet detection, is not possible. The research presented in [5] proposes a system which monitors outbound packets from a host and compares with destination-based whitelists. The white-lists are generated by observing an uninfected PC. Although this is a straightforward technique, the detection can be done only during the non-operating time of the
PC.
Second group of related work contains approaches to detect botnets that use DNS as C&C. [6] is the first documentation on
DNS based botnet C&C traffic. They discovered and reverse engineered Feederbot, a botnet that uses DNS as carrier for its command and control. And also provide a technique that use machine learning and traffic analysis to detect this particular type of malware. The method is based on rdata features and behavioral communication features. K-Means clustering and a
Euclidean Distance based classifier are used to classify DNS transactions of malware samples concerning DNS-C&C usage.
Unlike this method our method does not need any previous knowledge of botnet traffic.
Greg Farnham reviews DNS tunneling utilities and discusses practical techniques for detecting DNS tunneling [7]. Two categories of detection considered are payload analysis and traffic analysis. The payload detection techniques have been used to detect successfully specific DNS tunneling utilities.
The traffic analysis based technique can be used to universally detect DNS tunneling. Payload analysis has more computational overheads and traffic analysis cannot be used for real time detections. [8] also presents deep packet inspection as method for detecting DNS based C&C. Proposed method does not require any type of analysis of data inside the packet and can be used for real time detection of DNS C&C communication.
III.
PROPOSED METHOD
The user-intention based anomaly detection approach is effective in detecting abnormal system events. The method is designed for detecting HTTP based anomalies. Because DNS queries are automatically issued by OS or applications, the causal relations between user actions and DNS traffic may not be obvious. This method can be extended to identify anomalous DNS traffic on a host.
A.
User-Intention Based Anomaly Detection
Semantic dependency between user events and traffic events is analyzed in this method for anomaly detection. Dependence analysis on network traffic creates a traffic dependency graph
(TDG)[9] based on the network events and user actions.
Traffic-dependency graph (TDG) is a forest of chronologically ordered trees in which directed edges representing the dependencies among network events and user actions. Each tree is rooted with a user event, and other nodes of the trees are traffic events. An edge from event a to b means event b is caused by a.
User event refers to the user inputs to the system through the input devices such as the keyboard or mouse, which have attributes such as timestamp and ID of the process notified by the event. Traffic event refers to an outgoing DNS packet from the host, which includes attributes such as the process ID, timestamp, source and destination IP addresses and port numbers etc. Subroot is a special type of traffic event which is directly triggered by a user event.
By analyzing dependencies between user events and traffic events, vagabond events can be identified. Vagabond events refer to outbound network events that are not triggered by any user actions and may hence be due to anomalies. Breadth-first search (BFS) based dependency inference algorithm [9] is used for this analysis.
B.
Architecture
Different modules of the system are Hook module, Traffic module, Process module and Causal Relation Analyzer. Hook module collect user events and corresponding process information. Traffic module and process module monitor network events and find corresponding process information.
Causal relation analyzer implements dependency inference the algorithm.
Fig. 1.
Fig. 1.
Architecture diagram
The hook module sets up system hooks in order to collect kernel-level user events to the applications. This module, registers the hooks to log keyboard and mouse events. This module captures the keyboard and mouse events and obtains timestamp of that event. Furthermore, the process ID of the current foreground window is obtained in order to find out the corresponding process for each user event and a root node is created in TDG corresponding to the user event. Repetitive user events that do not generate traffic such as mouse movements are ignored.
The traffic module filters the network packets to record outbound DNS packets. Packet information such as source IP, source port, destination IP, destination port, timestamp etc. can be directly obtained from the packet. To obtain referrer relation between DNS packets, HTTP packets are also captured and
mapping between DNS packets and HTTP packets is found.
The process module obtains network and system (namely process) information about active connections and associate the network connections with the corresponding process IDs. This information is used to find the process id of outgoing DNS packets.
Causal Relation Analyzer implements the dependency inference algorithm. Given a new request, dependency inference (DI) algorithm aims to identify its dependence with respect to the known requests. A forest structure is constructed to store the network requests and organize according to the definition of TDG. The requests with known dependencies are chronologically organized into trees rooted at user events in the existing TDG.
C.
Algorithm
Dependency inference algorithm [9] is a breadth-first search
(BFS) based algorithm. For a new request, dependency inference (DI) algorithm aims to identify its dependence with respect to the known requests. A forest structure is constructed to store the network requests. The algorithm utilizes three procedures namely Is_Child, Is_Sibling, and Is_Subroot.
Is_Child checks whether a given node is a child node of any other nodes on TDG. Traffic event P b
is a child of P a
, if and only if
The interval between P a
and P b
is within a threshold and event P a
proceeds P b
.
P a
and P b
share the same (non-null) process ID.
There is a referring relation between P b
and P a
.
Is_Sibling procedure is used for identifying the sibling relation of a request whose parent nodes cannot be directly identified.
Given two DNS packets P a
and P b
, where P a
’s parent node is known, and P a
proceeds P b
. P b
is a sibling node of P a
,if and only if
The interval between of P a
and P b
is within a threshold and event P a
proceeds P b
.
P a
and P b
share the same (non-null) process ID.
Both requests have same referring relation with P a
’s parent node.
If requests P a
and P b
are siblings, P a
’s parent is also the parent of P b
. Is_Sibling helps identify late-arriving child nodes.
Is_Subroot procedure is to identify the traffic events of the type subroot. A pre-set threshold value is defined [9] as 15 seconds by considering the semantic information in network requests.
Algorithm: Dependency Inference Procedure
Input: A newly-observed traffic event p*; a forest F of chronologically ordered trees of events rooted at user events, which are parents of subroots {T1, ..., Tm} , where subroot Tm is the most recent one; and a threshold ד.
Return: True, if the parent node of request p* is found; False, otherwise.
1: if Is_Subroot(p*) then
2: Tm+1 ←p*
3: Append Tm+1 to forest F and update
Tm+1.newestTimestamp ←p *.timestamp
4: return True
5: else
6: for i ← m to 1 do
7: create Queue Q and insert the subroot Ti onto Q
8: for Q is not empty do
9: node n ← dequeue (Q)
10: if n.pid ≠ p*.pid or p*.timestamp –
n.newestTimestamp > ד then
11: go to line 8
12: else if Is_Child(n, p*) then
13: p *.parent ← n and update the newest
Timestamp for nodes on the path from p* to its
subroot node
14: return True
15: else if Is_Sibling(n, p*) and !Is_Subroot(n) then
16: p*.parent← n.parent and update the newest
Timestamp for nodes on the path from p* to its
subroot node
17: return True
18: else
19: for all children of node n do
20: enqueue the child nodes onto Q
21: end for
22: end if
23: end for
24: end for
25: end if
26: return False
IV.
IMPLEMENTATION
Hook module sets up system hooks in order to collect kernellevel user events to the application. The module, using jnativehook API, registers the hooks to log keyboard and mouse events. Listeners are setup to capture the user events.
Timestamp and other information of these user events are also captured. Furthermore, the process ID of the current foreground window is also obtained, to find out the corresponding process for each user event. JNA API is used for finding process ID of current foreground window. Then for each user event a root is created in the traffic dependency graph. Repetitive user events that do not generate traffic such as mouse movements are ignored.
Traffic module implemented with the jpcap library filters the network packets to record outbound DNS packets. Packet information such as source IP, source port, destination IP, destination port and timestamp are obtained directly from packet header. Referrer relationship between DNS packets
cannot be obtained directly from packets. Therefore, to find referrer relations, HTTP packets are also captured in parallel.
Then a mapping is found out between DNS packets and HTTP packets. Mapping is based on the following factors:
Domain in the DNS packet payload and Host in
HTTP packet header should be same
Timestamp of HTTP packet is within the threshold compared to the DNS packet
By analyzing the traffic relations, a threshold of 15 seconds can be used to find the mapping between DNS and HTTP packets
[11]. This mapping is used to find the referrer relation between
DNS packets. Referrer url of an HTTP packet can be obtained from HTTP packet header. If a DNS packet has a corresponding HTTP packet and that HTTP packet has a referrer HTTP packet, then the referrer of the DNS packet is the corresponding DNS packet of that referrer HTTP packet.
If the captured DNS packet is an instance of UDP packet, process ID of the packet is same as the pid of its corresponding
HTTP packet. Process ID of HTTP packets can be obtained from process information collected in process module. If the captured DNS packet is an instance of TCP packet, process ID of the packet can be directly obtained using process module.
After finding all the packet and process information they are serialized and saved. This serialized data is given to the causal relation analyzer to detect anomalous DNS packets.
The process module obtains network and system (namely process) information about active connections. The module associates the network connections with the corresponding process IDs. A timer is setup to periodically retrieve the list of
TCP connections. This list contains process id and port number corresponding to active connections. This information is stored in a hash table and it is updated periodically. When a packet arrives the process id corresponding to that packet can be obtained from the hash table using source port number.
Causal Relation Analyzer implements the dependency inference algorithm. It is a breadth-first search (BFS) based algorithm and is used for the TDG construction. Given a new request, dependency inference (DI) algorithm aims to identify its dependence with respect to the known requests. A forest structure is constructed to store the network requests and organize them according to the definition of TDG. The requests with known dependencies are chronologically organized into trees rooted at user events in the existing TDG.
The module analyzes the traffic events collected using traffic and process module. For each packet, the algorithm searches for a parent node in the TDG. If found, a node will be created for that packet and it will be added to TDG. And it is a legitimate packet. If parent node is not found, it will be classified as anomalous. Automatic DNS traffic generated from the system that do not have corresponding user events can be handled using pre-defined white listing rules.
V.
EVALUATION
To evaluate the system different datasets containing normal user traffic and bot traffic on a PC were recorded. The scenario of bot traffic is modeled by creating an application that automatically generates DNS queries. Experiment is conducted on three host machines and traffic is collected. Following metrics are calculated on mixed traffic to estimate the goodness of the classification provided by the proposed algorithm.
Precision: Percentage of data samples which are actually positive out of the total number which are classified as positive by the system.
Recall (Detection Rate): Number of malware instances detected by the system out of total number of malware instances present in the test set.
Precision is used to measure the correctness of the detection.
Recall measures the completeness. Neither recall nor precision alone can judge the goodness of an anomaly detection method.
Therefore Accuracy, or F1 score which is the harmonic mean of precision and recall, is used to find the performances of the detection system. False alarm rate (FAR) is really 1-Precision.
TABLE I. R ESULTS
Dataset Tot. alarms
Dataset
1
387
Dataset
2
Dataset
3
207
194
TP
191
59
133
P
147
30
105
R
0.76
0.5
0.78
F1
0.78
0.78
0.8
FAR
0.24
0.5
0.22
Table 1 shows the results of evaluation. Fig.2 shows the precision and recall graph. A detailed look at the traffic reveals that false positives are caused due to the difficulty in finding process id of DNS packets. To resolve this problem, further study is required in this direction.
Fig. 2 Precision and Recall graph
VI.
CONCLUSION
Analyzing the dependencies between network traffic and user activities is an efficient approach for anomaly detection. The traffic dependency graph captures the causal relations of user actions and network events for improving host integrity. The method can be used for detecting DNS C&C by detecting automatically generated DNS packets. False positives are caused due to the difficulty in finding process id of DNS packets. Future works intend to overcome this limitation with a more accurate method.
R EFERENCES
[1] Soniya Balram and M. Wilscy, User Traffic Profile for Traffic
Reduction and Effective Bot C&C Detection, International Journal of
Network Security, Vol.16, No.1, PP.37-43, Jan. 2014
[2] Ricardo Villamarin-Salomon and Jos Carlos Brustoloni, Identifying
Botnets Using Anomaly Detection Techniques Applied to DNS Traffic,
IEEE CCNC 2008
[3] Adeeb Alhomoud and Irfan Awan, Jules Ferdinand Pagna Disso,
Muhammad Younas, A Next-Generation Approach to Combating
Botnets, IEEE Computer Society, April 2013
[4] Vojtech Krmicek, Inspecting DNS Flow Traffic for Purposes of Botnet
Detection, GEANT3 JRA2 T4 Internal Deliverable 2011
[5] Takemori K, Fujimino ; Nishigaki, M. ; Takami, T. ; Miyake, Y.,
Detection of Bot Infected PCs Using Destination-Based IP and Domain
Whitelists During a Non-Operating Term, Global Telecommunications
Conference, 2008. IEEE GLOBECOM 2008. IEEE
[6] Christian J Dietrich, Christian Rossow, Felix C Freiling, Herbert Bos,
Maarten van Steen and Norbert Pohlmann,On Botnets that use DNS for
Command and Control, Computer Network Defense (EC2ND), 2011
Seventh European Conference
[7] Greg Farnham, DNS Tunneling , SANS Institute InfoSec Reading Room
[8] Kui Xu, Patrick Butler, Sudip Saha, and Danfeng (Daphne) Yao, DNS or Massive-Scale Command and Control, IEEE TRANSACTIONS ON
DEPENDABLE AND SECURE COMPUTING,VOL. 10, NO. 3,
MAY/JUNE 2013
[9] Hao Zhang,William Banick, Danfeng Yao, Naren Ramakrishnan, User
Intention-Based Traffic Dependence Analysis for Anomaly Detection,
IEEE CS security and Privacy Workshop
[10] Courtland Smith,Role of DNS in botnet command and control,
OpenDNS Security white paper, August 3, 2012
[11] Hao Zhang, Danfeng Yao, Naren Ramakrishnan, POSTER: A
Semanticaware Approach to Reasoning about Network Traffic
Relations, ACM