International Journal of Innovative Computing, Information and Control
Volume 7, Number 7(B), July 2011, pp. 4347–4362
© 2011 ICIC International
ISSN 1349-4198
WEB MIMICRY ATTACK DETECTION USING HTTP TOKEN
CAUSAL CORRELATION
Ching-Hao Mao 1, Hahn-Ming Lee 1,2, En-Si Liu 1 and Kuo-Ping Wu 1

1 Department of Information Science and Information Engineering
National Taiwan University of Science and Technology
No. 43, Sec. 4, Keelung Rd., Taipei 106, Taiwan
{d9415004; hmlee; m9615046; wgb}@mail.ntust.edu.tw

2 Research Center for Information Technology Innovation
Academia Sinica
128 Academia Road, Section 2, Nankang, Taipei 115, Taiwan
Received March 2010; revised November 2010
Abstract. Web mimicry attacks evade anomaly-based web intrusion detection systems (WIDSs) by inserting non-functional characters. In this study, we propose a web mimicry anomaly detection method that uses HTTP token causal correlation. The proposed method extracts token sequences from web application HTTP requests and models the token correlation based on conditional random fields (CRFs) in order to identify web mimicry attacks. The CRF model is widely used for solving sequence labeling problems and is suitable for capturing the dependency among different tokens in a token sequence. Since CRFs relax the strong independence assumptions made by other probabilistic sequence analysis methods (e.g., the hidden Markov model), they can capture long-term dependencies among the observed sequences of web tokens, improving detection capability by exposing significant attack patterns. The proposed method requires only HTTP request information and can be easily plugged into existing intrusion detection systems. Two datasets, one from the "ECML/PKDD 2007 Analyzing Web Traffic Challenge" and one of real-world HTTP traffic extracted from a private telecom company, are used for evaluation. The experiment results show that the proposed system performs well in detecting both web mimicry attacks and general web application attacks, even in heavily intersecting cases.
Keywords: Web security, Mimicry attack, Conditional random field, Cross-site scripting, SQL injection
1. Introduction. Since many economic activities such as shopping and cyber banking
can be performed through web applications, web applications attract the attention of
hackers who break into computer systems with the aim of obtaining valuable information
[1,23]. The anomaly detection paradigm, which models normal behavior, is widely used
in protecting web application services and particularly useful for detecting novel and
unknown attacks. Attackers thus try to hide their malicious behavior by mimicking
normal behavior to deceive an anomaly-based web intrusion detection system (WIDS). This type of attack behavior results in a high false negative rate for an anomaly-based WIDS, making it difficult to identify known attacks. The term "mimicry attack"
was first introduced by Wagner et al. [21] to describe attacks that allow a sophisticated
attacker to craft malicious actions to evade a detection mechanism. Mimicry attacks also
exist at the network traffic application level, involving malicious web application behavior
used to evade WIDSs [12,15]. In this study, we use the term web mimicry attacks to
denote mimicry attacks involving the network traffic at a web application. Similar to
host or network mimicry attacks, web mimicry attacks usually use evasion techniques to
match a normal-appearing behavior model, so that the injected malicious code manifests as normal-looking web scripts from a statistical or probabilistic viewpoint.
Although the concept of web mimicry attacks is an extension of network/host mimicry
attacks, web mimicry attacks differ from those targeting host-based [21] or payload-based [13] anomaly intrusion detection systems. These differences are the result
of the high variations in web environments. Web application traffic contains dynamic and
diverse information involving various languages, libraries, scripts and human concepts.
The high variation in web traffic makes it more difficult to build a normal behavior model.
Attackers can exploit the high variation in network environments to craft sophisticated
web mimicry attacks that appear to be normal web scripts. Consequently, a web anomaly
IDS requires a high variation tolerance mechanism to avoid being easily deceived by web
mimicry attacks.
Web mimicry attacks usually utilize several evasion techniques to transform common
attacks into more sophisticated attacks to evade detection. The three major types of
evasion techniques are as follows:
1. Substitution, which replaces an attack with another that is functionally equivalent, for example, utilizing URL encoding to replace "' or" with "%27%6F%72".
2. Interchange, which replaces an attack with any valid permutation of the attack. For example, in an SQL injection attack, the string "' or 4=4 --" is equivalent to "' or 5>4 --".
3. Insertion, which inserts normal or meaningless characters into an attack in order to make the attack match the normal model of the anomaly detection system, as shown in Figure 1.
Figure 1. Demonstration of the insertion pattern produced in mimicry attacks: an original pattern A B C D becomes A B X Y Z C D after insertion (the inserted tokens X, Y, Z mimic normal behavior to evade anomaly detection).
Among the above three evasion techniques, the insertion-type evasion technique poses the most intractable problem for both signature-based and anomaly-based detection mechanisms. Therefore, detecting insertion-type attacks could be helpful for identifying web mimicry attacks.
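To make these evasion techniques concrete, the following Python sketch applies URL-encoding substitution and random insertion of benign padding to a toy SQL injection string. The helper names, the padding tokens and the example payload are illustrative assumptions, not material from the paper.

```python
import random
import urllib.parse

ATTACK = "' or 1=1 --"          # toy SQL injection fragment (illustrative)

def substitute(payload: str) -> str:
    """Substitution: URL-encode every character so the literal text changes
    while the decoded payload stays functionally equivalent."""
    return "".join("%{:02X}".format(b) for b in payload.encode("ascii"))

def insert(payload: str, pad_ratio: float = 1.0) -> str:
    """Insertion: scatter meaningless-but-benign tokens into the payload so it
    looks statistically closer to normal traffic (pad_ratio controls how much
    padding is added relative to the payload length)."""
    pads = ["/**/", "+", "%20"]                      # assumed padding tokens
    chars = list(payload)
    for _ in range(int(len(payload) * pad_ratio)):
        chars.insert(random.randrange(len(chars) + 1), random.choice(pads))
    return "".join(chars)

if __name__ == "__main__":
    print("original:    ", ATTACK)
    print("substitution:", substitute(ATTACK))
    print("insertion:   ", insert(ATTACK, pad_ratio=0.5))
    # urllib.parse.unquote recovers the original text from the substituted form
    print("decoded back:", urllib.parse.unquote(substitute(ATTACK)))
```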
Many anomaly detection methods have been proposed to protect web applications,
such as N-gram [22], deterministic finite automaton (DFA) [12], Markov model [15,20] and hidden Markov model (HMM) [8]. Raju and Patnaik [7] give a comprehensive survey related
to the sequential pattern and clustering pattern for Web usage data. Ingham and Inoue
[11] proposed a framework for testing conventional web anomaly IDS algorithms and
showed the serious limitations of all approaches for detecting web mimicry attacks. They
proposed two important criteria for preventing web mimicry attacks: (1) a token-based
parsing mechanism and (2) robust token correlation modeling. The first criterion implies
that token-based algorithms should be used instead of character-based algorithms for
modeling the high-level structure in valid web application HTTP requests. The second
criterion shows that the relationship among the tokens is important for capturing the
semantics of valid HTTP requests. Conventional anomaly detectors do not usually apply
WEB MIMICRY ATTACK DETECTION
4349
both of these criteria and, therefore, the anomaly scores are skewed by the insertion of
normal or irrelevant characters. Because these methods do not perfectly consider the
token correlation of the sequence structure in HTTP requests, a normal model built by
a conventional approach cannot adequately describe the normal behavior of valid HTTP
requests. Table 1 shows whether conventional methods can detect the three types of
evasion techniques. It should be noted that none of these methods address the insertion
issue.
Table 1. Conventional methods for web anomaly detection systems

Method               Substitution  Interchange  Insertion  Limitation
Token-based N-gram   O             O            X          Weakness of similarity measure
DFA                  O             O            X          Weakness of similarity measure
HMM                  O             O            X          Strict independence assumption
Markov model         O             O            X          High false positive rate
To detect web mimicry attacks, we must overcome the high variation of web traffic when modeling the token structures of web scripts in HTTP requests. Therefore, we propose a mechanism that applies conditional random fields (CRFs) to model the token correlation of the sequence structures in HTTP requests in order to detect insertion-type web mimicry attacks. HTTP requests are transformed into token sequences for profiling the proximity token correlation based on CRFs, which build probabilistic models [16]. In addition to using CRFs to model the token correlation of HTTP requests, the proposed method also recognizes subtle web mimicry attacks that include the insertion of normal-looking or meaningless characters. One advantage of CRFs is that they can practically represent long-range dependencies among the tokens in a token sequence. Another advantage is that labeling the current token depends not only on the adjacent tokens, but also on the whole web token sequence. Our approach can effectively identify web mimicry attacks on the basis of CRFs, and is also effective for different insertion ratios and for general web attacks. The experiment results indicate that modeling the correlation of HTTP requests based on CRFs is suitable for detecting web mimicry attacks.
We summarize the main contributions of the proposed approach as follows:
1. It exploits the advantages of CRFs to detect insertion-type web mimicry attacks. Furthermore, our approach is also effective in identifying general web attacks and attacks with different insertion ratios.
2. Instead of building huge attack pattern databases, our approach can automatically build a probabilistic model to identify web mimicry attacks.
3. The proposed method can be easily plugged into existing intrusion detection systems for identifying subtle web attacks.
The remainder of this paper is organized as follows: Section 2 describes the architecture of the proposed method; Section 3 introduces the performance evaluation of the proposed approach; Section 4 discusses computation and deployment issues; finally, Section 5 presents the conclusion.
2. Problem Statement and Preliminaries. Token-based methods (e.g., DFA) are
widely applied in web application attack detection, but most token-based methods do not adequately model token correlation in the presence of insertion. In this study, we propose capturing richer token correlations to distinguish valid and malicious web application HTTP
requests. The proposed method extracts the token sequence from HTTP requests and
models the token correlation of token sequences based on CRFs in order to identify web
mimicry attacks, especially insertion cases. As shown in Figure 2, the system architecture
is composed of three modules: the Token Sequence Extractor, CRF-based Token Correlator and Web Mimicry Attack Detector. The Token Sequence Extractor is responsible
for transforming HTTP requests into token sequences. The CRF-based Token Correlator
uses CRFs to extract the proximity token correlation from labeled token sequences and to build the CRF model. The Web Mimicry Attack Detector is used to determine whether an
incoming token sequence is normal or is an attack.
Figure 2. Proposed system architecture. In the training phase, expert-labeled HTTP requests are processed by the Token Sequence Extractor and the CRFs-based Token Correlator to produce a CRFs parameters profile; in the testing phase, new HTTP requests are tokenized and the Web Mimicry Attacks Detector decides whether each request is an attack. HTTP requests are considered as token sequences, and the token correlation is modeled through conditional random fields in order to identify web mimicry attacks.
Additionally, the system utilizes a data source and an intermediate profile, called the labeled HTTP requests and the CRF parameter profile, for the construction of the CRF-based token correlation model. The labeled HTTP requests are the HTTP requests classified by domain experts in advance. The CRF parameter profile records the parameters of the CRF model and is used by the Web Mimicry Attack Detector to determine the labels of predicted token sequences.
2.1. Token sequence extractor. The Token Sequence Extractor is used to convert an
HTTP request into a token sequence in order to model the token correlation. The conventional method [12] usually extracts tokens according to the HTTP RFC standard [5]. It is
difficult to parse HTTP traffic in a straightforward way because most web browsers and
web robots do not fully follow the RFC standard. In order to overcome these problems,
we utilize a special symbolic HTTP parsing mechanism to deal with complicated HTTP
requests. The high variation characteristic of HTTP traffic makes modeling the normal behavior of HTTP requests more difficult. The elimination of this high variability
in token sequences is an efficient way to model the potentially meaningful token correlation, as well as gain insight into the semantic meaning of an HTTP request. The tokens
contain words (letters and numbers) and special symbols. Algorithm 1 accepts HTTP
requests as inputs and generates relative token sequences as outputs. An HTTP request
consists of a series of single characters, which are sequentially extracted from the HTTP
request. Special symbols are used in an HTTP request to delimit the token boundary,
and these special symbols are output as tokens. Continuous letters and numbers without
special symbols are output as a token. The detailed pseudo code is described in Algorithm
1. Figure 3 shows an example of token sequence extraction. High variability generalizing
and obfuscated character decoding shown in Algorithm 1 are next described in detail.
Figure 3. Example of token sequence extractor. It segments an HTTP
request into a token sequence in order to find the token correlation.
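Algorithm 1 is not reproduced here, but the following Python sketch illustrates the tokenization it describes: runs of letters and digits become single tokens, while every special symbol delimits a boundary and is emitted as a token of its own. The function name and the regular expression are assumptions for illustration, not the authors' implementation.

```python
import re

def extract_token_sequence(http_request: str) -> list:
    """Split an HTTP request string into a token sequence.

    Runs of letters/digits form one token; every other (special) character
    acts as a delimiter and is emitted as its own token, mirroring the
    behavior described for Algorithm 1. Whitespace is dropped.
    """
    return [m.group(0)
            for m in re.finditer(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", http_request)]

if __name__ == "__main__":
    request = "GET /index.php?id=1%27+or+1%3D1 HTTP/1.1"
    print(extract_token_sequence(request))
    # ['GET', '/', 'index', '.', 'php', '?', 'id', '=', '1', '%', '27',
    #  '+', 'or', '+', '1', '%', '3D1', 'HTTP', '/', '1', '.', '1']
```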
2.1.1. High variability generalizing. High variability generalizing is used to reduce the
high variability in HTTP requests in order to extract significant tokens from token sequences. Ingham et al. [12] mentioned that high variability universally exists in HTTP
requests, because HTTP traffic contains not only protocols and parameters but also various languages, libraries, scripts and human concepts. In an anomaly detection system,
this high variability makes it difficult to profile normal behavior from normal HTTP requests. Ingham et al. [12] proposed parsing heuristics to generalize the high variability in
HTTP requests. We eliminate or transform the non-meaningful or non-informative tokens
in token sequences to reduce the impact of variability through the use of several parsing heuristics, listed below (a small code sketch after the list illustrates them):
• All upper case letters in HTTP requests are rewritten as lower case letters [12], e.g., "HTTp://wWw.GooGLe.COM" is transformed into "http://www.google.com".
• IP addresses in HTTP requests are replaced with a placeholder string (e.g., "IP"), e.g., "118.169.1.1" is transformed into "IP".
• Dates in HTTP requests are parsed and replaced with a placeholder string (e.g., "TIME"), e.g., "2009/9/5" is transformed into "TIME".
• Since hash values in literal strings are mostly meaningless, we remove them from HTTP requests, e.g., "hash value:0934394302430" is transformed into "hash".
• Redundant characters remover: this removes the redundant characters that are inserted in attack strings without making the attacks invalid, e.g., "or+++1+++=+++1" is transformed into "or 1 = 1".
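A minimal Python sketch of these generalization heuristics is given below; the placeholder strings ("IP", "TIME", "hash") follow the examples above, but the exact regular expressions are illustrative assumptions rather than the authors' parsing rules.

```python
import re

def generalize(request: str) -> str:
    """Apply the high-variability generalization heuristics to one HTTP request."""
    s = request.lower()                                      # case folding
    s = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "IP", s)      # IP addresses -> "IP"
    s = re.sub(r"\b\d{4}/\d{1,2}/\d{1,2}\b", "TIME", s)      # dates -> "TIME"
    s = re.sub(r"\b\d{10,}\b", "hash", s)                    # long digit strings -> "hash"
    s = re.sub(r"\++", " ", s)                               # collapse redundant '+' padding
    return s

if __name__ == "__main__":
    print(generalize("HTTp://wWw.GooGLe.COM/q?d=2009/9/5&or+++1+++=+++1"))
    # -> http://www.google.com/q?d=TIME&or 1 = 1
```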
2.1.2. Obfuscated characters decoding. Obfuscated characters decoding is used to decode
characters that have been obfuscated by encoding techniques, such as URL Encoding and
UTF-8 or UTF-16 encoding techniques. Even though encoding techniques are often used
in a legitimate way, they can also be applied to craft web mimicry attacks. Hackers utilize
these encoding techniques to obfuscate malicious HTTP requests as normal-like HTTP
requests, while retaining the intention of the obfuscated malicious HTTP requests. However, the obfuscated characters might have potential meaning or functions behind the
tokens. URL encoding and unicode encoding are two types of encoding techniques widely
applied to web applications to encode/decode strings in a web page. Putting raw text in a query string can produce an ambiguous URL. For example, when a browser sends an HTTP request with form data to a web server, spaces are usually replaced with the "+" character, and unsafe ASCII characters are transformed into two-digit hexadecimal codes prefixed with a "%" character. The Unicode encoding scheme gathers the major languages spoken today into a single unified code space; Unicode provides a unique number for every character across operating system platforms. Attackers can utilize the Unicode encoding technique to replace malicious text
with Unicode to hide their malicious intention. Therefore, it is necessary to decode the
obfuscated characters to capture their semantic meaning.
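As an illustration of obfuscated-character decoding, the sketch below uses Python's standard urllib.parse.unquote_plus to reverse URL encoding (including the "+" convention for spaces). The repeated-decoding loop and the example payload are assumptions; handling of UTF-16 and other schemes is omitted.

```python
from urllib.parse import unquote_plus

def decode_obfuscated(request: str, max_rounds: int = 3) -> str:
    """Repeatedly URL-decode a request until it stops changing, to unwrap
    payloads that have been encoded more than once."""
    for _ in range(max_rounds):
        decoded = unquote_plus(request)
        if decoded == request:
            break
        request = decoded
    return request

if __name__ == "__main__":
    obfuscated = "id=1%27+%6Fr+1%3D1"
    print(decode_obfuscated(obfuscated))   # -> id=1' or 1=1
```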
2.2. CRFs-based token correlator. The CRF-based Token Correlator utilizes CRFs
to construct a CRF-based correlation model that represents the token correlation of the
token sequences. This module accepts the labeled token sequences as input from the
Token Sequence Extractor. Let X be a random variable over the token sequences to be labeled, and let Y be a random variable over the corresponding label sequences. There are two advantages to using CRFs to model the token correlation of web script sequences. First, CRFs form a discriminative model that directly models the conditional distribution p(Y | X) instead of the joint distribution p(Y, X), as a generative model does. The primary limitation of a generative model is that it requires modeling the marginal distribution
p(X). It is very difficult to estimate the marginal distribution because there is usually
a high dependency among the different tokens in token sequences X. This means that
when modeling p(X), it is necessary to enumerate all the possible token sequences. Yet,
often the amount of training data is insufficient. Second, CRFs determine the label of
the current token based on the past and future tokens instead of merely considering the
current token or previous tokens. There are two components in this module, called the
Feature Template Configurator and CRF Model Constructor.
2.2.1. Feature template configurator. The Feature Template Configurator sets the feature template for constructing the CRF-based token correlation model according to expert knowledge. A feature template describes the feature functions that are used to train the CRF model. A feature function defines the kind of correlation we want to capture, such as the relation between neighboring tokens or the relation between two tokens at a given distance. In this paper, we mainly capture the correlation between neighboring tokens in token sequences to construct the CRF model. Figure 4 displays the features used in the feature template configuration, and Figure 5 shows how the template defines a regular format for training and testing data.
Figure 4. Features used in template configuration. The example tokenizes the SQL injection string "' or 1 = 1 - -"; each token x_i is paired with an expanded feature (e.g., "or" with "Key", "1" with "Var") and the label y_i = SQL injection, and the expanded features reference neighboring positions such as [-2,0], [-1,0], [0,0], [+1,0] and [+2,0] in the token and expanded-feature columns.
Figure 5. Example of feature template for training by CRFs. The template represents the proximity relations for the training and testing data: for the input "' or 1 = 1 - -", each position i pairs the token x_i and its expanded feature with window references from [-3,0] to [+3,0] and [-3,1] to [+3,1], and every position carries the label y_i = SQL injection.
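The Python sketch below shows one plausible way to realize such a window-based feature template: for each position it collects the token and its expanded feature at relative offsets, in the spirit of the [-2,0] ... [+2,1] references in Figures 4 and 5. The offsets, feature names and the token2features helper are illustrative assumptions; the paper itself configures templates for CRF++.

```python
def token2features(tokens, expanded, i, window=2):
    """Build a feature dict for position i from the token column (index 0)
    and the expanded-feature column (index 1) within +/- `window` positions."""
    feats = {"bias": 1.0}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"tok[{offset:+d},0]"] = tokens[j]     # e.g. tok[-2,0] = 'or'
            feats[f"tok[{offset:+d},1]"] = expanded[j]   # e.g. tok[-2,1] = 'Key'
    return feats

def sequence2features(tokens, expanded, window=2):
    return [token2features(tokens, expanded, i, window) for i in range(len(tokens))]

if __name__ == "__main__":
    tokens   = ["'", "or", "1", "=", "1", "-", "-"]
    expanded = ["'", "Key", "Var", "=", "Var", "-", "-"]
    print(sequence2features(tokens, expanded)[3])   # features for the '=' token
```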
2.2.2. CRF model constructor. This component is capable of constructing a CRF-based
token correlation model and estimating the essential parameters of this CRF model. Given
a token sequence whose length is n, a random variable over the token sequence is defined
as X = (x_1, x_2, ..., x_n), and a random variable over the corresponding label sequence is denoted as Y = (y_1, y_2, ..., y_n). CRFs provide a probabilistic graphical model for calculating the probability of Y globally conditioned on X, as shown in Lafferty et al. [16]. We first define G = (V, E) as a graph composed of vertices V and edges E such that Y = (Y_v)_{v ∈ V}, i.e., Y is indexed by the vertices of the graph G; then (X, Y) is a CRF. The random variable Y_v obeys the Markov property with respect to the graph when conditioned on X: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G. In our proposed approach, we use simple sequence modeling. That is, the joint distribution over the label sequence Y given X has
the form:
p_θ(y|x) ∝ exp( ∑_{e∈E,k} λ_k f_k(e, y|_e, x) + ∑_{v∈V,k} μ_k g_k(v, y|_v, x) ),        (1)
where x is the token sequence, y is the corresponding label sequence, and y|_S denotes the set of components of y associated with the vertices or edges of the subgraph S. The features f_k and g_k are given in advance and fixed. To construct the CRF model, we need to estimate the parameters θ = (λ_1, λ_2, ...; μ_1, μ_2, ...) from the training data D = {(x^(i), y^(i))}_{i=1}^{N} according to the empirical distribution p̃(x, y).
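As a concrete illustration of this training step, the sketch below fits a linear-chain CRF on toy labeled token sequences using the third-party sklearn-crfsuite package. This is a stand-in for the CRF++ toolkit actually used in the paper, and the window features, regularization values and label names are assumptions.

```python
# pip install sklearn-crfsuite
import sklearn_crfsuite

def feats(tokens, i, window=2):
    """Minimal window features over the token column only (assumed design)."""
    d = {"bias": 1.0}
    for off in range(-window, window + 1):
        j = i + off
        if 0 <= j < len(tokens):
            d[f"tok[{off:+d}]"] = tokens[j]
    return d

def seq2feats(tokens):
    return [feats(tokens, i) for i in range(len(tokens))]

# Toy labeled sequences (normally produced by the Token Sequence Extractor).
train_tokens = [
    ["'", "or", "1", "=", "1", "-", "-"],
    ["get", "/", "index", ".", "html"],
]
train_labels = [
    ["SQLInjection"] * 7,
    ["Valid"] * 5,
]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",        # L-BFGS optimization of the penalized log-likelihood
    c1=0.1, c2=0.1,           # L1/L2 regularization strengths (assumed values)
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit([seq2feats(t) for t in train_tokens], train_labels)

# Viterbi decoding of a new token sequence, plus per-token marginals that can
# serve as the confidence scores used later by the Web Mimicry Attack Detector.
test = [seq2feats(["'", "or", "2", ">", "1", "-", "-"])]
print(crf.predict(test))             # most likely label sequence
print(crf.predict_marginals(test))   # per-token label probabilities
```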
2.3. Web mimicry attacks detector. The goal of the Web Mimicry Attacks Detector
is to label each token of a token sequence as normal or attack, with a confidence score of
prediction given to each label based on the CRF-based token correlation model. Then,
we can determine whether a token sequence is an attack or not according to the total
confidence score. This module accepts the predicted token sequences from the Token
Sequence Extractor as input. This module contains two components: the CRF-based
Token Sequence Labeler and Mimicry Attack Identifier.
2.3.1. Token sequence labeler. The CRF-based Token Sequence Labeler is used to label each token of a predicted token sequence as normal or attack, with a confidence score, based on the CRF-based token correlation model. This component uses the CRF-based token correlation model to determine the label sequence Y corresponding to the predicted token sequence X with the highest probability via the Viterbi algorithm (MAP decoding) [6]. At the same time, the CRF-based token correlation model estimates the likelihood of each label for a predicted token; the label with the maximum likelihood becomes the label of that token, with its likelihood serving as the confidence score.
2.3.2. Mimicry attacks identifier. The Mimicry Attack Identifier accepts both token sequences and the corresponding confidence scores as input for determining whether the
incoming HTTP request is normal or an attack. Since the CRF-based Token Correlator gives a confidence score for each category (normal and attacks), we propose an attack
identifying mechanism that uses the confidence score for categorizing the incoming HTTP
token sequence. This mechanism determines the label with the maximum total confidence
score. The notations are described as follows, and the pseudo code is shown in Algorithm
2.
Notations. Given the label sequence S = (s_1, s_2, ..., s_|S|) corresponding to the predicted token sequence, |S| is the length of the label sequence. c_i represents the confidence score of the i-th label in the label sequence. There are |A| types of attack labels in the label sequence, denoted (a_1, a_2, ..., a_|A|). C_k represents the sum of the confidence scores of the k-th attack label in the label sequence.
In Algorithm 2, the input is a token sequence from the Token Sequence Labeler, and
each token of the token sequence has a corresponding label and confidence score. The
output would determine the label for the entire token sequence. We utilize the confidence
scores of the label sequence to calculate the total confidence scores of the labels that
appear in the label sequence, and then determine the relevant label as the one with the
maximum total confidence score. If all the labels for each token in a token sequence are
normal, the entire token sequence under the HTTP request could be regarded as normal.
Otherwise, we calculate all the total confidence scores of the attack labels that exist in
the token sequence. To get the total confidence score for an attack label, we sum the
confidence scores of each type of attack label in the token sequence. Finally, we set the
label for the entire token sequence as the attack label whose total confidence score is the
maximum of all the attack labels in the token sequence. Figure 6 gives an example for
Algorithm 2.
Figure 6. Two cases of the decision mechanism in Algorithm 2 for determining the label of an entire token sequence. (a) If the labels of all tokens are normal (N), the label of the entire token sequence is NORMAL. (b) Otherwise, the label of the entire token sequence is the attack label whose total confidence score is maximal; in the example, tokens labeled A1 have confidence scores f1 and f2, tokens labeled A2 have f3 and f4, and since f1 + f2 < f3 + f4 the sequence is labeled A2.
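The following Python sketch captures the decision rule of Algorithm 2 as described above: a sequence is normal only if every token is labeled normal; otherwise it receives the attack label with the largest total confidence score. The data layout and the label names are assumptions, since the paper presents the algorithm only as pseudo code.

```python
from collections import defaultdict

NORMAL = "Valid"   # assumed name for the normal label

def label_sequence(labels, confidences):
    """Decide the label of a whole token sequence from per-token labels and
    confidence scores: normal only if every token is normal, otherwise the
    attack label with the largest total confidence score."""
    if all(lab == NORMAL for lab in labels):
        return NORMAL
    totals = defaultdict(float)
    for lab, conf in zip(labels, confidences):
        if lab != NORMAL:
            totals[lab] += conf          # C_k for each attack label a_k
    return max(totals, key=totals.get)

if __name__ == "__main__":
    labels      = ["Valid", "A1", "A1", "A2", "A2", "Valid"]
    confidences = [0.90, 0.30, 0.40, 0.45, 0.50, 0.85]
    print(label_sequence(labels, confidences))   # "A2", since 0.45+0.50 > 0.30+0.40
```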
3. Experiment Design and Results. This section discusses four experiments that
were performed to evaluate the effectiveness of our proposed approach for identifying
web mimicry attacks. We found that in insertion web mimicry attacks with different
insertion ratios and general web attacks, the proposed approach is effective and better than
baseline methods. We also compared the proposed method with some baseline methods
such as TFIDF-naı̈ve Bayes (TFIDF-NB) [3,10] and the hidden Markov model (HMM)
[17] to analyze the strength of the proposed method from the perspective of its profiling
capability. The three goals for the experiments used for evaluating the performance
of our proposed approach are described as follows. First, we intended to evaluate the
performance on insertion based web mimicry attacks. Meanwhile, different insertion ratios
were analyzed to determine whether they would be helpful for evading detection under
our approach. Finally, we demonstrated that the proposed approach was not only helpful
for insertion based web mimicry attacks but also for general web attacks.
3.1. Datasets. Two datasets were used for evaluations in the experiments: PKDD07
and a dataset obtained from Chunghwa Telecom Laboratories 1 . First, the performance of
our proposed approach when detecting PKDD 2007’s “Analyzing Web Traffic challenge”
(PKDD07) dataset [4] was evaluated. The PKDD 2007 dataset extracted from HTTP
query logs is composed of 50000 pre-classified examples in a valid (normal query) category
and 7 attack categories. Each example in the dataset is completely independent of the
others and has a unique id, context (describes what environment the query runs in), class
(the category for the sample as classified by an expert), and the content of the query
itself. Additionally, the important element “attackInterval” indicates the location of the
attack sequences. There are eight types of labels in the PKDD07 dataset including one
valid and seven types of attacks: (1) normal query (Valid), (2) cross-site scripting (XSS),
(3) SQL injection (SQL), (4) LDAP injection (LDAP), (5) XPATH injection (XPATH),
(6) path traversal (PATH), (7) command execution (OSC) and (8) SSI attacks (SSI).
1 Chunghwa Telecom, http://www.cht.com.tw/
The PKDD07 dataset was divided, with 70% (24504 instances) used as a training set and 30% (10502 instances) as a testing set. The training set was used as learning data to construct our
proposed model. The testing set was used to generate three types of test datasets D1,
D2 and D3 for effectiveness evaluations. The data in D3 was extracted from PKDD07
to represent general web attacks. D1 contained samples that were identified as insertion
web mimicry attacks by domain knowledge from D3. D2 contained samples whose class
was SQL Injection from D1 and were crafted as insertion-type SQL Injection attacks by
utilizing different insertion ratios. Table 2 shows the distribution of each dataset in our
experiments.
Table 2. Distribution of datasets (D1 is the number of testing instances and D2 is the number of insertion attacks) (LDAP: LdapInjection, OSC: OSCommand, PATH: PathTraversal, SQL: SqlInjection, XPATH: XPathInjection)

Type    Total Number  Training Number  (D1)   (D2)
Valid   35006         24504            10502  0
LDAP    2279          1596             683    0
OSC     2254          1578             676    1036
PATH    2262          1584             678    1764
SQL     2015          1411             604    238
SSI     16            12               4      0
XPATH   2279          1596             683    0
XSS     1825          1278             547    0
The dataset from Chunghwa Telecom Laboratories was obtained from the access logs of the web servers of a telecom company, Chunghwa Telecom Laboratories. We took the ordinary access logs as the normal training and testing datasets. The access logs of XSS attacks were generated using tools and websites that provide XSS attack strings. For example, CAL9000 [18] is an OWASP project that provides a collection of web application security testing tools; it provided the flexibility and functionality we needed to generate the XSS attacks. In addition, RSnake [19] provides an XSS cheat sheet, which contains numerous XSS attack strings. We wrote some web pages, launched the XSS attacks, and collected the access logs of the XSS attacks.
3.2. Performance measurements and tools. A confusion matrix [14] was used to measure precision and recall in our experiments. A confusion matrix contains the classification results relating the actual and predicted outcomes of a classification model. In our experiments, the CRF toolkit CRF++ [2] was utilized to implement the proposed approach. For comparison with other methods, we utilized the open source WEKA toolkit [9], written in Java, and an HMM toolbox [17] based on Matlab to evaluate the performance.
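As a reminder of how precision and recall are derived from a confusion matrix, the short sketch below computes both for one class from parallel lists of actual and predicted labels; it is a generic illustration, not the evaluation script used in the paper.

```python
from collections import Counter

def precision_recall(actual, predicted, positive):
    """Per-class precision and recall from parallel lists of labels."""
    pairs = Counter(zip(actual, predicted))
    tp = pairs[(positive, positive)]
    fp = sum(n for (a, p), n in pairs.items() if p == positive and a != positive)
    fn = sum(n for (a, p), n in pairs.items() if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

if __name__ == "__main__":
    actual    = ["SQL", "SQL", "Valid", "Valid", "SQL"]
    predicted = ["SQL", "Valid", "Valid", "SQL", "SQL"]
    print(precision_recall(actual, predicted, "SQL"))   # (0.666..., 0.666...)
```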
3.3. Experiment results. Four experiment results on detecting web mimicry attacks and a comparison with baseline methods are shown in this section. The first experiment evaluated the accuracy of identifying insertion web mimicry attacks. The second
experiment showed that the proposed approach can identify web mimicry attacks with
different insertion ratios. The third experiment showed that the proposed approach was
also effective for general web attacks. The fourth experiment demonstrated the experiment
result for the Chunghwa Telecom dataset.
3.3.1. Effectiveness analysis. This experiment compared the proposed approach with TFIDF-NB and HMM on the insertion web mimicry attack dataset D1. Figure 7 shows the
precision and recall rates of our proposed approach and the other methods on the insertion
web mimicry attack dataset. Our approach is obviously effective for identifying insertion
mimicry attacks and better than the other methods. The experiment results show that
using CRF to model the sequence structure of HTTP requests is helpful for identifying
insertion mimicry attacks, especially in OSCommand, PathTraversal and SQLInjection
with more complex attack mechanisms.
Figure 7. Effectiveness analysis of identifying insertion web mimicry attacks: (a) precision and (b) recall on the OSCommand, PathTraversal and SQLInjection classes. Our proposed method performed better than the other two baseline approaches (i.e., naïve Bayes and hidden Markov model) in different types of web application attacks.
3.3.2. Tolerance analysis for insertion. The goal of this experiment was to show the tolerance of our method for different insertion ratios using the data from D2. We randomly
inserted normal characters of different ratios into attack strings extracted from the SQL
Injection samples of web mimicry attacks from D1.
The insertion ratio is the length proportion of the inserted string to the original attack string. We verified the performance for three insertion ratios. WEKA and the HMM toolbox were used to obtain the TFIDF-NB and HMM results. Figure 8 shows the recall rate of the proposed approach and the other methods on D2, which contained different insertion ratios. Based on the experimental results, we can reach two conclusions. First, our approach maintains higher performance than the other methods in both precision and recall. Second, the performance decline of the proposed approach was smaller than that of the other methods, which means that high-ratio insertion had a smaller impact on the proposed approach than it did on the others.
Figure 8. Recall rate of identifying web mimicry attacks, where meaningless characters were inserted at different ratios (pad1, pad5 and pad10).
3.3.3. Performance analysis in general web attacks. In this experiment, we intended to
show that the proposed approach was not only effective at identifying insertion web
mimicry attacks but also general web attacks using D3. We also utilized a confusion
matrix to evaluate the performance of the proposed approach, TFIDF-NB and HMM on D3.
Figure 9 shows the precision and recall rates of the proposed approach and the other
methods on eight types of data samples in testing data D3. The proposed approach was
more effective than the other methods at identifying general web attacks.
3.3.4. Performance analysis on real dataset. In this experiment, we intended to evaluate
the proposed approach using a real world dataset. Figure 10 shows that the proposed
approach could identify all the XSS attacks in the Chunghwa Telecom dataset. Our
approach can model the structure of normal data and malicious attacks to distinguish
between XSS attacks and benign web applications. Although the recall performance of
HMM was as good as that of our proposed mechanism, our approach still performed significantly better than HMM because it considers both forward and backward causal correlation.
The naı̈ve Bayes model still had the lowest precision and recall, demonstrating that
only considering the occurrence of script tokens could not clearly differentiate attacks
from benign applications in web script sequences.
4. Discussion. We discuss the computation and deployment analysis for our proposed
approach in more detail.
Figure 9. F-measure for identifying general web attacks (Valid, OSC, Path, SQL, XSS). The average performance of the proposed approach was better than that of the baseline methods (TF-IDF+NB and HMM) for detecting general web attacks.
4.1. Computation analysis. The computation in the proposed approach includes two parts: CRF token correlation training and token extraction. Training and inference for the CRF can be done efficiently in O(ml^3), where m is the number of productions of the grammar and l is the length of the sequence. In this application, the token length of an HTTP request/response is around 5 to 10 on average, and the number of productions of the grammar, that is, the token alphabet size in our work, is around 40 to 50. Therefore, the approach does not impose a heavy load and is practical in a real environment. From another point of view, the token extraction process is a one-pass process (linear computation, O(n)) that replaces the matched scripts or symbols.
4.2. Deployment analysis. We discuss the deployment conditions of our proposed approach from two viewpoints: the quality of the ground truth used for learning, and the detection location for data collection. The quality of the ground truth is critical for learning quality. In this work, we applied both the PKDD 2007 benchmark and one real world dataset for analysis. The results show that, on either the public benchmark or the real world dataset, the proposed method tolerates the divergence between environments and achieves better results than the baseline approaches. Regarding the detection location for data collection, the proposed approach inspects only the HTTP request/response messages; therefore, the data to be inspected is easy to collect.
5. Conclusion. In this paper, we proposed a detection mechanism for web mimicry attacks that exploits the advantages of both a novel web script token extraction and an insertion-tolerant CRF mechanism. In this method, incoming HTTP requests are segmented into sequences of tokens, and CRFs are used to model the token correlation, especially to identify the insertion tricks of mimicry attacks. The proposed method requires only HTTP request information and can be easily plugged into web intrusion detection systems for sound protection of web applications. To the best of our knowledge, this is the first anomaly detection system designed for insertion-based web mimicry attacks.
We tested our approach on the PKDD 2007 dataset and a real world dataset and found
that the proposed approach has both high precision and recall rates in detecting web
mimicry attacks with different insertion ratios. This method also significantly improves
the current web attack detection mechanism when a sequence of web scripts is used. The
Figure 10. Precision and recall rates for identifying XSS attacks: (a) precision and (b) recall for Ours, HMM and TFIDF-NB. The proposed approach could identify all the XSS attacks in the Chunghwa Telecom dataset because the approach can model the structure of normal data and malicious attacks to distinguish XSS attacks.
The results of the evaluation using both public data and real world data indicated stable performance. Therefore, ideally, the system will not require any pre-configuration, although
the level of sensitivity to anomalous data has to be automatically adjusted according to
the network environment (e.g., diversity of web scripts). Our proposed approach is also
suitable for modeling web traffic and is capable of identifying both web mimicry attacks
and general web attacks.
REFERENCES
[1] C. C. Chang and L.-W. Sung, A computer maintenance expert system based on web services, ICIC
Express Letters, vol.3, no.4(B), pp.1209-1214, 2009.
[2] CRF++: Yet Another CRF Toolkit, 2006.
[3] P. Domingos and M. J. Pazzani, On the optimality of the simple Bayesian classifier under zero-one
loss, Machine Learning, vol.29, pp.103-130, 1997.
[4] ECML/PKDD 2007’s Analyzing Web Traffic Challenge, 2007.
[5] R. Fielding, J. Gettys, J. C. Mogul, H. Frystyk, L. Masinter, P. Leach and T. Berners-Lee, Hypertext Transfer Protocol – HTTP/1.1, 1998.
[6] G. D. Forney, The Viterbi algorithm, Proc. of the IEEE, vol.61, pp.268-278, 1973.
[7] G. T. Raju and L. M. Patnaik, Knowledge discovery from web usage data: Extraction and applications of sequential and clustering patterns – A survey, International Journal of Innovative
Computing, Information and Control, vol.4, no.2, pp.381-390, 2008.
[8] D. Gao, M. K. Reiter and D. X. Song, Behavioral distance measurement using hidden Markov models,
Proc. of the 9th International Symposium on Recent Advances in Intrusion Detection, pp.19-40, 2006.
[9] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd
Edition, Morgan Kaufmann, 2005.
[10] K. S. Jones, A statistical interpretation of term specificity and its application in retrieval, Document
Retrieval Systems, pp.132-142, 1988.
[11] K. L. Ingham and H. Inoue, Comparing anomaly detection techniques for HTTP, Proc. of the 10th
International Symposium on Recent Advances in Intrusion Detection, pp.42-62, 2007.
[12] K. L. Ingham, A. Somayaji, J. Burge and S. Forrest, Learning DFA representations of HTTP for
protecting web applications, The International Journal of Computer and Telecommunications Networking, vol.51, pp.1239-1255, 2007.
[13] M. Kiani, A. Clark and G. Mohay, Evaluation of anomaly based character distribution models in the detection of SQL injection attacks, Proc. of the International Conference on Availability, Reliability and Security (ARES), pp.47-55, 2008.
[14] R. Kohavi and F. Provost, Glossary of terms, Machine Learning, vol.30, 1998.
[15] C. Kruegel, G. Vigna and W. Robertson, A multi-model approach to the detection of web-based attacks, The International Journal of Computer and Telecommunications Networking, vol.48, pp.717-738, 2005.
[16] J. Lafferty, A. McCallum and F. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, Proc. of the 18th International Conference on Machine Learning,
pp.282-289, 2001.
[17] L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition,
Proc. of the IEEE, vol.77, pp.257-286, 1989.
[18] OWASP, OWASP CAL9000 Project.
[19] RSnake, XSS (Cross Site Scripting) Cheat Sheet Esp: for Filter Evasion.
[20] Y. Song, S. Stolfo and A. Keromytis, Spectrogram: A mixture-of-Markov-chains model for anomaly detection in web traffic, Proc. of the 16th Annual Network and Distributed System Security Symposium, 2009.
[21] D. Wagner and P. Soto, Mimicry attacks on host-based intrusion detection systems, Proc. of the 9th
ACM Conference on Computer and Communications Security, pp.255-264, 2002.
[22] K. Wang, J. J. Parekh and S. J. Stolfo, Anagram: A content anomaly detector resistant to mimicry attack, Proc. of the 9th International Symposium on Recent Advances in Intrusion Detection, pp.226-248, 2006.
[23] W.-J. Shyr, K.-C. Yao, D.-F. Chen and C.-Y. Lu, Web-based remote monitoring and control system for mechatronics module, ICIC Express Letters, vol.3, no.3(A), pp.367-372, 2009.