Nguyễn Văn Trường et al.
Tạp chí KHOA HỌC & CÔNG NGHỆ
102(02): 45 - 49
ANOTHER LOOK AT R-CHUNK DETECTOR-BASED NEGATIVE
SELECTION ALGORITHM
Nguyen Van Truong1*, Trinh Van Ha2
1 College of Education - TNU
2 College of Information and Telecommunication Technology - TNU
SUMMARY
Artificial immune system (AIS) is a diverse research area that combines the disciplines of
immunology and computation. The negative selection algorithm (NSA) is one of the computational
models of self/nonself discrimination that can be used for anomaly detection in AIS. It consists of
two stages: first, generate a set D of detectors that do not match any element of a given self set S;
then, use these detectors to decide whether a given cell is self or nonself. A fast r-chunk
detector-based NSA (rNSA) was introduced by M. Elberfeld et al. in 2009 [6]; its complete detector
set can detect the entire nonself space. Here, we develop a negative-dual algorithm, called the
r-chunk detector-based positive selection algorithm (rPSA), that detects the complement of the
nonself space with the same memory complexity but lower runtime complexity.
Keywords: Artificial immune system, negative selection algorithm, positive selection algorithm,
computer security, r-chunk detector.
INTRODUCTION
AIS is inspired by the observation of the behavior and interaction of the normal components of
biological systems (the self) and the abnormal ones (the nonself). The real immune system
generates T cells randomly with the ability to detect harmful antigens. The receptors of newborn
T cells are assembled from combined gene fragments. In an organ called the thymus, the T cells
are then exposed to self proteins, and cells whose receptors match such a self protein die. Only
those that survive may leave the thymus and use their receptors to screen the organism for
nonself proteins. This process is known as negative selection and is applicable to computer
security. An algorithmic abstraction of this biological process is called an NSA.
Negative selection has been used successfully both in engineering applications and by naturally
occurring biological systems such as the human body. The algorithm learns to distinguish a set of
normally occurring patterns (self) from its complement (nonself) when only positive instances of
the class are available. For example, it can distinguish safe data from noisy data, or normal
processes in a computer from abnormal ones. There are many well-known change-detection and
checksum algorithms, such as MD5 or SHA, that solve a restricted form of the anomaly-detection
problem. They assume that self is known exactly, is small enough to be stored in a single
location, remains constant, and can be unambiguously distinguished from nonself. When these
assumptions do not hold, the discrimination task is more challenging, and in these situations
the NSA may be appropriate.
A typical NSA consists of two stages [2]. In the generation stage (Fig. 1), candidate detectors
are generated by some random process and censored by matching them against the self samples in
set S. Candidates that match a self sample are eliminated and the rest are kept as detectors in
set D. In the detection stage (Fig. 2), the detector set is used to decide whether an incoming
data instance is self or nonself. If the instance matches any detector, it is classified as
nonself, i.e. as an anomaly. Each detector covers (matches) a subset of the nonself set. By
generating a sufficient number of independent detectors, good coverage of the nonself set is
obtained.
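[Flowchart: Begin -> Generate random candidates -> Match self samples? If yes, discard the
candidate and generate again; if no, accept it as a new detector -> Enough detectors? If no,
generate again; if yes, End.]
Figure 1. Model of negative detector generation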
[Flowchart: Begin -> Input new samples -> Match any detector? If yes, output “Nonself”; if no,
output “Self” -> End.]
Figure 2. Negative detection of new instances
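The two stages in Figures 1 and 2 can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes the r-chunk matching rule defined later in the paper, a small toy self set, and hypothetical names (`matches`, `generate_detectors`, `classify`, `max_tries`).

```python
import random

def matches(detector, s):
    """r-chunk rule (assumed here): (d, i) matches s if s[i..i+r-1] == d, i is 1-based."""
    d, i = detector
    return s[i - 1:i - 1 + len(d)] == d

def generate_detectors(self_set, length, r, wanted, rng, max_tries=10_000):
    """Generation stage: draw random candidates and censor them against self."""
    detectors = set()
    for _ in range(max_tries):
        if len(detectors) >= wanted:          # "Enough detectors?"
            break
        i = rng.randint(1, length - r + 1)    # random position, inclusive bounds
        d = "".join(rng.choice("01") for _ in range(r))
        candidate = (d, i)
        # "Match self samples?": keep the candidate only if it matches no self string.
        if not any(matches(candidate, s) for s in self_set):
            detectors.add(candidate)
    return detectors

def classify(sample, detectors):
    """Detection stage: a sample matched by any detector is reported as nonself."""
    return "nonself" if any(matches(det, sample) for det in detectors) else "self"

# Toy self set (an assumption for illustration), binary strings of length 5.
S = ["00000", "00010", "10110", "10111", "11000", "11010"]
D = generate_detectors(S, length=5, r=3, wanted=8, rng=random.Random(0))
print(sorted(D))
print([classify(s, D) for s in S])
```

Because every accepted detector was censored against S, no self string can be reported as nonself; how much of the nonself space is covered depends on how many detectors are drawn.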
The negative r-chunk and r-contiguous detectors considered here are among the most common ones
in the AIS literature. Negative r-contiguous detectors have been studied by many authors, and
negative r-chunk detectors were introduced later to achieve better results on data where
adjacent regions of the input strings are not necessarily semantically correlated, such as
network data packets [4], [10].
All existing NSAs suffer from a worst-case size of D that is exponential in the total size of
the input, which limits their practical applicability. Our contribution is an r-chunk
detector-based positive selection algorithm that is an equivalent representation of r-chunk
detector-based negative selection in terms of detection performance. Our algorithm can be used
to cover the complement of the nonself space. This reduces the overall runtime significantly
(Table 1). Moreover, the new approach lets us extend the algorithm efficiently to real problems
in which the set of positive instances is much smaller than the set of complementary ones.
The remainder of the paper is organized as follows. In the next section, we define the r-chunk
detector types. The subsequent section, the main part of the paper, presents our rPSA. In the
last section, we summarize our approach and discuss future work.
POSITIVE AND NEGATIVE STRING-BASED DETECTORS
In this paper, we consider rNSA and rPSA as classifiers operating on a binary string space
Σ^ℓ, where Σ = {0, 1}. The binary alphabet is used only to keep the presentation simple; our
algorithm can easily be adjusted to real-world datasets over arbitrary alphabets. We also use
the following notation: let s ∈ Σ^ℓ be a binary string. Then ℓ = |s| is the length of s and
s[i,…,j] is the substring of s of length j - i + 1 that starts at position i.
Definition 1 (Chunk detectors). An r-chunk detector (d, i) is a tuple of a string d ∈ Σ^r and
an integer i ∈ {1,…, ℓ - r + 1}. It matches a string s ∈ Σ^ℓ if s[i,…, i + r - 1] = d.
Definition 2 (Positive chunk detectors). Given a self set S, an r-chunk detector (d, i) is a
positive chunk detector if it matches at least one string s ∈ S, i.e. d = s[i,…, i + r - 1]
for some s ∈ S.
Definition 3 (Negative chunk detectors). Given a self set S, an r-chunk detector (d, i) is a
negative chunk detector if it does not match any string s ∈ S, i.e. d ≠ s[i,…, i + r - 1]
for all s ∈ S.
Example 1 shows two detector types
generated from a given self set. The example
is used several times in the paper.
Example 1. Let S be a self set of 6 binary strings with ℓ = 5 and r = 3: S = {s1 = 00000;
s2 = 00010; s3 = 10110; s4 = 10111; s5 = 11000; s6 = 11010}. The set of all negative 3-chunk
detectors, which consists of (ℓ - r + 1) = 3 subsets, is Dn = Dn1 ∪ Dn2 ∪ Dn3 with
Dn1 = {(001,1); (010,1); (011,1); (100,1); (111,1)},
Dn2 = {(010,2); (110,2); (111,2)} and
Dn3 = {(001,3); (011,3); (100,3); (101,3)}.
The set of all positive 3-chunk detectors is Dp = Dp1 ∪ Dp2 ∪ Dp3 with
Dp1 = {(000,1); (101,1); (110,1)},
Dp2 = {(000,2); (001,2); (011,2); (100,2); (101,2)} and
Dp3 = {(000,3); (010,3); (110,3); (111,3)}.
It can be seen that the strings of Dni and Dpi are complements of each other in {0, 1}^3, for
i = 1, 2, 3.
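The sets of Example 1 can be recomputed directly from Definitions 2 and 3. The sketch below is one possible way to do so; the function names `positive_detectors` and `negative_detectors` are hypothetical and the detectors are stored per position i as plain sets of strings.

```python
from itertools import product

def positive_detectors(self_set, r):
    """Map each 1-based position i to the chunks d such that (d, i) matches some s in S."""
    length = len(self_set[0])
    return {i: {s[i - 1:i - 1 + r] for s in self_set}
            for i in range(1, length - r + 2)}

def negative_detectors(self_set, r, alphabet="01"):
    """Map each position i to the chunks d such that (d, i) matches no s in S."""
    all_chunks = {"".join(p) for p in product(alphabet, repeat=r)}
    return {i: all_chunks - chunks
            for i, chunks in positive_detectors(self_set, r).items()}

S = ["00000", "00010", "10110", "10111", "11000", "11010"]
Dp = positive_detectors(S, 3)
Dn = negative_detectors(S, 3)
for i in sorted(Dp):
    print(f"Dp{i} = {sorted(Dp[i])}   Dn{i} = {sorted(Dn[i])}")
```

Computing the negative sets as a set difference makes the complement relation between Dni and Dpi explicit.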
For self/nonself discrimination, previous research has focused only on rNSA: if a given cell s
matches any negative r-chunk detector, it is nonself. Our approach can be considered a dual of
rNSA: if, for some position i, a given cell s matches none of the positive r-chunk detectors
(d, i), it is nonself. This simple idea leads to our algorithm rPSA, described in the
following section.
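The duality can be checked exhaustively on the self set of Example 1. The following sketch (with hypothetical helper names `chunk`, `nonself_by_rnsa` and `nonself_by_rpsa`) classifies every string of length 5 with both rules and compares the answers.

```python
from itertools import product

S = ["00000", "00010", "10110", "10111", "11000", "11010"]
ELL, R = 5, 3
POSITIONS = range(1, ELL - R + 2)            # 1-based positions 1..ell-r+1

def chunk(s, i):
    """Substring s[i..i+r-1] with a 1-based start position i."""
    return s[i - 1:i - 1 + R]

ALL = {"".join(p) for p in product("01", repeat=R)}
POS = {i: {chunk(s, i) for s in S} for i in POSITIONS}   # positive detectors
NEG = {i: ALL - POS[i] for i in POSITIONS}               # negative detectors

def nonself_by_rnsa(s):
    """Negative rule: some negative detector matches s."""
    return any(chunk(s, i) in NEG[i] for i in POSITIONS)

def nonself_by_rpsa(s):
    """Positive (dual) rule: at some position no positive detector matches s."""
    return any(chunk(s, i) not in POS[i] for i in POSITIONS)

strings = ["".join(p) for p in product("01", repeat=ELL)]
agree = all(nonself_by_rnsa(s) == nonself_by_rpsa(s) for s in strings)
print("rules agree on all", len(strings), "strings:", agree)
```

The agreement follows from the complement relation at each position: a chunk is in the negative set exactly when it is not in the positive set.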
R-CHUNK DETECTOR-BASED PSA
Our approach is derived from previous work on rNSA [6], whose authors use suffix patterns as
the main data structure. Here we use binary trees as the data structure for both generation
and detection. Our algorithm first constructs (ℓ - r + 1) binary trees Ti corresponding to the
(ℓ - r + 1) sets Dpi, i = 1,…, (ℓ - r + 1). It then deletes all complete sub-trees of these
trees to obtain a compact representation of the positive r-chunk detector set. The detection
phase traverses the pruned trees one by one.
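As a rough illustration of the construction and pruning steps, the sketch below stores each tree as nested dictionaries keyed by "0" and "1". This representation and the names `build_tree`, `prune` and `paths` are assumptions made for the example; deleting a complete sub-tree is interpreted as turning its root into a leaf.

```python
def build_tree(chunks):
    """Insert every chunk into a binary trie; an empty dict is a leaf."""
    root = {}
    for chunk in chunks:
        node = root
        for bit in chunk:
            node = node.setdefault(bit, {})
    return root

def prune(node, height):
    """Cut the sub-tree below every node that roots a complete binary sub-tree.

    `height` is the number of levels still expected below `node`.
    Returns True if the sub-tree rooted at `node` was complete.
    """
    if height == 0:
        return True
    # Visit all children first (no short-circuit) so deeper sub-trees get pruned too.
    results = [prune(child, height - 1) for child in node.values()]
    complete = len(node) == 2 and all(results)
    if complete:
        node.clear()                  # the node becomes a leaf
    return complete

def paths(node, prefix=""):
    """List the label strings of all root-to-leaf paths."""
    if not node:
        return [prefix]
    return [p for bit in sorted(node) for p in paths(node[bit], prefix + bit)]

# Positive 3-chunk detector sets of Example 1.
Dp = {1: ["000", "101", "110"],
      2: ["000", "001", "011", "100", "101"],
      3: ["000", "010", "110", "111"]}
trees = {}
for i, chunks in Dp.items():
    trees[i] = build_tree(chunks)
    prune(trees[i], 3)
    print(f"T{i}:", paths(trees[i]))
```

After pruning, a path shorter than r stands for all chunks that share that prefix, which is what makes the representation compact.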
Table 1. Comparison of our results with the run times of previously published rNSA algorithms

Algorithm   Generating phase            Detecting phase
[9]         (2^r + |S|)(ℓ - r + 1)      |D|ℓ
[6]         |S|(ℓ - r + 1)r^2           |S|ℓ^2 r
Ours        |S|(ℓ - r + 1)r             (ℓ - r + 1)r
In Table 1, the parameter |D| is the number of detectors. Our algorithm and the algorithm in
[6] produce results equivalent to those obtained with the maximal set of generated detectors,
but the complexities of our algorithm are much smaller than the others. Moreover, in
comparison to our previous research on rNSA [8], the current rPSA firstly does not need to
create new nodes in the binary tree, so it runs faster and is easier to implement. Secondly,
rPSA can update the set D easily and naturally in a dynamic environment where the set S
changes over time.
For Example 1, Figures 3.a, 3.b and 3.c illustrate the binary trees T1, T2 and T3 built from
Dp1, Dp2 and Dp3, respectively. In the figure, dashed arrows mark the sub-trees that will be
deleted. The left child of a node is implicitly labeled 0 and the right child 1.
[Figure panels: a. Tree T1; c. Tree T2; e. Tree T3]
Figure 3. Binary trees constructed from Dpi, i = 1, 2, 3.
Our efficient algorithm for positive selection with r-chunk detectors is presented below.
Procedure CHUNK_DETECTOR_PSA
Input: a self set S ⊆ Σ^ℓ, an integer r ∈ {1,…, ℓ} and a cell string s* ∈ Σ^ℓ to be detected.
Output: detection of s* as self or nonself.
begin
  for i = 1 to ℓ - r + 1 do
  begin
    initialize an empty binary tree Ti;
    for each s ∈ S do
      insert s[i,…, i + r - 1] into Ti;
    for every non-leaf node n ∈ Ti do
      if n is the root of a complete binary sub-tree then
        delete this sub-tree, so that n becomes a leaf;
  end;
  if for some i ∈ {1,…, ℓ - r + 1} no concatenation of labels from the root of Ti to a leaf
     is a prefix of s*[i,…, i + r - 1] then
    output “s* is nonself”
  else
    output “s* is self”;
end;
The compact representation of the complete positive r-chunk detector set is generated by the
outer for loop. For each i = 1,…, ℓ - r + 1, the binary tree Ti is constructed in the first
inner loop and its complete sub-trees are deleted in the second one. The final
if… then… else statement decides whether the given cell string s* is self or nonself.
For example, let S and r be as in Example 1 and let s* = 10100 be the input of the algorithm.
Three binary trees are constructed as in Figure 3. The algorithm outputs “s* is nonself”
because no root-to-leaf path of T2 matches the substring s*[2,…,4] = 010 of s*.
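The detection step of this worked example can be traced with a short sketch. The three pruned trees are written out by hand from the positive detector sets of Example 1 as nested dictionaries (an assumed representation in which an empty dictionary is a leaf); `matches_tree` and `detect` are hypothetical names.

```python
# Pruned trees for Example 1, written by hand; {} marks a leaf.
T1 = {"0": {"0": {"0": {}}}, "1": {"0": {"1": {}}, "1": {"0": {}}}}   # 000, 101, 110
T2 = {"0": {"0": {}, "1": {"1": {}}}, "1": {"0": {}}}                 # 00*, 011, 10*
T3 = {"0": {"0": {"0": {}}, "1": {"0": {}}}, "1": {"1": {}}}          # 000, 010, 11*
TREES = [T1, T2, T3]

def matches_tree(tree, chunk):
    """True if some root-to-leaf path of the tree is a prefix of the chunk."""
    node = tree
    for bit in chunk:
        if not node:            # reached a leaf, possibly a pruned one: accept
            return True
        if bit not in node:     # no edge for this bit: the chunk is not covered
            return False
        node = node[bit]
    return True

def detect(trees, s, r):
    """Report nonself as soon as one tree fails to cover its chunk of s."""
    for offset, tree in enumerate(trees):          # offset = i - 1
        if not matches_tree(tree, s[offset:offset + r]):
            return "nonself"
    return "self"

print(detect(TREES, "10100", 3))   # nonself: chunk 010 at position 2 is not in T2
print(detect(TREES, "10110", 3))   # self: 10110 is the self string s3
```

Each call walks at most r edges in each of the (ℓ - r + 1) trees, which is where the detection cost of (ℓ - r + 1)r comes from.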
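We use binary trees constructed from the self set S as the main data structure, which
determines the time complexities. It is easy to prove that it takes time |S|(ℓ - r + 1)r to
generate all necessary trees and (ℓ - r + 1)r to classify a cell string as self or nonself:
inserting one chunk of length r costs r steps and is done for each of the |S| strings in each
of the (ℓ - r + 1) trees, while detection follows at most r edges in each tree.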
Table 2 compares our results with the run times of the algorithm published in 2009 [6] on
some inputs. Averaged over these inputs, rPSA needs 13.5 times fewer operations than the rNSA
in [6] in the generating phase and 31,258,639 times fewer in the detecting phase. The input in
the last row is quite large and representative of real problems; for example, the length of an
IP packet string is 49 bits. The last row shows clearly that for such applications our
algorithm can detect an abnormal cell almost immediately, while the rNSA in [6] may take one
minute or more (the time needed to perform 75 billion operations on an ordinary computer) for
the same task.
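The operation counts and the two averages quoted above follow from the complexity formulas of Table 1. The short sketch below recomputes them for the four inputs of Table 2; the function names `ours` and `elberfeld_textor` are hypothetical, and the counts are formula values, not measured run times.

```python
def ours(size_s, ell, r):
    """(generating, detecting) operation counts for rPSA from Table 1."""
    return size_s * (ell - r + 1) * r, (ell - r + 1) * r

def elberfeld_textor(size_s, ell, r):
    """(generating, detecting) operation counts for the rNSA of [6] from Table 1."""
    return size_s * (ell - r + 1) * r ** 2, size_s * ell ** 2 * r

# (|S|, ell, r) for the four rows of Table 2.
inputs = [(20, 10, 4), (1_000, 20, 8), (100_000, 45, 12), (1_000_000, 50, 30)]

gen_ratios, det_ratios = [], []
for size_s, ell, r in inputs:
    gen_ours, det_ours = ours(size_s, ell, r)
    gen_ref, det_ref = elberfeld_textor(size_s, ell, r)
    gen_ratios.append(gen_ref / gen_ours)
    det_ratios.append(det_ref / det_ours)
    print(f"{size_s:>9,} {ell:>3} {r:>3} | {gen_ours:>13,} {gen_ref:>15,} "
          f"| {det_ours:>4,} {det_ref:>15,}")

avg_gen = sum(gen_ratios) / len(gen_ratios)
avg_det = sum(det_ratios) / len(det_ratios)
print(f"average ratio, generating phase: {avg_gen:.1f}")
print(f"average ratio, detecting phase:  {avg_det:,.0f}")
```

Note that the generating-phase ratio is exactly r for every input, so its average is simply the mean of the four values of r.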
CONCLUSIONS
Our approach reduces the time complexities of the two phases, generation and detection, to
polynomial time. Moreover, the algorithm is easy and natural to implement. This helps to build
large-scale AISs over huge data spaces. In the future, we plan to report more detailed
experimental data on the algorithm for virus and spam detection [7] [10] and for a standard
database of network attacks, namely the KDD CUP’99 data set. We conjecture that rPSA can be
combined efficiently with rNSA to obtain a more flexible data structure for dynamic
environments.
ACKNOWLEDGMENT
This work was funded by Vietnam's National Foundation for Science and Technology Development
(NAFOSTED) via a research grant for fundamental sciences, grant number 102.01-2010.09; by a
Thai Nguyen University research project, code number DH2011-04-26; by a research project of
the College of Information and Telecommunication Technology; and by a research project of
Ha Noi University. We would like to thank the Management Boards of these projects.
Table 2. Comparison of our results with the run times of the algorithm in [6]

                         Generating phase                  Detecting phase
|S|         ℓ    r       Ours           [6]                Ours    [6]
20          10   4       560            2,240              28      8,000
1,000       20   8       104,000        832,000            104     3,200,000
100,000     45   12      40,800,000     489,600,000        408     2,430,000,000
1,000,000   50   30      630,000,000    18,900,000,000     630     75,000,000,000
REFERENCES
[1]. Fernando Esponda et al., A Formal Framework for Positive and Negative Detection
Schemes, IEEE Transactions on Systems, Man, and Cybernetics, 34 (1), 2004.
[2]. Forrest et al., Self-Nonself Discrimination in a Computer, in Proceedings of 1994 IEEE
Symposium on Research in Security and Privacy, Oakland, CA, 202-212, 1994.
[3]. Jamie Twycross et al., Detecting Anomalous Process Behavior using Second Generation
Artificial Immune Systems, Journal of Unconventional Computing, Vol. 1, pp. 1-26, 2010.
[4]. J. Balthrop et al., Coverage and generalization in an artificial immune system,
GECCO 2002, 3-10, 2002.
[5]. L. N de Castro and J. Timmis, Artificial Immune Systems: A New Computational
Intelligence Approach, Springer-Verlag, 2002.
[6]. M. Elberfeld, J. Textor, Efficient algorithms for string-based negative selection,
Proceedings of the 8th International Conference on Artificial Immune Systems, LNCS 5666,
109-121, 2009.
[7]. Nguyen Van Truong et al., A fast r-chunk detector-based negative selection algorithm,
Journal of Science and Technology, Thai Nguyen University, 2(90), 2012, 55-58.
[8]. Patrik D’haeseleer et al., An Immunological Approach to Change Detection: Algorithms,
Analysis and Implications, IEEE Symposium on Security and Privacy, 1996.
[9]. T. Stibor et al., An investigation of r-chunk detector generation on higher alphabets,
GECCO 2004, LNCS 3102, 299-307, 2004.
[10]. Zhou Ji et al., Revisiting Negative Selection Algorithms, Evolutionary Computation
15(2), 2007, 223-251.
TÓM TẮT (SUMMARY)
ANOTHER LOOK AT THE R-CHUNK DETECTOR-BASED NEGATIVE SELECTION ALGORITHM
Nguyễn Văn Trường1*, Trịnh Văn Hà2
1 College of Education - Thai Nguyen University
2 College of Information and Telecommunication Technology - Thai Nguyen University
Artificial immune systems are a rich research area that combines the principles of immunology
and computation. The negative selection algorithm is one of the popular computational models
of self/nonself discrimination and can be used for anomaly detection in artificial immune
systems. It consists of two stages: first, generate a set D of detectors that do not match any
element of a given self set S; then, use these detectors to decide whether a given cell is
self or nonself. A fast negative selection algorithm based on r-chunk detectors was first
introduced by M. Elberfeld et al. in 2009 [6]; the complete detector set it generates can
detect the entire nonself space. In this paper, we develop the negative-dual algorithm, called
the r-chunk detector-based positive selection algorithm, which detects the complement of the
nonself space with the same memory complexity but lower time complexity.
Keywords: artificial immune system, negative selection algorithm, positive selection
algorithm, computer security, r-chunk detector.
Received: 3/1/2013; reviewed: 24/1/2013; accepted for publication: 26/3/2013
*
Tel: 0915 016063, Email: nvtruongtn@gmail.com