Nguyễn Văn Trường et al., Tạp chí KHOA HỌC & CÔNG NGHỆ (Journal of Science and Technology) 102(02): 45 - 49

ANOTHER LOOK AT R-CHUNK DETECTOR-BASED NEGATIVE SELECTION ALGORITHM

Nguyen Van Truong1*, Trinh Van Ha2
1 College of Education - TNU
2 College of Information and Telecommunication Technology - TNU

SUMMARY

Artificial immune system (AIS) is a diverse research area that combines the disciplines of immunology and computation. The negative selection algorithm (NSA) is one of the computational models of self/nonself discrimination and can be used for anomaly detection in AIS. It contains two stages: first, generate a set D of detectors that do not match any element of a given self set S; then, use these detectors to decide whether a given cell is self or nonself. A fast r-chunk detector-based NSA (rNSA) was introduced by M. Elberfeld et al. in 2009 [6]; its complete detector set can detect the whole nonself space. Here, we develop a dual of this negative algorithm, called the r-chunk detector-based positive selection algorithm (rPSA), which detects the complement of the nonself space with the same memory complexity but lower runtime complexity.

Keywords: Artificial immune system, negative selection algorithm, positive selection algorithm, computer security, r-chunk detector.

INTRODUCTION

AIS is inspired by observing the behavior and interaction of the normal components of biological systems (the self) and the abnormal ones (the nonself). The real immune system randomly generates T cells that are able to detect harmful antigens. The receptors of newborn T cells are assembled from combined gene fragments. In an organ called the thymus, the T cells are then exposed to self proteins, and cells whose receptors match such a self protein die. Only those that survive may leave the thymus and use their receptors to screen the organism for nonself proteins. This process is known as negative selection and is applicable to computer security.
An algorithmic abstraction of this biological process is called an NSA. The negative selection principle has been used successfully both in engineering applications and by naturally occurring biological systems such as the human immune system. The algorithm learns to distinguish a set of normally occurring patterns (self) from its complement (nonself) when only positive instances of the class are available. For example, it can distinguish safe data from noisy data, or normal processes in a computer from other processes. Many well-known change-detection and checksum algorithms, such as MD5 or SHA, solve a restricted form of the anomaly-detection problem. They assume that self is known exactly, is small enough to be stored in a single location, remains constant, and can be unambiguously distinguished from nonself. When these assumptions do not hold, the discrimination task is more challenging, and in these situations an NSA may be appropriate.

The outline of a typical NSA contains two stages [2]. In the generation stage (Fig. 1), candidate detectors are generated by some random process and censored by trying to match them against self samples taken from the set S. Candidates that match are eliminated and the rest are kept as detectors in the set D. In the detection stage (Fig. 2), the detector set is used to check whether an incoming data instance is self or nonself. If the instance matches any detector, it is reported as nonself, i.e. as an anomaly. Each detector covers (matches) a subset of the nonself set. By generating a sufficient number of independent detectors, good coverage of the nonself set is obtained.

Figure 1. Model of negative detector generation (flowchart: generate random candidates; a candidate that matches self samples is discarded, otherwise it is accepted as a new detector; repeat until there are enough detectors).

Figure 2. Negative detection of new instances (flowchart: input new samples; a sample that matches any detector is labeled "Nonself", otherwise "Self").

The negative r-chunk and r-contiguous detectors considered here are among the most common in the AIS literature. Negative r-contiguous detectors have been studied by many authors, and negative r-chunk detectors were introduced later to achieve better results on data where adjacent regions of the input strings are not necessarily semantically correlated, such as network data packets [4], [10]. All existing NSAs suffer from a worst-case size of D that is exponential in the total size of the input, which limits their practical applicability.

Our contribution is an r-chunk detector-based positive selection algorithm that is an equivalent representation of r-chunk detector-based negative selection in terms of detection performance. Our algorithm can be used to cover the complement of the nonself space. This reduces the overall runtime significantly (Table 1). Moreover, the new approach lets us extend the algorithm efficiently to real problems in which the set of positive instances is much smaller than its complement.

The remainder of the paper is organized as follows. In the next section, we define the r-chunk detector types. The subsequent section, the main part of the paper, presents our rPSA. In the last section, we summarize our approach and discuss future work.

POSITIVE AND NEGATIVE STRING-BASED DETECTORS

In this paper, we consider rNSA and rPSA as classifiers operating on a binary string space Σ^ℓ, where Σ = {0, 1}. The binary alphabet is used only to simplify the presentation; our algorithm can easily be adapted to real-world datasets over arbitrary alphabets. We use the following notation: let s ∈ Σ^ℓ be a binary string. Then ℓ = |s| is the length of s, and s[i,…,j] is the substring of s of length j - i + 1 that starts at position i.

Definition 1 (Chunk detectors). An r-chunk detector (d, i) is a tuple of a string d ∈ Σ^r and an integer i ∈ {1,…, ℓ - r + 1}. It matches a string s ∈ Σ^ℓ if s[i,…, i + r - 1] = d.

Definition 2 (Positive chunk detectors). Given a self set S, an r-chunk detector (d, i) is a positive chunk detector if it matches at least one string s ∈ S, i.e. d = s[i,…, i + r - 1] for some s ∈ S.

Definition 3 (Negative chunk detectors). Given a self set S, an r-chunk detector (d, i) is a negative chunk detector if it does not match any string s ∈ S, i.e. d ≠ s[i,…, i + r - 1] for all s ∈ S.

Example 1 shows the two detector types generated from a given self set. The example is used several times in the paper.

Example 1. Let the self set S consist of 6 binary strings, with ℓ = 5 and r = 3:
S = {s1 = 00000; s2 = 00010; s3 = 10110; s4 = 10111; s5 = 11000; s6 = 11010}.
The set of all negative 3-chunk detectors consists of ℓ - r + 1 = 3 subsets, Dn = Dn1 ∪ Dn2 ∪ Dn3, with
Dn1 = {(001,1); (010,1); (011,1); (100,1); (111,1)},
Dn2 = {(010,2); (110,2); (111,2)},
Dn3 = {(001,3); (011,3); (100,3); (101,3)}.
The set of all positive 3-chunk detectors is Dp = Dp1 ∪ Dp2 ∪ Dp3, with
Dp1 = {(000,1); (101,1); (110,1)},
Dp2 = {(000,2); (001,2); (011,2); (100,2); (101,2)},
Dp3 = {(000,3); (010,3); (110,3); (111,3)}.
One can see that, for each i = 1, 2, 3, the strings of Dni and Dpi are complements of each other in {0, 1}^3.

For self/nonself discrimination, previous work focuses on rNSA: if a given cell s matches any negative r-chunk detector, it is nonself. Our approach can be considered a dual of rNSA: if there is a position i at which s matches no positive r-chunk detector, it is nonself. This simple idea leads to the rPSA described in the following section.

R-CHUNK DETECTOR-BASED PSA

Our approach is derived from previous work on rNSA [6], whose authors use suffix patterns as the main data structure. Here we use binary trees as the data structure for both generation and detection. Our algorithm first constructs (ℓ - r + 1) binary trees Ti, one for each set Dpi, i = 1,…, ℓ - r + 1.
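The sets Dpi from which the trees Ti are built, together with their negative counterparts Dni, can be enumerated directly from Definitions 2 and 3. The following is a minimal sketch for small r only; the function name `chunk_detectors` and the 0-based list indexing are illustrative assumptions, not part of the paper. It enumerates all 2^r chunks per position, so it is meant for checking Example 1, not for large inputs.

```python
from itertools import product

def chunk_detectors(self_set, r):
    """Return (positive, negative): one set of r-chunk strings per position.

    Hypothetical helper for illustration. Index k in the returned lists
    corresponds to position i = k + 1 in the paper's 1-based notation.
    """
    ell = len(self_set[0])
    all_chunks = {"".join(bits) for bits in product("01", repeat=r)}
    positive, negative = [], []
    for k in range(ell - r + 1):
        seen = {s[k:k + r] for s in self_set}   # chunks occurring in S at this position
        positive.append(seen)                   # Definition 2
        negative.append(all_chunks - seen)      # Definition 3 (complement)
    return positive, negative

# Self set of Example 1 (l = 5, r = 3).
S = ["00000", "00010", "10110", "10111", "11000", "11010"]
Dp, Dn = chunk_detectors(S, r=3)
for i, (p, n) in enumerate(zip(Dp, Dn), start=1):
    print(f"Dp{i} = {sorted(p)}   Dn{i} = {sorted(n)}")
```

Under these assumptions the printed sets should coincide with Dp1, Dp2, Dp3 and Dn1, Dn2, Dn3 of Example 1, and at every position the positive and negative sets together cover all 8 strings of length 3.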
Then we delete all complete sub-trees of these trees to obtain a compact representation of the positive r-chunk detector set. The detection phase traverses the resulting trees one by one.

Table 1. Comparison of our results with the run times of previously published rNSA algorithms

Algorithm   Generating phase           Detecting phase
[9]         (2^r + |S|)(ℓ - r + 1)     |D|ℓ
[6]         |S|(ℓ - r + 1)r^2          |S|ℓ^2 r
Ours        |S|(ℓ - r + 1)r            (ℓ - r + 1)r

In Table 1, |D| is the number of detectors. Our algorithm and the algorithm in [6] produce results equivalent to generating the maximal set of detectors, but the complexities of our algorithm are much smaller than the others. Moreover, compared with our previous research on rNSA [8], the current rPSA firstly does not need to create new nodes in the binary trees, so it runs faster and is easier to implement. Secondly, the rPSA can update the set D easily and naturally in a dynamic environment where the set S changes in real time.

For Example 1, Figures 3.a, 3.b and 3.c illustrate the binary trees T1, T2 and T3 built from Dp1, Dp2 and Dp3, respectively. In the figure, dashed arrows mark the sub-trees that will be deleted. The left child is implicitly labeled 0 and the right child 1.

Figure 3. Binary trees constructed from Dpi, i = 1, 2, 3 (panels: a. Tree T1; b. Tree T2; c. Tree T3).

Our efficient algorithm for positive selection with r-chunk detectors is presented below.

Procedure CHUNK_DETECTOR_PSA
Input: a self set S, an integer r ∈ {1,…,ℓ} and a cell string s* to be classified.
Output: classification of s* as self or nonself.
begin
  for i = 1 to ℓ - r + 1 do
  begin
    initialize an empty binary tree Ti;
    for each s ∈ S do
      insert s[i,…, i + r - 1] into Ti;
    for every non-leaf node n ∈ Ti do
      if n is the root of a complete binary sub-tree then
        delete this sub-tree;
  end;
  if there is an i ∈ {1,…, ℓ - r + 1} such that no concatenation of labels from the root of Ti to a leaf matches s*[i,…, i + r - 1] then
    output “s* is nonself”
  else
    output “s* is self”;
end;

The outer for loop generates a compact representation of the complete positive r-chunk detector set. For each i = 1,…, ℓ - r + 1, the first inner loop constructs the binary tree Ti and the second inner loop deletes its complete sub-trees. The final if… then… else statement decides whether a given cell string s* is self or nonself.

For example, let S and r be as in Example 1, and let s* = 10100 be the input of the algorithm. The three binary trees are constructed as in Figure 3. The algorithm outputs “s* is nonself” because no path of T2 carries the labels of the substring s*[2,…,4] = 010.

We use binary trees constructed from the self set S as the main data structure, and this choice determines the time complexities. It is easy to prove that it takes time |S|(ℓ - r + 1)r to generate all necessary trees and (ℓ - r + 1)r to classify a cell string as self or nonself.

Table 2 compares our results with the run times of the algorithm published in 2009 [6] on some inputs. Averaged over these inputs, the operation counts of the rNSA in [6] are 13.5 times larger than those of our rPSA in the generating phase and 31,258,639 times larger in the detecting phase. The input in the last row is quite large and representative of real problems; for example, the length of an IP packet string is 49 bits.
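The procedure CHUNK_DETECTOR_PSA and its operation counts can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: trees are represented as nested dictionaries, "deleting a complete sub-tree" is interpreted as turning its root into a leaf (so that any continuation below it is accepted), S is assumed non-empty, and all function names (`build_tree`, `compact`, `matches`, `rpsa_generate`, `rpsa_is_self`, `rpsa_costs`) are hypothetical.

```python
def build_tree(self_set, k, r):
    """Insert the chunk s[k:k+r] of every self string into a binary trie."""
    root = {}
    for s in self_set:
        node = root
        for bit in s[k:k + r]:
            node = node.setdefault(bit, {})
    return root

def compact(node, depth, r):
    """Turn the root of every complete sub-tree into a leaf.

    Returns True if the sub-tree rooted at `node` is complete. Nodes at
    depth r are the leaves of the uncompacted tree and count as complete.
    """
    if depth == r:
        return True
    results = [compact(child, depth + 1, r) for child in node.values()]
    complete = len(node) == 2 and all(results)
    if complete:
        node.clear()  # assumption: deleting the sub-tree leaves `node` as a leaf
    return complete

def matches(tree, chunk):
    """True if some root-to-leaf path of `tree` is a prefix of `chunk`."""
    node = tree
    for bit in chunk:
        if not node:          # reached a leaf before the chunk ended
            return True
        if bit not in node:
            return False
        node = node[bit]
    return True

def rpsa_generate(self_set, r):
    """Generation phase: one compacted tree per chunk position (S non-empty)."""
    ell = len(self_set[0])
    trees = [build_tree(self_set, k, r) for k in range(ell - r + 1)]
    for tree in trees:
        compact(tree, 0, r)
    return trees

def rpsa_is_self(trees, cell, r):
    """Detection phase: self only if every position matches a positive detector."""
    return all(matches(t, cell[k:k + r]) for k, t in enumerate(trees))

def rpsa_costs(n_self, ell, r):
    """Operation counts stated in the text: (generation, detection)."""
    return n_self * (ell - r + 1) * r, (ell - r + 1) * r

S = ["00000", "00010", "10110", "10111", "11000", "11010"]
trees = rpsa_generate(S, r=3)
print(rpsa_is_self(trees, "10100", 3))   # False: 010 at position 2 is not in T2
print(rpsa_is_self(trees, "10111", 3))   # True: the string belongs to S
print(rpsa_costs(20, 10, 4))             # (560, 28)
```

With the self set of Example 1, the sketch classifies s* = 10100 as nonself for the same reason as the worked example, and `rpsa_costs(20, 10, 4)` evaluates the two formulas for the smallest input of the comparison. Nested dictionaries were chosen only for brevity; an explicit node class with left and right children would follow the description of the trees more literally.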
The last row shows clearly that, for such applications, our algorithm can detect an abnormal cell almost immediately, while the rNSA in [6] may take one minute or more (the time needed to perform 75 billion operations on a typical computer) for the same task.

Table 2. Comparison of our results with the run times of the algorithm in [6]

                        Generating phase                  Detecting phase
|S|          ℓ    r     Ours           [6]                Ours    [6]
20           10   4     560            2,240              28      8,000
1,000        20   8     104,000        832,000            104     3,200,000
100,000      45   12    40,800,000     489,600,000        408     2,430,000,000
1,000,000    50   30    630,000,000    18,900,000,000     630     75,000,000,000

CONCLUSIONS

Our approach reduces the time complexities of the two phases, generation and detection, to polynomial time. Moreover, the algorithm is easy and natural to implement. This helps to build large-scale AISs over huge data spaces. In the future, we plan to report more detailed experimental results for the algorithm on virus and spam data [7], [10] and on a standard database of network attacks, the KDD CUP'99 data set. We conjecture that our rPSA can be combined efficiently with rNSA to obtain a more flexible data structure for dynamic environments.

ACKNOWLEDGMENT

This work was funded by Vietnam's National Foundation for Science and Technology Development (NAFOSTED) via a research grant for fundamental sciences, grant number 102.01-2010.09; by Thai Nguyen University, university research code DH2011-04-26; by the research program of the College of Information and Telecommunication Technology; and by the research program of Ha Noi University. We would like to thank the Management Boards of these projects.

REFERENCES

[1]. Fernando Esponda et al., A Formal Framework for Positive and Negative Detection Schemes, IEEE Transactions on Systems, Man, and Cybernetics, 34 (1), 2004.
[2]. Forrest et al., Self-Nonself Discrimination in a Computer, in Proceedings of 1994 IEEE Symposium on Research in Security and Privacy, Oakland, CA, 202-212, 1994.
[3]. Jamie Twycross et al., Detecting Anomalous Process Behavior using Second Generation Artificial Immune Systems, Journal of Unconventional Computing, Vol. 1, pp. 1-26, 2010.
[4]. J. Balthrop et al., Coverage and generalization in an artificial immune system, GECCO 2002, 3-10, 2002.
[5]. L. N. de Castro and J. Timmis, Artificial Immune Systems: A New Computational Intelligence Approach, Springer-Verlag, 2002.
[6]. M. Elberfeld, J. Textor, Efficient algorithms for string-based negative selection, Proceedings of the 8th International Conference on Artificial Immune Systems, LNCS 5666, 109-121, 2009.
[7]. Nguyen Van Truong et al., A fast r-chunk detector-based negative selection algorithm, Journal of Science and Technology, Thai Nguyen University, 2(90), 2012, 55-58.
[8]. Patrik D'haeseleer et al., An Immunological Approach to Change Detection: Algorithms, Analysis and Implications, IEEE Symposium on Security and Privacy, 1996.
[9]. T. Stibor et al., An investigation of r-chunk detector generation on higher alphabets, GECCO 2004, LNCS 3102, 299-307, 2004.
[10]. Zhou Ji et al., Revisiting Negative Selection Algorithms, Evolutionary Computation 15(2), 2007, 223-251.

TÓM TẮT (SUMMARY)

ANOTHER LOOK AT THE R-CHUNK DETECTOR-BASED NEGATIVE SELECTION ALGORITHM

Nguyễn Văn Trường1*, Trịnh Văn Hà2
1 College of Education - Thai Nguyen University
2 College of Information and Communication Technology - Thai Nguyen University

The artificial immune system is a rich research field that combines the principles of immunology and computation. The negative selection algorithm is one of the common computational models of self/nonself detection and can be used for anomaly detection in artificial immune systems. It consists of two stages: generate a set D of detectors that match no element of a given self set S; then use these detectors to decide whether a given cell is self or nonself. A fast negative selection algorithm based on r-chunk detectors was first introduced by M. Elberfeld et al. in 2009 [6]; the complete detector set it generates can detect the entire nonself space. In this paper, we develop the dual of the negative algorithm, called the r-chunk detector-based positive selection algorithm, which detects the complement of the nonself space with the same memory complexity but lower time complexity.

Keywords: artificial immune system, negative selection algorithm, positive selection algorithm, computer security, r-chunk detector.

Received: 3/1/2013; reviewed: 24/1/2013; accepted for publication: 26/3/2013

* Tel: 0915 016063, Email: nvtruongtn@gmail.com