Evolutionary Neural Networks

Backgrounds: Why NN+EC?
• “Evolving brains”: Biological neural networks compete and evolve
  – The way that intelligence was created
• Global search
  [Figure: search landscape; labels: Optimal solution, Local Max, Initial weights, Population Samples]
• Adaptation to dynamic environments without human intervention
  – Architecture evolution

Backgrounds: General Framework of EANN [X. Yao]

Backgrounds: Evolution of Connection Weights
1. Encode each individual neural network’s connection weights into chromosomes
2. Calculate the error function and determine each individual’s fitness
3. Reproduce children based on the selection criterion
4. Apply genetic operators

Backgrounds: Representation
• Binary representation
  – Weights are represented by binary digits
    • e.g. 8 bits can represent connection weights between -127 and +127
  – Limitation on representation precision
    • too few bits → some numbers cannot be approximated
    • too many bits → training might be prolonged
• To overcome the limits of binary representation, some proposed using real numbers
  – i.e., one real number per connection weight
• Standard genetic operators such as crossover are not directly applicable to this representation
  – However, some argue that it is possible to perform evolutionary computation with only mutation
  – Fogel, Fogel and Porto (1990): adopted a single genetic operator, Gaussian random mutation

Backgrounds: Evolution of Architectures
1. Encode each individual neural network’s architecture into chromosomes
2. Train each neural network with a predetermined learning rule
3. Calculate the error function and determine each individual’s fitness
4. Reproduce children based on the selection criterion
5. Apply genetic operators

Backgrounds: Direct Encoding
• All information is represented by binary strings, i.e.
each connection and node is specified by some binary bits
• An N by N matrix $C = (c_{ij})_{N \times N}$ can represent the connectivity of a network with N nodes, where
  $$c_{ij} = \begin{cases} 1, & \text{if connection is ON} \\ 0, & \text{if connection is OFF} \end{cases}$$
• Does not scale well, since a large NN needs a big matrix to represent it

Backgrounds: Indirect Encoding
• Only the most important parameters or features of an architecture are represented. Other details are left to the learning process to decide
  – e.g. specify the number of hidden nodes and let the learning process decide how they are connected (e.g. fully connected)
• More biologically plausible: according to the discoveries of neuroscience, the genetic information encoded in humans cannot specify the whole nervous system directly

Backgrounds: Evolution of Learning Rules
1. Decode each individual into a learning rule
2. Construct a neural network (either pre-determined or random) and train it with the decoded learning rule
   • i.e., the learning function itself is adapted: the connection weights are updated with an adaptive rule
3. Calculate the error function and determine each individual’s fitness
4. Reproduce children based on the selection criterion
5.
Apply genetic operators

Two Case Studies
• Evolving intrusion detector
• Evolving classifier for DNA microarray data

Evolutionary Learning Program’s Behavior In Neural Networks for Anomaly Detection

Motivation (1)
• Attacker’s strategy: causing malfunctions by exploiting a program’s bugs
  – The program then behaves differently from its normal behavior
• Anomaly detection
  – Learning a program’s normal behavior from audit data
  – Classifying programs whose behavior differs from the normal one as intrusions
  – Adopted in many host-based intrusion detection systems
• System audit data and machine learning techniques
  – Basic security module (BSM)
  – Rule-based learning, neural network and HMM

Motivation (2)
• Machine learning methods such as neural network (NN) and HMM
  – Effective for intrusion detection based on a program’s behavior
• Architecture of classifier
  – The most important thing in classification
  – Searching for an appropriate architecture for the problem is crucial
    • NN: the number of hidden neurons and connection information
    • HMM: the number of states and connection information
• Traditional methods
  – Trial-and-error
    • Train 90 neural networks [Ghosh99]
  – Takes too much time because the audit data is very large
• Optimizing architectures as well as connection weights

Related Works
• S. Forrest (1998, 1999)
  – First intrusion detection by learning program’s behavior
  – HMM performed better than other methods
• J. Stolfo (1997): Rule-based learning (RIPPER)
• N. Ye (2001)
  – Probabilistic methods: decision tree, chi-square multivariate test and first-order Markov chain model (1998 IDEVAL data)
• Ghosh (1999, 2000)
  – Multi-layer perceptrons and Elman neural network
  – Elman neural network performed the best (1999 IDEVAL data)
• Vemuri (2003)
  – kNN and SVM (1998 IDEVAL data)

The Proposed Method
• Architecture
  – System call audit data and evolutionary neural networks
[Figure: system architecture; labels: BSM Audit Facility, Audit Data, Preprocessor, GA Modeler, Normal Profile (one network per program: NNps, NNsu, NNat, NNlogin, ..., NNping for ps, su, at, login, ..., ping), Detector, ALARM]

Normal Behavior Modeling
• Evolutionary neural networks
  – Simultaneously learning weights and architectures using a genetic algorithm
  – Partial training: back-propagation algorithm
  – Representation: N × N matrix
• Rank-based selection, crossover, mutation operators
• Fitness evaluation: recognition rate on training data (mixing real normal sequences and artificial intrusive sequences)
• Generating neural networks with optimal architectures for learning a program’s behavior

ENN (Evolutionary Neural Network) Algorithm
[Flowchart: BSM data → data separation (training data, test data); generate initial ANNs → train the ANNs partially → compute the fitness → rank-based selection → apply crossover and mutation → generate new generation → Stop? No: repeat the loop; Yes: train the ANNs fully → evaluation]

Representation
        I1    H1    H2    H3    O1
  I1    0.0   0.4   0.5   0.0   0.1
  H1    1.0   0.0   0.0   0.0   0.7
  H2    1.0   0.0   0.0   0.1   0.2
  H3    0.0   0.0   1.0   0.0   0.7
  O1    1.0   1.0   1.0   1.0   0.0
• Upper-right triangle: Weight; lower-left triangle: Connectivity (1.0 = connected, 0.0 = not connected)
• I1: input node; H1, H2, H3: hidden nodes; O1: output node
[Figure: Generation of Neural Network; the network over I1, H1, H2, H3, O1 decoded from the matrix]

Crossover (1)
[Figure: crossover shown on network diagrams: two parent networks and the two offspring networks over I1, H1, H2, H3, O1]

Crossover (2)
[Figure: the same crossover shown on the matrix representation: two parent matrices and the two offspring matrices over I1, H1, H2, H3, O1]

Mutation
[Figure: mutation operators shown on example networks; panels: Delete Connection, Add Connection]

Anomaly Detection (1)
• 280 system calls in BSM audit data
  – 45 frequently occurring calls (indexed 0 to 44)
    exit; fcntl; ioctl; munmap; fork; rename; pipe; seteuid; creat; mkdir; setuid; putmsg; unlink; fchdir; utime; getmsg; chown; open - read; setgid; auditon; access; open - write; mmap; memcntl; stat; open - write,creat; audit; sysinfo; lstat; open - write,trunc; setgroups; close; readlink; open - write,creat,trunc; setpgrp; getaudit; execve; open - read,write; chdir; pathconf; vfork; open - read,write,crea
  – All remaining calls indexed as 45
• 10 input nodes, 15 hidden nodes (maximum number of hidden nodes), 2 output nodes
  – Normalizing input values between 0 and 1
  – Output nodes: normal and anomaly

Anomaly Detection (2)
• The evaluation value rises briefly when an intrusion occurs
  – Detection of locally continuous anomaly sequences is important
[Figure: output value (y-axis, 0 to 1) versus time (x-axis); series: Abnormal, normal]
  – Considering previous values:
    $$r_t = w_1 r_{t-1} + w_2 o_1^t + w_3 o_2^t$$
    where $r_t$ is the evaluation value at time $t$ and $o_1^t$, $o_2^t$ are the values of the two output nodes at time $t$
• Normalizing output values so that the same threshold applies to all neural networks:
    $$r'_t = \frac{r_t - m}{d}$$
  – m: average output value for training data, d: standard deviation

Experimental Design
• 1999 DARPA IDEVAL data provided by MIT Lincoln Lab
  – Denial of Service, probe, Remote-to-local (R2L), User-to-root (U2R)
  – Main focus: detection of U2R attacks
    • They leave traces in audit data
• Monitoring the behavior of programs which have SETUID privilege
  – Main target for U2R attacks
  at, rsh, sendmail, deallocate, atq, su, utmp_update, list_devices, atm, uptime, accton, ffbconfig, chkey, w, xlock, ptree, crontab, yppasswd, ff.core, pwait, eject, volcheck, kcms_configure, ssh, fdformat, ct, kcms_calibrate, sulogin, login, nispasswd, mkcookie, admintool, newgrp, top, allocate, sulogin, passwd, quota, mkdevalloc, whodo, ps, ufsdump, mkdevmaps, pt_chmod, rcp,
  ufsrestore, ping, rlogin, rdist, exrecover, sacadm

Experimental Design (2)
• 1999 IDEVAL: audit data for 5 weeks
  – Weeks 1 and 3 (attack-free): training data
  – Weeks 4 and 5: test data
• Test data includes 11 attacks in total, of 4 U2R types

  Name        Description                                               Times
  eject       exploiting buffer overflow in the 'eject' program         2
  ffbconfig   exploiting buffer overflow in the 'ffbconfig' program     2
  fdformat    exploiting buffer overflow in the 'fdformat' program      3
  ps          race condition attack in 'ps' program                     4

• Setting of genetic algorithm
  – Population size: 20, crossover rate: 0.3, mutation rate: 0.08, maximum generation: 100
  – Selected network: the best individual in the last generation

Evolution Results
• Convergence to fitness 0.8 near 100 generations
[Figure: fitness (y-axis, 0 to 1) versus generations (x-axis, 1 to 100); series: average, minimum, maximum]

Learning Time
• Environments
  – Intel Pentium Xeon 2.4 GHz dual processor, 1 GB RAM
  – Solaris 9 operating system
• Data
  – Login program
  – 1905 sequences in total
• Parameters
  – Learning for 5000 epochs
  – Average of 10 runs

  Types   Hidden Nodes   Running Time (sec)
  MLP     10             235.5
          15             263.4
          20             454.2
          25             482
          30             603.6
          35             700
          40             853.6
          50             1216
          60             1615
  ENN     15             4460

Detection Rates
• 100% detection rate with 0.7 false alarms per day
[Figure: detection rate (y-axis, 0 to 1) versus false alarms per day (x-axis, 0 to 20)]
• Elman NN, which shows the best performance for the 1999 IDEVAL data: 100% detection rate with 3 false alarms per day
• Effectiveness of Evolutionary NN for IDS

Results Analysis: Architecture of NN
• The best individual for learning the behavior of the ps program
  – Effective for system call sequences and more complex than a general MLP

Comparison of Architectures
• Comparison of the number of connections between the ENN evolved for 100 generations using ps program data and the MLP
• They have a similar number of connections
• However, the ENN has different types of connections and a more sophisticated architecture

  MLP
  FROM╲TO   Input   Hidden   Output
  Input     0       150      0
  Hidden    0       0        30
  Output    0       0        0

  ENN
  FROM╲TO   Input   Hidden   Output
  Input     0       86       15
  Hidden    0       67       19
  Output    0       0        0

Evolving Artificial Neural Networks for DNA Microarray Analysis

Motivation
• Colon cancer: second only to lung cancer as a cause of cancer-related mortality in Western countries
• The development of microarray technology has supplied a large volume of data to many fields
• It has been applied to the prediction and diagnosis of cancer, and is expected to help predict and diagnose cancer accurately
• Proposed method
  – Feature selection + evolutionary neural network (ENN)
  – ENN: no restriction on architecture (design without human prior knowledge)

What is Microarray?
• Microarray technology
  – Enables the simultaneous analysis of thousands of sequences of DNA for genetic and genomic research and for diagnostics
• Two Major Techniques
  – Hybridization method
    • cDNA microarray / Oligonucleotide microarray
  – Sequencing method
    • SAGE

Acquiring Gene Expression Data
[Figure: DNA microarray (Cy3, Cy5) → image scanner → gene expression data (genes × samples)]
• Expression level: $\log_2 \frac{\mathrm{Int}(\mathrm{Cy5})}{\mathrm{Int}(\mathrm{Cy3})}$

Machine Learning for DNA Microarray
Microarray → Expression data → Feature selection → Cancer predictor → Tumor / Normal
• Feature selection: Pearson's correlation coefficient, Spearman's correlation coefficient, Euclidean distance, Cosine coefficient, Information gain, Mutual information, Signal to noise ratio
• Cancer predictor: 3-layered MLP with backpropagation, k-nearest neighbor, Support vector machine, Structure adaptive self-organizing map, Ensemble classifier

Related Works

  Authors          Method: Feature                 Method: Classifier           Accuracy (%)
  Furey et al.     Signal to noise ratio           SVM                          90.3
  Li et al.        Genetic algorithm               KNN                          94.1
  Ben-Dor et al.   All genes, TNoM score           Nearest neighbor             80.6
                                                   SVM with quadratic kernel    74.2
                                                   AdaBoost                     72.6
  Nguyen et al.    Principal component analysis    Logistic discriminant        87.1
                                                   Quadratic discriminant       87.1
                   Partial least square            Logistic discriminant        93.5
                                                   Quadratic discriminant       91.9

Overview
[Flowchart: microarray data → feature selection → data separation (training data, validation data, test data); generate initial ANNs → train the ANNs partially → compute the fitness → rank-based selection → apply crossover and mutation → generate new generation → Stop? No: repeat the loop; Yes: train the ANNs fully → evaluation]

Colon Cancer Dataset
• Alon’s data
• The colon dataset consists of 62 samples of colon epithelial cells taken from colon-cancer patients
  – 40 of the 62 samples are colon cancer samples and the remaining 22 are normal samples
• Each sample contains 2000 gene expression levels
• Samples were taken from tumors and from normal healthy parts of the colons of the same patients, and measured using high density oligonucleotide arrays
• Training data: 31 of 62, Test data: 31 of 62

Experimental Setup
• Feature size: 30
• Parameters of genetic algorithm
  – Population size: 20
  – Maximum generation number: 200
  – Crossover rate: 0.3
  – Mutation rate: 0.1
• Fitness function: recognition rate for validation data
• Learning rate of BP: 0.1

Performance Comparison
[Figure: bar chart of accuracy (y-axis, 0.6 to 1) by classifier (x-axis, 1 to 7); classifiers: 1: EANN, 2: MLP, 3: SASOM, 4: SVM (Linear), 5: SVM (RBF), 6: KNN (Cosine), 7: KNN (Pearson); bar labels: 0.94, 0.71, 0.71, 0.71, 0.81, 0.71, 0.74]

Sensitivity/Specificity
• Sensitivity = 100%
• Specificity = 81.8%
• Cost comparison
  – Cost of classifying a cancer patient as normal > cost of classifying a normal person as a cancer patient

  EANN                  Predicted 0 (Normal)   Predicted 1 (Cancer)
  Actual 0 (Normal)     9                      2
  Actual 1 (Cancer)     0                      20

Architecture Analysis
[Figure panels: Whole architecture; From input to hidden neuron]

Architecture Analysis (2)
[Figure panels: Input to output; Hidden neuron to output neuron; Hidden neuron to hidden neuron]
• The input to output relationship is useful to analyze
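
Worked check of the Sensitivity/Specificity slide
The sensitivity and specificity reported for EANN can be recomputed from its confusion matrix on the 31 test samples. The sketch below is only an illustration: the variable names are not from the slides, and it assumes that the rows of the matrix are the actual classes and the columns the predicted classes, which is the reading consistent with the reported 100% sensitivity.

```python
# Confusion matrix for EANN on the 31 test samples, taken from the
# Sensitivity/Specificity slide. Keys are (actual, predicted), with
# 0 = normal and 1 = cancer. Names here are illustrative assumptions.
confusion = {
    (0, 0): 9,   # normal classified as normal (true negatives)
    (0, 1): 2,   # normal classified as cancer (false positives)
    (1, 0): 0,   # cancer classified as normal (false negatives)
    (1, 1): 20,  # cancer classified as cancer (true positives)
}

tn, fp = confusion[(0, 0)], confusion[(0, 1)]
fn, tp = confusion[(1, 0)], confusion[(1, 1)]

sensitivity = tp / (tp + fn)   # fraction of cancer samples detected
specificity = tn / (tn + fp)   # fraction of normal samples classified as normal
accuracy = (tp + tn) / sum(confusion.values())

print(f"sensitivity = {sensitivity:.1%}")  # 100.0%
print(f"specificity = {specificity:.1%}")  # 81.8%
print(f"accuracy    = {accuracy:.1%}")     # 93.5%
```

The accuracy of 29/31, about 93.5%, is consistent with the highest bar label (0.94) in the Performance Comparison chart.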