One Class Support Vector Machines for Detecting Anomalous Windows Registry Accesses
Krysta Svore
Joint work with Katherine Heller, Angelos Keromytis, and Salvatore Stolfo

Anomaly Detection
- We would like to detect anomalous behavior by monitoring Windows registry queries.
- We studied a host-based intrusion detection system that applies a one class support vector machine (OCSVM) algorithm within the Registry Anomaly Detection (RAD) system.

Key Results
- The Probabilistic Anomaly Detection (PAD) algorithm dramatically outperforms the OCSVM algorithm, due to its hierarchical prior.
- To improve the performance of the OCSVM algorithm, we need to know how to construct well-defined kernels.

Host-Based Intrusion Detection Systems (IDS)
- The most frequently attacked operating system is Microsoft Windows.
- Combative methods: virus scanners and security patches
  - Unable to combat unknown attacks
  - Require frequent updates
- Host-based IDS
  - Detects intrusions by monitoring system accesses
  - Used with data mining techniques for improved detection ability

Our Approach
- Use the Registry Anomaly Detection (RAD) system to monitor Windows registry queries.
- Apply a One Class Support Vector Machine (OCSVM) algorithm within the RAD system.
- Compare our system with previous work that used the Probabilistic Anomaly Detection (PAD) algorithm within the RAD system.

The Windows Registry
- Stores configuration settings for programs, security information, user profiles, and other system information.
- Consists of registry keys and their associated values.
- Programs query the registry for information by accessing a specific registry key. Example access record:
    Process: EXPLORER.EXE
    Query: OpenKey
    Key: HKCR\CLSID\B41DB860-8EE4-11D2-9906E49FADC173CA\shellex\MayChangeDefaultMenu
    Response: SUCCESS
    ResultValue: NOTFOUND
- Activity varies from several accesses per minute to several thousand accesses per minute.
- Almost all programs use the registry.

Registry Anomaly Detection (RAD) System
Three components:
- Audit Sensor: analyzes the registry accesses
- Model Generator: develops a model of normal behavior
- Anomaly Detector: classifies a new registry access as normal or attack according to the generated model

Probabilistic Anomaly Detection (PAD) Algorithm
- Developed by E. Eskin et al., 2002.
- Relies on consistency checks over normal data and labels a record anomalous if it fails any one of these checks.
- First order consistency check: verifies that a feature value is consistent with the observed values of that feature in the normal data set (5 checks total).
- Second order consistency check: determines the conditional probability of a feature value given another feature value (20 checks total).

PAD: Probability Estimates
- Probability of an observed element:
    P(X = i) = C (N_i + α) / (kα + N)
  where
    N   - total number of observations
    N_i - number of observations of symbol i
    α   - "pseudo count" added to each observed symbol
    k   - number of observed symbols
- Probability of an unobserved element:
    P(X = i) = (1 - C) / (L - k)
  where L is the number of possible symbols.
- Scaling factor: C = N / (N + L - k)
- A sketch of this estimator appears after the example below.

An Example
- LAN: we expect the probability of seeing a previously unobserved IP address to be small.
- Peering center: we expect the probability of seeing a previously unobserved IP address to be almost 1.
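As an illustration only, here is a minimal Python sketch of the smoothed estimator above, written directly from the formulas on the slide. The function name pad_probability, the Counter-based counts, and the toy values are our own and are not part of the PAD/RAD implementation.

```python
from collections import Counter

def pad_probability(symbol, counts: Counter, L: int, alpha: float = 0.01) -> float:
    """Estimate P(X = symbol) using the smoothing from the slide above.

    counts - observed counts of each symbol in the normal training data
    L      - number of possible symbols (assumed larger than the number observed)
    alpha  - "pseudo count" added to each observed symbol
    """
    N = sum(counts.values())   # total number of observations
    k = len(counts)            # number of distinct observed symbols
    C = N / (N + L - k)        # scaling factor C = N / (N + L - k)

    if symbol in counts:
        # Observed symbol: C * (N_i + alpha) / (k*alpha + N)
        return C * (counts[symbol] + alpha) / (k * alpha + N)
    # Unobserved symbol: the remaining mass (1 - C) spread over the L - k unseen symbols
    return (1.0 - C) / (L - k)

# Toy usage with an invented alphabet of L = 10 possible symbols.
counts = Counter({"a": 7, "b": 2, "c": 1})
print(pad_probability("a", counts, L=10))   # a symbol seen in training
print(pad_probability("z", counts, L=10))   # a previously unobserved symbol
```

Note how the scaling factor C splits the probability mass between symbols seen in training and the L - k symbols never observed.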
One Class Support Vector Machine (OCSVM) Algorithm
- Maps input data into a high dimensional feature space.
- Iteratively finds the maximal margin hyperplane that best separates the training data from the origin.
- Solves an optimization problem to find the rule f with maximal margin: f(x) = ⟨w, x⟩ + b.
- If f(x) < 0, label x as anomalous.

OCSVM: Kernels
- Equivalent to solving the dual quadratic programming problem
    min_α (1/2) Σ_{i,j} α_i α_j K(x_i, x_j)
    subject to 0 ≤ α_i ≤ 1/(νl), Σ_i α_i = 1
  where
    α_i - Lagrange multiplier
    ν   - parameter controlling the trade-off between the distance of the hyperplane from the origin and the number of points in the training data set
    l   - number of points in the training data set
- The kernel function projects input vectors into a feature space, allowing for nonlinear decision boundaries.
    Feature map: Φ: X → R^N
    Kernel function: K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩

Efficiency Comparison
- Complexity of PAD
    Time: O(v²R²), Space: O(vR²)
  where v is the number of unique record values and R is the number of record components.
- Complexity of OCSVM
    Time: O(dL³), Space: O(d(L + T))
  where d is the number of dimensions, L is the number of records in the training data set, and T is the number of records in the test data set.

Experimental Data
- Collected training data on Windows NT 4.0.
- Used approximately 500,000 attack-free records to train the system.
- Tested the system on approximately 300,000 records, with 2,000 labeled as attack.

Experiments
- Used three common kernels:
    Linear: K(x, y) = x · y
    Polynomial: K(x, y) = (x · y + 1)^d, where d is the degree of the polynomial
    Gaussian: K(x, y) = exp(−‖x − y‖² / (2σ²)), where σ² is the variance
- Used two types of feature vectors (a training and evaluation sketch using them follows the Evaluation slide below):
    Binary: one dimension for every unique entry for each of the five given record values
    Frequency-based: each feature corresponds to the number of occurrences of the corresponding record component in the training set

Dimensionality of Binary and Frequency-Based Feature Vectors
- Binary feature vector: dimension = (number of possible entries for feature value 1) × (number of possible entries for feature value 2) × ... × (number of possible entries for feature value 5)
- Frequency-based feature vector: dimension = 5

Evaluation
- Used two statistics to compare the systems' accuracy:
    Detection rate: percentage of attack records correctly identified
    False positive rate: percentage of normal records mislabeled as anomalous
- Plotted results with a Receiver Operating Characteristic (ROC) curve, which plots the percentage of false positives versus the percentage of true positives.
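Purely as a sketch of the setup described above (not the authors' code), the snippet below trains a one class SVM with each of the three kernels from the Experiments slide and reports the two evaluation statistics. It assumes scikit-learn; X_train, X_test, and y_test are synthetic placeholders standing in for the frequency-based feature vectors, and the ν and kernel parameters are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Placeholder data shaped like the frequency-based representation above:
# one row per registry access record, 5 features per record.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 5))          # attack-free training records
X_test = rng.random((200, 5))            # mixed normal/attack test records
y_test = rng.integers(0, 2, size=200)    # 1 = attack, 0 = normal (test labels)

# The three kernels from the Experiments slide.
models = {
    "linear": OneClassSVM(kernel="linear", nu=0.1),
    "polynomial": OneClassSVM(kernel="poly", degree=3, gamma=1.0, coef0=1, nu=0.1),
    "gaussian": OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1),  # gamma = 1/(2*sigma^2)
}

for name, model in models.items():
    model.fit(X_train)                   # learn a boundary around the normal data
    pred = model.predict(X_test)         # +1 = normal, -1 = anomalous
    flagged = pred == -1

    detection_rate = flagged[y_test == 1].mean()        # attacks correctly flagged
    false_positive_rate = flagged[y_test == 0].mean()   # normal records mislabeled

    print(f"{name}: detection={detection_rate:.2f}, "
          f"false positives={false_positive_rate:.2f}")
```

In practice the ROC curve on the Evaluation slide is traced by sweeping a threshold over the continuous decision values (e.g., scikit-learn's decision_function) rather than using the fixed ±1 predictions shown here.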
Results
- (ROC curves for the experiments above; figures not reproduced here.)
- The PAD algorithm outperforms the OCSVM algorithm due to its use of a hierarchical prior to estimate the probabilities: it knows the likelihood of encountering a previously unencountered feature value.
- The ability of the OCSVM to detect anomalies relies on the choice of the kernel.
- We need a novel, well-defined kernel that accounts for highly discriminative information.

Conclusions
- Registry activity provides a good platform for anomaly detection.
- PAD is more reliable than the OCSVM.
- There is currently no known way to learn a "most optimal" kernel.
- By analyzing different algorithms, we can identify what needs to be captured in the kernel definition.

Future Work
- Improve the kernel definition to incorporate more discriminative information.
- Update the model as new data is labeled; this requires an efficient way of remodeling the data over time.
- Test our system on the UNIX platform.