A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data
Authors: Eleazar Eskin, Andrew Arnold, Michael Prerau, Leonid Portnoy, Sal Stolfo
Presenter: Marbin Pazos-Revilla
Cognitive Radio Group, TTU, 2011

Motivation
• Machine learning algorithms
  – Clustering
  – K-Means
  – SVM
• Datasets
  – KDD Cup
• Intrusion detection
• Among the best ROC curves and overall IDS performance

Contributions
• The authors proposed three improved methods (clustering, k-NN, and SVM) to be used in unsupervised intrusion detection
• The methods show very good performance (ROC curves)

Introduction
• Commercially available methods for intrusion detection employ signature-based detection
• The signature database has to be manually revised for each newly discovered attack; until an update is applied, systems are left vulnerable to the new attacks

IDS Types
• Misuse detection
  – Each instance in a data set is labeled as normal or intrusion, and a machine learning algorithm is trained over the labeled data
  – Classification rules
  – Manual updates are needed
• Anomaly detection
  – A set of normal data is given
  – A new set of data is tested, and the system is supposed to decide whether each instance is normal or not
  – It can detect new types of attacks

Supervised Anomaly Detection
• Supervised anomaly detection requires a set of purely normal data on which to train the model. If intrusions are present in the "normal" data, those intrusions will not be detected.
• In practice it is hard to obtain labeled or purely normal data
• If labeled data were produced by simulating intrusions, we would be limited to the set of attacks known to the simulation

Unsupervised Anomaly Detection
• The goal is to differentiate normal elements from anomalous elements buried in the data
• Does not require a purely normal training set
• No need for labeled data
• Raw data is much easier to obtain

Geometric Framework
• Maps data to a d-dimensional feature space
  – Intrusions are better captured in this feature space
  – Different types of data can be represented and mapped
• Data-dependent normalization feature map
• Spectrum kernel feature map
• Points can be classified as outliers (anomalies) based on their position in this space
• In general, anomalies tend to be distant from other points and to lie in sparse regions of the feature space

Datasets and Algorithms
• Datasets
  – KDD Cup 99 data (IDS dataset)
  – Lincoln Labs DARPA intrusion detection evaluation
• Algorithms
  – Clustering
  – k-NN
  – SVM

Unsupervised Anomaly Detection
• Intrusions are buried in the data
• Can help in forensic analysis
• Assumptions
  – The large majority of the elements are normal
  – Anomalies are qualitatively different from normal instances
• Under these assumptions, anomalies are rare and different from normal elements, and therefore show up as outliers

Geometric Framework for Unsupervised Anomaly Detection
• Records from the audit stream are mapped into a feature space by a map φ
• The distance between two elements x1, x2 in the feature space then becomes
  d(x1, x2) = ‖φ(x1) − φ(x2)‖
  or, expanded into dot products,
  d(x1, x2) = sqrt( φ(x1)·φ(x1) − 2 φ(x1)·φ(x2) + φ(x2)·φ(x2) )

In many cases it is difficult to map data instances to a feature space and calculate distances explicitly
• High dimensionality of the feature space (memory considerations)
• An explicit map might be difficult to determine
We can define a kernel function K(x1, x2) = φ(x1)·φ(x2) to compute these dot products directly in the (Hilbert) feature space
Distances can then be obtained from kernel evaluations alone (see the sketch below)

• Radial basis kernel function, defined over input spaces that are vector spaces:
  K(x1, x2) = exp( −‖x1 − x2‖² / σ² )
• Using convolution kernels we can work with arbitrary input spaces
• The authors suggest convolution kernels to avoid converting audit data into a vector in R^d
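The following is a minimal sketch (not code from the paper) of the distance computation above: distances in the kernel-induced feature space are obtained purely from kernel evaluations, here with the radial basis kernel. The example vectors and the σ value are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis kernel K(x, y) = exp(-||x - y||^2 / sigma^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / sigma ** 2)

def feature_space_distance(x, y, kernel=rbf_kernel):
    """d(x, y) = sqrt(K(x, x) - 2 K(x, y) + K(y, y)), i.e. ||phi(x) - phi(y)||
    computed without ever forming the feature map phi explicitly."""
    return float(np.sqrt(kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)))

# Illustrative only: two already-normalized connection records.
a = np.array([0.1, 0.0, 0.3, 0.2])
b = np.array([0.9, 0.4, 0.1, 0.8])
print(feature_space_distance(a, b))
```

Because the distance is expressed entirely through K, the same function works with any kernel, which is what allows convolution kernels over raw audit records to plug into the same framework.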
Detecting Outliers
• Detect points that are distant from other points or that lie in relatively sparse regions of the feature space

Cluster-based Estimation
• Count the number of points within a sphere of radius w around each point
• Sort the clusters by size
• The points in the small clusters are labeled anomalous

Cluster-based Estimation
• Two points x1, x2 are considered near if their distance satisfies d(x1, x2) ≤ w
• Define N(x) to be the number of points that are within w of point x
• Since we have to compute the pairwise distances among points, computing N(x) for all points has complexity O(n²)
• We are only interested in the outliers

• To reduce computation, N(x) can be approximated via fixed-width clustering (a sketch follows this list)
  – The first point is the center of the first cluster
  – Every subsequent point that is within w of a cluster center is added to that cluster
    • Otherwise it becomes the center of a new cluster
  – Points may be added to several clusters
  – Complexity is O(cn), with c the number of clusters and n the number of data points
  – A threshold on the cluster size is used to flag outliers
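A minimal sketch of the fixed-width clustering approximation just described (not the authors' code). It assumes the records have already been mapped and normalized into an (n, d) NumPy array, and it simplifies the labeling step by reading N(x) off the size of the cluster nearest to each point; the width w and the size threshold are illustrative values to be tuned.

```python
import numpy as np

def fixed_width_clustering(points, w):
    """One pass over the data: the first point seeds the first cluster; each
    later point is added to every existing cluster whose center lies within w,
    and seeds a new cluster if none does.  Roughly O(c*n) distance
    computations for c clusters and n points."""
    centers, counts = [points[0]], [1]
    for p in points[1:]:
        dists = np.linalg.norm(np.asarray(centers) - p, axis=1)
        hits = np.flatnonzero(dists <= w)
        if hits.size == 0:
            centers.append(p)
            counts.append(1)
        else:
            for i in hits:
                counts[i] += 1
    return np.asarray(centers), np.asarray(counts)

def cluster_based_outliers(points, w, size_threshold):
    """Approximate N(x) by the size of the cluster whose center is nearest to x
    and flag points whose cluster is smaller than `size_threshold`."""
    centers, counts = fixed_width_clustering(points, w)
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    return counts[nearest] < size_threshold

# Illustrative usage on synthetic 2-D data: a dense blob plus a few stray points.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.5, size=(200, 2)),
                  rng.normal(5.0, 0.5, size=(5, 2))])
print(cluster_based_outliers(data, w=1.0, size_threshold=10).sum())
```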
K-Nearest Neighbor
• Find points that lie in a sparse region of the feature space by computing the distances to the k nearest neighbors of each point
• Points in dense regions have many points near them and therefore a small k-NN score
• If k exceeds the frequency of any given attack, and the images of the attack elements are far from the images of the normal elements, then the k-NN score can be used to detect attacks

• k-NN is computationally expensive
• Since we are interested only in the k nearest points to a given point, we can reduce the computational cost by using canopy clustering
  – Canopy clustering is used to reduce the space into smaller subsets, avoiding the need to check every data point

Modified Canopy Clustering
• Cluster the data with the fixed-width approach, with the variation that each element is placed in only one cluster
• For any two points x1, x2 in the same cluster: d(x1, x2) ≤ 2w
• And in all cases, if c is the center of the cluster containing x2: d(x1, c) − w ≤ d(x1, x2) ≤ d(x1, c) + w

• Let C be the set of clusters (initially containing all clusters in the data)
• At any step we have a set of points that are potentially among the k nearest neighbors; this set is denoted P
• We also have a set of points that are confirmed to be among the k nearest points; this set is denoted K
• Initially K and P are empty

• Pre-compute the distance from x to each cluster center
• For the cluster with center closest to x, remove it from C and add all of its points to P; this is called opening the cluster
• For the clusters still in C we can use the lower bound d(x, xi) ≥ d(x, c) − w for any point xi in a cluster with center c; let dmin be this bound for the closest cluster remaining in C
• For each point xi in P we compute its distance to x
• If this distance is < dmin, we can guarantee that xi is closer to x than all the points in the clusters still in C

• In that case we remove xi from P and add it to K
• If the distance is > dmin, we open the closest remaining cluster: add all of its points to P and remove that cluster from C
• Every time we remove a cluster from C, dmin increases
• Once K has k elements we terminate

• Computation is spent checking distances from the points in the data set D to the cluster centers, which is more efficient than computing pairwise distances among all points
• The choice of w affects only the efficiency, not the k-NN score
• Intuitively, we want a w that splits the data into reasonably sized clusters

One-Class SVM
• Map the feature space into a second feature space with a radial basis kernel
• The standard SVM is a supervised learning algorithm (it requires labeled data)

• A modified SVM is adapted into an unsupervised learning algorithm
• It attempts to separate the entire set of data from the origin with maximal margin
• Classes are labeled +1 and −1

• The hyperplane is specified by its normal vector w in the feature space and its offset ρ from the origin
• Decision function: f(x) = sgn( w · φ(x) − ρ )

The optimization is solved with a variant of Sequential Minimal Optimization (SMO); a small library-based sketch is given at the end of the deck

Feature Space
• Data sets
  – Network records with 41 features and about 4,900,000 instances (KDD Cup 1999 data)
  – System call traces (per process) from 5 weeks of Basic Security Module audit data from the 1999 MIT Lincoln Labs IDS evaluation

Experimental Results

ROC Curves

• Questions
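To make the one-class SVM slides concrete, here is a minimal, library-based sketch (not the authors' implementation, which solves the optimization with its own SMO variant) using scikit-learn's OneClassSVM with an RBF kernel. The synthetic 41-feature data and the nu and gamma values are illustrative assumptions only.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical stand-in for normalized KDD-style records: mostly "normal"
# points plus a small fraction of qualitatively different ones.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(950, 41))
rare = rng.normal(loc=4.0, scale=1.0, size=(50, 41))
X = np.vstack([normal, rare])

# nu upper-bounds the fraction of training points treated as outliers;
# gamma controls the RBF width.  Both would need tuning on real audit data.
model = OneClassSVM(kernel="rbf", nu=0.05, gamma=1.0 / X.shape[1])
model.fit(X)

pred = model.predict(X)              # +1 = normal side of the hyperplane, -1 = outlier
scores = model.decision_function(X)  # signed distance from the separating hyperplane
print("flagged as anomalous:", int((pred == -1).sum()))
```

ROC curves like those shown in the results section are obtained by sweeping a detection threshold over a continuous score such as `scores`, rather than using the fixed sign cutoff.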