Net-Centric Software & Systems Consortium Kick-off Meeting
February 26-27, 2009

Self-Detection of Abnormal Event Sequences

Farokh B. Bastani, UT-Dallas, bastani@utdallas.edu
I-Ling Yen, UT-Dallas, ilyen@utdallas.edu
Latifur Khan, UT-Dallas, lkhan@utdallas.edu

Problem Description
• There are numerous types of event-based workflows in net-centric systems
  • E.g., call control signal processing, network accesses, access to resources, access to data, etc.
• Need for abnormal behavior detection
  • Event-based workflows may incur software & system faults, operational errors, attacks, fraud, and illegitimate manipulations, resulting in abnormal behaviors
  • If the abnormal behavior can be detected, proactive techniques can be used to mitigate the problem

Existing Solutions
• Many data mining and machine learning algorithms can be used to classify normal and abnormal events
  • Bayesian networks, neural networks, decision trees, k-means, support vector machines (SVM), hidden Markov models, etc.
• Problem: Which method to use?
  • Data set dependent: must explore the best approach for each dataset
  • Feature extraction from raw data can have a significant impact on the prediction quality: must explore various feature extraction models
• Problem: How to mine event sequences?
  • Automata-based approach: Given known event sequences, cluster them and determine the abnormal ones (no well-established clustering techniques)
  • Episode-based approach: Need to mine the event sequences first, then cluster them and determine the abnormal ones (episode mining techniques are well established, but there is not much research on clustering)

Our Solution
• Multivariate automata and episode mining
  • Unknown event sequence: Use episode mining
  • Automata merging for known or mined event sequences
    • Multiple variables result in a huge state space
    • Use dominance parameters and weights to merge states
    • Develop techniques to merge automata efficiently (hashing, clustering)
• Identify abnormal event sequences
  • Use clustering techniques to identify outliers
  • Need effective clustering techniques
    • Need to handle event sequences with different lengths
    • Need to integrate inter-event parameters in the clustering process
  • Manual help to identify actual faulty event sequences offline

Our Solution (Cont.)
• Develop a feedback based self-improving mechanism
  • When the prediction error exceeds a threshold, adjust the algorithm
    • Use multiple algorithms to provide fine tuning
    • E.g., use weighted decision from multiple algorithms
    • Fine tune feature set extractions and use dimension reduction mechanisms to obtain faster and better results
  • Off-line analysis to achieve improvements and feed the improvements to the online model
    • Adjusted algorithm, revision of features, addition of inter-ES features
  [Figure: feedback loop; recoverable labels: faulty data injector, data sets, all data sets, classifier, classifier analysis, current prediction, feedback]
• Develop fault-injection techniques to induce self-learning
  • Establish the faulty pattern library from data that have been learned
  • Inject faulty patterns to train the mining process and to measure the effects (use faulty pattern library and develop fault generation algorithms)

Experimental Plan
• Develop techniques for abnormal event sequence detection
  • Develop automata generation and merging techniques
  • Study the effects of various clustering algorithms on various event sequence datasets
    • Consider signal flow data from Cisco
    • Consider network-based intrusion detection datasets
    • Consider human interoperations (if possible)
• Develop the models and methods for dynamic adaptation
  • Algorithmic adaptation and feature set extraction adaptation

Industry Member Benefits
• The abnormal behavior prediction approach can be applied to many net-centric applications that are event-based and workflow-oriented
  • Call control signal processing
  • Resource and database access control
  • System health monitoring for real-time embedded systems, including avionic systems, space-based systems, etc.
  • Application-dependent workflows, e.g.
    monitoring the behavior of drivers on roads
• Need real data and related knowledge from industry for analysis, model construction, and effectiveness analysis

Deliverables and Budget
• First year, $30K: Develop the basic multivariate automata mining and abnormal sequence detection techniques
  • First quarter: Work with industrial partner to understand the data and develop a pre-processor to extract event patterns
  • Second and third quarters: Develop the automata merging and automata clustering techniques
  • Fourth quarter: Apply the techniques to the dataset and validate the approaches
• Second year, $30K: Develop dynamic learning techniques
  • Develop the feedback learning approach
  • Develop tools to efficiently achieve self-learning

STATUS QUO:
Many data clustering algorithms can be used for abnormal event detection, but they do not self-adapt, and data features have to be identified in advance.

NEW INSIGHTS:
Key objectives:
• Dynamic learning
• Adaptive feature extraction

MAIN ACHIEVEMENT:
Applied data clustering algorithms to various data sets to study their effectiveness. The experiments show that Support Vector Machine yields the best results for 90% of the data sets. Developed an improved SVM algorithm to further improve data clustering outcomes. Developed methods for clustering sparse data sets.

Comparison of Prediction Accuracy

Methods \ Datasets    dataset A    dataset B
Item-Based            61.326       60.132
User-Based            61.271       60.321
LPWSI                 68.3         67.0
LPKA                  71.5         70.0
KAWOK                 72.5         70.4
BIC-aiNet             82 (80% training, 20% test)
BMKL                  84.03        83.53
BMKL with NSM         84.13        83.71

HOW IT WORKS:
• Dynamic learning: Develop a feedback-based self-improving mechanism that improves the clustering algorithm on the fly based on a small set of data, and verify the improvement off-line on a large volume of historical data
• Automated feature extraction: Build a workflow and event model to allow automatic extraction of data features, including event characteristics, inter-event effects, etc. Try to improve the precision of abnormality prediction by improving the extracted features.

ASSUMPTIONS AND LIMITATIONS:
• Availability of data

QUANTITATIVE IMPACT:
• With dynamic learning: More accurate abnormal event prediction results and fewer false alarms
• Automated feature extraction: Obtain features that can optimize the prediction effectiveness

END-OF-PHASE GOAL:
• Develop an abnormal event detection algorithm that can dynamically adapt through learning and can automatically extract the best features for optimal prediction
• Apply the technique to Cisco signal flow data and to network intrusion detection

Can be used for abnormal event sequence detection for many event-based applications.
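
Illustrative sketch (not from the slides): the feedback-based self-improving mechanism is described only at a high level, so the Python below shows one possible reading of "use weighted decision from multiple algorithms" combined with "when the prediction error exceeds a threshold, adjust the algorithm". Everything concrete here is an assumption made for illustration: the class name FeedbackEnsemble, the multiplicative penalty rule, the threshold, penalty, and window values, the three toy detectors, and the SIP-like event names.

```python
from typing import Callable, List, Sequence, Tuple

# A detector maps an event sequence to True ("abnormal") or False ("normal").
Detector = Callable[[Sequence[str]], bool]


class FeedbackEnsemble:
    """Weighted vote over several detectors, re-weighted from labeled feedback.

    Hypothetical sketch: the update rule and default values are assumptions.
    """

    def __init__(self, detectors: List[Detector], error_threshold: float = 0.2,
                 penalty: float = 0.5, window: int = 4) -> None:
        self.detectors = detectors
        self.error_threshold = error_threshold  # assumed value
        self.penalty = penalty                  # assumed value
        self.window = window                    # feedback items per evaluation
        self.weights = [1.0] * len(detectors)
        self._recent: List[Tuple[List[bool], bool, bool]] = []

    def _votes(self, seq: Sequence[str]) -> List[bool]:
        return [detect(seq) for detect in self.detectors]

    def predict(self, seq: Sequence[str]) -> bool:
        """Return True if more than half of the total weight votes 'abnormal'."""
        votes = self._votes(seq)
        abnormal_weight = sum(w for w, v in zip(self.weights, votes) if v)
        return abnormal_weight > sum(self.weights) / 2

    def feedback(self, seq: Sequence[str], is_abnormal: bool) -> bool:
        """Record one labeled sequence; return True if weights were adjusted."""
        self._recent.append((self._votes(seq), self.predict(seq), is_abnormal))
        if len(self._recent) < self.window:
            return False
        errors = sum(pred != label for _, pred, label in self._recent)
        adapted = errors / len(self._recent) > self.error_threshold
        if adapted:
            # Shrink the weight of each detector once per wrong vote in the window.
            for votes, _, label in self._recent:
                for i, vote in enumerate(votes):
                    if vote != label:
                        self.weights[i] *= self.penalty
            # Rescale so the weights again sum to the number of detectors.
            total, n = sum(self.weights), len(self.weights)
            self.weights = [w * n / total for w in self.weights]
        self._recent.clear()
        return adapted


# Toy detectors (hypothetical): one useful, two that only look at length.
detectors: List[Detector] = [
    lambda seq: "ERROR" in seq,
    lambda seq: len(seq) > 3,
    lambda seq: len(seq) >= 5,
]
ensemble = FeedbackEnsemble(detectors)

# Labeled event sequences (hypothetical): True means abnormal.
labeled = [
    (["INVITE", "RINGING", "OK", "ACK", "BYE"], False),
    (["INVITE", "ERROR"], True),
    (["INVITE", "OK", "BYE"], False),
    (["INVITE", "RINGING", "OK", "ERROR", "BYE"], True),
]

before = [ensemble.predict(seq) for seq, _ in labeled]
adapted = [ensemble.feedback(seq, label) for seq, label in labeled]
after = [ensemble.predict(seq) for seq, _ in labeled]
print("before:", before)
print("adapted:", adapted)
print("after:", after)
```

In this toy run the two length-based detectors initially outvote the error-based one. Once a full window of labeled feedback shows an error rate above the threshold, their weights shrink and the ensemble's decisions follow the more reliable detector. A real system would replace the lambdas with trained classifiers and would likely verify any adjustment off-line on historical data, as the slides propose.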