EMM Beta
Release Date March 29, 2008

General Information:
This is a Java implementation of the Extensible Markov Model (EMM) spatiotemporal modeling technique [3]. It consists of code that creates an EMM from a training dataset and then uses the created model for some application. The Beta version supports only anomaly (rare event) detection. General EMM information is available online at http://www.engr.smu.edu/cse/dbgroup/emm.html [1] and in published articles [3]. All code has been developed by Jie Huang and Charlie Isaksson.

The project is implemented in Java and was developed in NetBeans IDE 6.0.1. Other IDEs can be used, but the following discussion assumes NetBeans. You can download NetBeans from http://www.netbeans.org/. If you do not have a JDK installed on your system, you can download one (JDK 6 Update 5 with NetBeans 6.0.1) from the Sun home page: http://java.sun.com/javase/downloads/index.jsp. From this link you may choose to download only the JDK or a bundle of the JDK and NetBeans.

Installation:
The project is delivered as a compressed tar.gz archive. To uncompress it in a Unix environment use:
  tar xfz JEMM.tar.gz
In a Windows environment, you can use WinZip or other software. Extract all files. At this point a directory named JEMM is created. It contains five subdirectories:
  build: Executables (including classes)
  dist: Created by NetBeans (jar files)
  nbproject: Created by NetBeans
  src: Java source files
  test: Created by NetBeans

Execution:
1. When NetBeans is started, begin execution of EMM by opening an existing project from the File menu. Select the location of the JEMM folder created when the download file was uncompressed. At this point you will see JEMM listed as an existing project. Future executions can then simply use this project.
2. At this point all of the EMM libraries and code are loaded into the IDE environment.
3.
Begin EMM execution by clicking either the green arrow or Run Main Project on the NetBeans toolbar.

Each EMM run creates a new EMM from a supplied input dataset and executes the chosen application against that EMM. The EMM Beta version executes only the anomaly detection application. EMM execution is performed through a GUI, as seen in Figure 1. The GUI has two tabs:
  The EMM tab is used to set up and run EMM.
  The Visualization tab is used to see the results of running an EMM application (anomaly detection).

2/16/2016 EMM Beta 1

Figure 1. EMM GUI

Parameters:
Prior to running EMM, the following input parameters must be supplied:

Similarity Measure: You can choose among the following similarity measures: Cosine, Dice, Jaccard, and Overlap [2]. The measure is used to determine the similarity between an input vector from the input dataset and each existing cluster. The current implementation assumes that a cluster is represented by the centroid of the vectors placed in it.

Threshold: Once you have chosen the similarity measure, the threshold must be selected. The default threshold is 0.8, but any value from 0 to 1 can be set. The threshold determines whether the input vector is placed in an existing cluster. Using a simple nearest neighbor approach (in this implementation), the centroid of the closest existing cluster is found. If the similarity to that centroid is less than the threshold, a new cluster is created with the input vector as the starting centroid. Otherwise, the input vector is added to that cluster.

Input File Options: There are three options for supplying the input files (training and testing):
  Use Training Set: If this option is selected, an EMM is constructed using the indicated dataset, but no application is run on that EMM.
  Supplied Test Set: With this option two separate input files are supplied. The first is for training and the second is for testing.
Once the EMM is constructed using the training dataset, testing begins automatically using the indicated application. During this testing phase, learning of the EMM continues.
  Percentage Split: Only one dataset is supplied, and it is divided into a training set and a test set based on the indicated split percentage. The entered value is the percentage used for training. The remainder is used for testing. The input dataset must be indicated by clicking on the Set Train button.

The names of the files to be used are entered by clicking on the Set Train (and Set Test) buttons. Based on the chosen input file option, the needed input file buttons are highlighted. Only files from the JEMM directory (or its subdirectories) can be used at this time.

Input File Format:
The input file is assumed to be a simple text file. Each input vector is represented by one line of input. The individual vector values are numeric and separated by tabs. Each vector is terminated by a linefeed. An attempt to execute with an invalid file, or with one not found inside the JEMM directory, results in a “FileNotFoundException” error message.

Execution Output:
After a successful EMM run, the output is shown in the EMM Output panel. A sample output is shown in Figure 2. The “== New states, training mode ==” heading is followed by a list of all states created. States are labeled with integers assigned in ascending numeric order. (State 1 is always assumed to exist by default.) The “== EMM Transitions ==” heading is followed by a list of state transitions with the corresponding count for each.

== New states, training mode ==
new state: 2
new state: 3
new state: 4
new state: 5
new state: 6
new state: 7
== EMM Transitions ==
5:2 1:2 6:4 2:6 2:3 2:1 3:4 4:5 5:7
1 2 1 1 1 1 1 2 1

Figure 2. EMM Output

When EMM is executed, a trace of the run is produced in the bottom panel of the NetBeans execution screen. This output, labeled (Output – JEMM(run)), contains a trace of execution.
A sample run output is shown in Figure 3. The first five lines trace the execution sequence, including a timestamp. The last two lines indicate that a file-not-found error has occurred.

init:
deps-jar:
compile:
run:
Mar 28, 2008 1:12:49 PM jemm.JEMMView buStartMouseClicked
SEVERE: null
java.io.FileNotFoundException: MaggiDataset.txt (The system cannot find the file specified)

Figure 3. Sample EMM Trace

The screen displayed by the Visualization tab is shown in Figure 4. This screen is divided into two panels. The top panel shows the anomaly input states. An overview of the anomaly detection algorithm can be found in [3]. The bottom panel shows a synopsis of the EMM structure that exists at the end of the EMM run: the centroid of each EMM node (in cluster number order), followed by the count of input states placed in that node.

Figure 4. Visualization Screen

Output Files:
EMM output files are placed in the JEMM folder. Each run creates three output files:
  xxx_link – Text file showing the transitions and their counts. (Same as that seen in Figure 2.)
  xxx_numOfState – Text file showing states and their counts.
  xxx_state – Text file showing centroids for all clusters (states) and their counts. (Same as that in Figure 4.)
Here xxx is replaced with Dice, Jaccard, or Cosine, based on the similarity measure used for that run.

Example:
A simple example of EMM creation can be found in [3]. The sample dataset is found in the JEMM directory under the name Maggidataset.txt and is shown in Figure 5. Each input vector consists of 7 numeric values. There are 12 input states. The EMM created from this input (Jaccard, 0.8) is shown in Figure 4. The steps in the creation of this EMM are shown as Figure 2 in [3].
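The clustering and transition counting described under Parameters and Execution Output can be illustrated with a minimal Java sketch. This is not the JEMM source. The class and method names (EmmSketch, learn, parseLine) are hypothetical. The similarity is the extended Jaccard measure for real-valued vectors, which is assumed here to be what the Jaccard option computes. Updating the centroid as a running mean, and letting the first input vector found state 1, are also assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names are hypothetical, not taken from JEMM. */
final class EmmSketch {
    final double threshold;
    final List<double[]> centroids = new ArrayList<>();  // index 0 holds state 1
    final List<Integer> counts = new ArrayList<>();      // vectors placed in each state
    final Map<String, Integer> transitions = new LinkedHashMap<>(); // "from:to" -> count
    private int current = -1;                            // -1 until the first vector arrives

    EmmSketch(double threshold) { this.threshold = threshold; }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    /** Extended Jaccard similarity (assumed definition): a.b / (a.a + b.b - a.b). */
    static double jaccard(double[] a, double[] b) {
        double ab = dot(a, b);
        double denom = dot(a, a) + dot(b, b) - ab;
        return denom == 0.0 ? 1.0 : ab / denom;  // two all-zero vectors count as identical
    }

    /** Parses one input line: numeric values separated by tabs. */
    static double[] parseLine(String line) {
        String[] parts = line.trim().split("\t");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i].trim());
        return v;
    }

    /** Places one vector in a state and records the transition; returns the 1-based state label. */
    int learn(double[] v) {
        int best = -1;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {      // nearest centroid search
            double s = jaccard(v, centroids.get(i));
            if (s > bestSim) { bestSim = s; best = i; }
        }
        if (best < 0 || bestSim < threshold) {            // not similar enough: new state
            centroids.add(v.clone());
            counts.add(1);
            best = centroids.size() - 1;
        } else {                                          // similar enough: join, update running mean
            double[] c = centroids.get(best);
            int n = counts.get(best);
            for (int j = 0; j < c.length; j++) c[j] = (c[j] * n + v[j]) / (n + 1);
            counts.set(best, n + 1);
        }
        if (current >= 0) transitions.merge((current + 1) + ":" + (best + 1), 1, Integer::sum);
        current = best;
        return best + 1;
    }

    public static void main(String[] args) {
        EmmSketch emm = new EmmSketch(0.8);
        String[] lines = {"10\t0", "10\t1", "0\t10", "10\t0"};  // hypothetical toy data
        for (String line : lines) System.out.println("state " + emm.learn(parseLine(line)));
        System.out.println(emm.transitions);
    }
}
```

With the toy data in main, the four vectors are placed in states 1, 1, 2, and 1, so the recorded transitions are 1:1, 1:2, and 2:1, each with a count of 1. The sketch makes no claim about the states the real implementation produces for Maggidataset.txt.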
20 20 40 15 40 5 0 20 45 15 5 10 50 80 30 60 15 5 35 60 40 20 45 30 100 50 75 30 25 40 55 30 15 40 55 10 30 20 20 30 10 35 2 11 18 40 10 4 25 10 30 10 35 10 1 20 20 10 10 15 4 10 20 10 40 5 3 15 20 10 15 15 10 10 25 15 9 4 5 10 15 14 0 10

Figure 5: Sample Input File (Maggidataset.txt)

Sample Input Files:
There are several other sample input files in addition to Maggidataset.txt. These include files based on the DARPA Intrusion Detection datasets, as reported in [4]:
  normalDataDARPA1999.txt: Contains a training dataset covering one week without any network attacks.
  normalDataDARPA1999_AND_2000DDoS.txt: Contains a training dataset covering two weeks without any network attacks.
  2000Attack.txt: Contains a testing dataset with network attacks.

Error Messages:

Contact:
The application is still under development. Please contact Charlie Isaksson at charliei@engr.smu.edu if you have any questions. The overall EMM project is under the direction of Professor Margaret H. Dunham, who can be contacted at mhd@engr.smu.edu.

EMM References:
[1] Dunham, Margaret H., EMM Web Page, http://www.engr.smu.edu/cse/dbgroup/emm.html.
[2] Dunham, Margaret H., Data Mining Introductory and Advanced Topics, Prentice Hall, 2003.
[3] Jie Huang, Yu Meng, and Margaret H. Dunham, “Extensible Markov Model,” Proceedings IEEE ICDM Conference, November 2004, pp 371-374.
[4] Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.