"Politehnica" University of Timişoara Faculty of Automation and Computers Departament of Computer and Software Engineering 2, Vasile Pârvan Bv., 300223 – Timişoara, Romania Tel: +40 256 403261, Fax: +40 256 403214 Web: http://www.cs.upt.ro A NOMALY D ETECTION IN D ATA M INING . H YBRID A PPROACH BETWEEN F ILTERING AND -R EFINEMENT AND DBSCAN Dissertation Thesis Ștefan-Iulian Handra Supervisors: Prof. Dr. Eng. Horia Ciocârlie Optionally (and if there is enough available space), a representative picture can be inserted here. Timişoara, 2011 1 Summary 1. Introduction ...................................................................................................................... 6 1.1. The importance of anomaly detection. ....................................................................... 6 1.2. Short description of this paper. ................................................................................... 6 2. Problem statement ........................................................................................................ 7 3. Theoretical Foundation ................................................................................................ 8 4. State-of-the-art ............................................................................................................... 9 5. Proposed Solution and Research Methodology ................................................. 10 6. Implementation ............................................................................................................ 11 6.1. Technologies used ....................................................................................................... 11 6.1.1. WEKA ..................................................................................................................... 11 6.1.2. The JAVA programming language ...................................................................... 11 6.1.3. 
Eclipse environment ............................................................................................... 11 One of the most well known and recognized IDE’s (Integrated Development Environment) is the Eclipse IDE. It is used for a wide variety of programming languages and has many plugins suited for any need of analyzing the code or it’s properties both statically and dynamically. .................................................................................................................................................. 11 As we chose to work with JAVA as a primary language to process the data sets and algorithms used to generate the anomalies, Eclipse was the natural choice to work with. Below we have a screenshot of the DEBUG perspective from Eclipse which provides the user with the possibility to observe and control the flow of the program dynamically. Another perspective used was the JAVA perspective which allowed the viewing of the errors generated after compiling JAVA code. .................................................................................... 11 2 .................................................................................................................................................. 11 Also using the Navigator provides the possibility to view and scroll thru the whole class hierarchy. We used the Eclipse Galileo version 3.5.1 to develop the JAVA programs. .......... 12 6.1.4. Tortoise SVN ........................................................................................................... 12 During the development of large and complex projects we constantly need to keep track of the changes done to all the parts of the project. Also because the resources with which we develop the project may constantly be changing we need a means of going back to previous versions of source code, documents or experimental results. In older times we needed to keep ourselves an archive for all the changes done for a project. 
But today we can rely on useful versioning systems. .................................................................................................................. 12 Tortoise SVN is a resource control software application for Microsoft Windows and one of the best standalone Apache subversion clients. While researching with different data sets we needed to keep a clear evidence of the results. Also for the numerous code changes required the Tortoise SVN proved to be very useful. Because it is implemented as a Windows shell extension it is compatible with Windows Explorer. Furthermore because it’s not designed to be integrated for a specific IDE we used it with the tools we required without any dependency problems. .................................................................................................................................. 12 Below we can see in Windows Explorer a visualization of some of the experimental results being administered with Tortoise. The green marked files have the latest revision from the repository and the red marked file has modifications not yet committed to the repository. .... 12 3 .................................................................................................................................................. 12 The real power given by Tortoise SVN was the possibility to see and analyze all the changes that where ever done on the repository files. This means, like in the picture from below, that we could browse any of the old versions of any of the files and do comparisons between them. .................................................................................................................................................. 12 .................................................................................................................................................. 13 7. Experimental Results ................................................................................................. 14 8. 
Contributions ................................................................................................................ 15 9. Conclusions and Future Work ................................................................................. 16 10. Bibliography ............................................................................................................. 17 11. List of figures ........................................................................................................... 18 12. List of tables ............................................................................................................. 19 13. List of acronyms ...................................................................................................... 20 4 14. Annexes ..................................................................................................................... 21 5 1. Introduction 1.1. The importance of anomaly detection. Anomaly detection is required today in almost every engineering domain. The most usage we can state that is done in insurance or health, critical safe systems, electronic and bank fraud detection or even military surveillance of enemy activities. In data mining it represents also a very important task. Over time algorithms and practices of detecting anomalies have been discovered since the 19th century. But only in recent years the theory has been put in practice thanks to the technology innovations. Many of the developed techniques have been developed specially for specific domains like internet security or analysis of sequence of data but the more powerful ones are more general hence applicable in more than one domain. Anomaly detection is also used for intrusion detection techniques. Actually so far intrusion detection techniques can be split into two main categories: anomaly detection and misuse detection techniques. 
Anomaly detection techniques such as IDES [13] keep a history of past activities and mark those that deviate from normal past behavior as possible intrusions. Misuse detection systems such as IDIOT [11] and STAT [8], on the other hand, analyze the patterns of known vulnerable points of the systems to identify intrusions.

Because many systems already store large volumes of data and history, we believe that using data mining techniques to populate the knowledge base of known anomaly detection techniques could boost their efficiency and accuracy. Even more importantly, the same data mining tools can be used in different domains to build anomaly detection patterns.

1.2. Short description of this paper.

2. Problem statement

3. Theoretical Foundation

Anomaly detection can be informally defined as the process of finding individual objects that are different from the normal objects. Usually these object instances have unusual properties or patterns with unexpected behavior. Anomalies are known under several names whose meanings differ only slightly. The most well-known ones are exceptions, surprises, peculiarities, contaminants, outliers, discordant observations and aberrations.

4. State-of-the-art

5. Proposed Solution and Research Methodology

6. Implementation

6.1. Technologies used

6.1.1. WEKA

6.1.2. The JAVA programming language

6.1.3. Eclipse environment

One of the best-known and most widely recognized IDEs (Integrated Development Environments) is Eclipse. It is used for a wide variety of programming languages and has many plug-ins for analyzing code and its properties, both statically and dynamically.

As we chose JAVA as the primary language for processing the data sets and implementing the algorithms used to generate the anomalies, Eclipse was the natural choice.
Below we have a screenshot of the DEBUG perspective of Eclipse, which lets the user observe and control the flow of the program dynamically. Another perspective used was the JAVA perspective, which shows the errors generated when compiling JAVA code.

Fig.6.3 – The Eclipse IDE

The Navigator view also makes it possible to browse the whole class hierarchy. We used Eclipse Galileo, version 3.5.1, to develop the JAVA programs.

6.1.4. Tortoise SVN

During the development of large and complex projects we constantly need to keep track of the changes made to all parts of the project. Because the resources with which we develop the project may also change constantly, we need a means of going back to previous versions of source code, documents or experimental results. In the past we had to keep our own archive of all the changes made to a project, but today we can rely on version control systems.

Tortoise SVN is a revision control client for Microsoft Windows and one of the best standalone Apache Subversion clients. While experimenting with different data sets we needed to keep a clear record of the results, and Tortoise SVN also proved very useful for tracking the numerous code changes that were required. Because it is implemented as a Windows shell extension, it integrates with Windows Explorer. Furthermore, because it is not tied to a specific IDE, we could use it with the tools we required without any dependency problems.

Below we can see, in Windows Explorer, some of the experimental results managed with Tortoise SVN. The files marked in green are at the latest revision from the repository, and the file marked in red has modifications not yet committed to the repository.
Fig.6.1 – Visualization of resources administered with Tortoise SVN

The real power of Tortoise SVN was the possibility to see and analyze all the changes ever made to the repository files. This means, as in the picture below, that we could browse any old version of any file and compare versions with each other.

Fig.6.2 – Visualization of changes logged with Tortoise SVN

7. Experimental Results

8. Contributions

9. Conclusions and Future Work

10. Bibliography

[1] Allen, E., Horvath, S., Kraft, P., Tong, F., Spiteri, E., Riggs, A., and Marahrens, Y. “High Concentrations of LINE Sequence Distinguish Monoallelically-Expressed Genes.” Proceedings of the National Academy of Sciences, 100(17), pp. 9940-9945, 2003.
[2] Barnett, V. and Lewis, T. “Outliers in Statistical Data.” John Wiley, 1994.
[3] Breiman, L. “Random forests.” Machine Learning, 45(1), pp. 5-32, 2001.
[4] Dokas, P., Ertoz, L., Kumar, V., Lazarevic, A., Srivastava, J., and Tan, P.-N. “Data Mining for Network Intrusion Detection.” Proc. NSF Workshop on Next Generation Data Mining, Baltimore, MD, 2002.
[5] Erman, J., Arlitt, M., and Mahanti, A. “Traffic Classification Using Clustering Algorithms.” Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data, University of Calgary, 2006.
[6] Fan, W. “Cost-sensitive, Scalable and Adaptive Learning Using Ensemble-based Methods.” PhD thesis, Columbia University, Feb 2001. http://www.cs.columbia.edu/~wfan/research.htm#CostSensitiveLearning
[7] Garcia-Teodoro, P., Diaz-Verdejo, J., Macia-Fernandez, G., and Vazquez, E. “Anomaly-based network intrusion detection: Techniques, systems and challenges.” Science Direct, 2009.
[8] Ilgun, K., Kemmerer, R. A., and Porras, P. A. “State transition analysis: A rule-based intrusion detection approach.” IEEE Transactions on Software Engineering, 21(3), pp. 181-199, March 1995.
[9] Kaplan, E. L. and Meier, P. “Nonparametric estimation from incomplete observations.” Journal of the American Statistical Association, 53, pp. 457-481, 1958.
[10] Knorr, E. and Ng, R. “Finding intensional knowledge of distance-based outliers.” Proc. 25th Int. Conf. on Very Large Data Bases, Edinburgh, Scotland, pp. 211-222, 1999.
[11] Kumar, S. and Spafford, E. H. “A software architecture to support misuse intrusion detection.” In Proceedings of the 18th National Information Security Conference, pp. 194-204, 1995.
[12] Liu, F. T., Ting, K. M., and Zhou, Z. “Isolation forest.” In ICDM’08, 2008.
[13] Lunt, T., Tamaru, A., Gilham, F., Jagannathan, R., Neumann, P., Javitz, H., Valdes, A., and Garvey, T. “A real-time intrusion detection expert system (IDES) - final technical report.” Technical report, Computer Science Laboratory, SRI International, Menlo Park, California, February 1992.
[14] Shi, T. and Horvath, S. “Unsupervised learning with random forest predictors.” In J. Computational and Graphical Statistics, 2006.
[15] Shi, T., Seligson, D., Belldegrun, A. S., Palotie, A., and Horvath, S. “Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma.” Modern Pathology, Oct 29, 2004.
[16] Lee, W., Stolfo, S. J., Chan, P. K., Eskin, E., Fan, W., Miller, M., Hershkop, S., and Zhang, J. “Real Time Data Mining-based Intrusion Detection.” Conference paper of the North Carolina State University at Raleigh, Department of Computer Science, Jan 2008.
[17] Yu, X., Tang, L. A., and Han, J. “Filtering and Refinement: A Two-Stage Approach for Efficient and Effective Anomaly Detection.” In Ninth IEEE International Conference on Data Mining, 2009.

11. List of figures

12. List of tables

13. List of acronyms

14. Annexes
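
The Introduction notes that anomaly detection techniques keep a history of past activities and mark the ones that deviate from normal past behavior. A minimal JAVA sketch of that general idea is given here as an illustration only. It is not the method of IDES [13] or the approach of this thesis: the class name HistoryProfile, the z-score rule, the threshold of 3.0 and the sample values are all assumptions chosen for the example.

```java
import java.util.Arrays;

/**
 * Hypothetical helper (not from the thesis): a profile of "normal" past
 * behaviour built from a numeric history, e.g. events counted per hour.
 */
public class HistoryProfile {
    private final double mean;
    private final double stdDev;

    /** Builds the profile from past observations (at least two are needed). */
    public HistoryProfile(double[] history) {
        if (history == null || history.length < 2) {
            throw new IllegalArgumentException("need at least two past observations");
        }
        this.mean = Arrays.stream(history).average().orElse(0.0);
        double sumSq = 0.0;
        for (double v : history) {
            sumSq += (v - mean) * (v - mean);
        }
        // Sample standard deviation of the history.
        this.stdDev = Math.sqrt(sumSq / (history.length - 1));
    }

    /**
     * Flags an observation whose distance from the historical mean exceeds
     * "threshold" standard deviations (assumed rule, simple z-score).
     */
    public boolean isAnomalous(double observation, double threshold) {
        if (stdDev == 0.0) {
            // Constant history: anything different from it counts as a deviation.
            return observation != mean;
        }
        return Math.abs(observation - mean) / stdDev > threshold;
    }

    public static void main(String[] args) {
        // Made-up history used only to exercise the sketch.
        double[] eventsPerHour = {10, 12, 11, 9, 10, 11, 12, 10};
        HistoryProfile profile = new HistoryProfile(eventsPerHour);
        for (double obs : new double[] {11, 40}) {
            System.out.println(obs + " anomalous? " + profile.isAnomalous(obs, 3.0));
        }
    }
}
```

With the made-up history above, an observation of 11 stays within the profile while 40 is flagged. A single global mean and standard deviation is the simplest possible notion of "normal past behavior"; it is used here only because it keeps the sketch short.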