"Politehnica" University of Timişoara
Faculty of Automation and Computers
Department of Computer and Software Engineering
2, Vasile Pârvan Bv., 300223 – Timişoara, Romania
Tel: +40 256 403261, Fax: +40 256 403214
Web: http://www.cs.upt.ro
ANOMALY DETECTION IN DATA MINING.
HYBRID APPROACH BETWEEN FILTERING-AND-REFINEMENT AND DBSCAN
Dissertation Thesis
Ștefan-Iulian Handra
Supervisors:
Prof. Dr. Eng. Horia Ciocârlie
Timişoara,
2011
Summary
1. Introduction
1.1. The importance of anomaly detection.
1.2. Short description of this paper.
2. Problem statement
3. Theoretical Foundation
4. State-of-the-art
5. Proposed Solution and Research Methodology
6. Implementation
6.1. Technologies used
6.1.1. WEKA
6.1.2. The JAVA programming language
6.1.3. Eclipse environment
6.1.4. Tortoise SVN
7. Experimental Results
8. Contributions
9. Conclusions and Future Work
10. Bibliography
11. List of figures
12. List of tables
13. List of acronyms
14. Annexes
1. Introduction
1.1. The importance of anomaly detection.
Anomaly detection is required today in almost every engineering domain. It is used
most widely in insurance and health care, safety-critical systems, electronic and
bank fraud detection, and even military surveillance of enemy activities. It is also a
very important task in data mining.
Algorithms and practices for detecting anomalies have been developed since the
19th century, but only in recent years has the theory been put into practice, thanks to
advances in technology. Many techniques have been developed for specific domains, such
as internet security or the analysis of data sequences, but the more powerful ones are
more general and therefore applicable in more than one domain.
Anomaly detection is also used for intrusion detection. Intrusion detection
techniques can so far be split into two main categories: anomaly detection and
misuse detection. Anomaly detection techniques such as IDES [13] keep a history of
past activities and mark those that deviate from normal past behavior as possible
intrusions. Misuse detection systems such as IDIOT [11] and STAT [8], on the other
hand, analyze patterns of known vulnerable points of a system to identify
intrusions.
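The history-based idea behind anomaly detection can be illustrated with a minimal sketch. It is not the IDES algorithm, only a hypothetical example: the class name ActivityProfile, the single numeric measurement per activity, and the threshold expressed in standard deviations are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration: flags observations that deviate from a history of past activity. */
class ActivityProfile {
    private final List<Double> history = new ArrayList<>();
    private final double threshold; // allowed deviation in standard deviations (assumed parameter)

    ActivityProfile(double threshold) {
        this.threshold = threshold;
    }

    /** Records a past observation that is considered normal behavior. */
    void record(double value) {
        history.add(value);
    }

    /** Arithmetic mean of the recorded history (0 if the history is empty). */
    double mean() {
        double sum = 0.0;
        for (double v : history) {
            sum += v;
        }
        return history.isEmpty() ? 0.0 : sum / history.size();
    }

    /** Sample standard deviation of the recorded history (0 with fewer than two values). */
    double stdDev() {
        if (history.size() < 2) {
            return 0.0;
        }
        double m = mean();
        double squares = 0.0;
        for (double v : history) {
            squares += (v - m) * (v - m);
        }
        return Math.sqrt(squares / (history.size() - 1));
    }

    /** True if the value lies more than threshold standard deviations from the historical mean. */
    boolean isAnomalous(double value) {
        if (history.size() < 2) {
            return false; // not enough history to judge
        }
        double sd = stdDev();
        if (sd == 0.0) {
            return value != mean(); // all past values identical
        }
        return Math.abs(value - mean()) / sd > threshold;
    }
}
```

In this sketch a profile is first filled with normal observations through record, after which isAnomalous marks new observations that fall far from the recorded behavior. A misuse detector would instead match activity against patterns of known attacks.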
Because many systems already store large volumes of data and history, we believe
that using data mining techniques to populate the knowledge base of known anomaly
detection techniques could boost their efficiency and accuracy. Even more
important, the same data mining tools can be used in different domains to build anomaly
detection patterns.
1.2. Short description of this paper.
2. Problem statement
3. Theoretical Foundation
Anomaly detection can be informally defined as the process of finding individual
objects that are different from the normal objects. Usually these object instances have
unusual properties or patterns with unexpected behavior. Anomalies are known under
several names that differ only slightly in meaning. The best-known ones are exceptions,
surprises, peculiarities, contaminants, outliers, discordant observations and aberrations.
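One common way to make the informal definition concrete is the distance-based outlier in the spirit of [10]: an object is an outlier if at least a fraction p of the other objects lie farther than a distance d from it. The sketch below is a naive illustration of that criterion only, not the hybrid approach of this thesis. The class name DistanceOutliers, the use of Euclidean distance, and the parameter names p and d are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of distance-based outliers using a naive O(n^2) scan. */
class DistanceOutliers {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the indices of the points for which at least a fraction p
     * of the other points lie at a distance greater than d.
     */
    static List<Integer> findOutliers(double[][] points, double p, double d) {
        List<Integer> outliers = new ArrayList<>();
        int n = points.length;
        for (int i = 0; i < n; i++) {
            int far = 0; // number of other points farther than d from point i
            for (int j = 0; j < n; j++) {
                if (j != i && distance(points[i], points[j]) > d) {
                    far++;
                }
            }
            if (n > 1 && far >= p * (n - 1)) {
                outliers.add(i);
            }
        }
        return outliers;
    }
}
```

For four points at the corners of the unit square plus one point at (10, 10), calling findOutliers with p = 0.9 and d = 3.0 reports only the index of the distant point.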
4. State-of-the-art
5. Proposed Solution and Research Methodology
6. Implementation
6.1. Technologies used
6.1.1. WEKA
6.1.2. The JAVA programming language
6.1.3. Eclipse environment
Eclipse is one of the best-known and most widely recognized IDEs (Integrated
Development Environments). It is used for a wide variety of programming languages and
has many plug-ins for analyzing code and its properties, both statically and
dynamically.
As we chose JAVA as the primary language to process the data sets and implement
the algorithms used to generate the anomalies, Eclipse was the natural choice.
The screenshot below shows the DEBUG perspective of Eclipse, which lets the user
observe and control the flow of the program dynamically. We also used the JAVA
perspective, which shows the errors generated when compiling JAVA code.
Fig.6.3 – The Eclipse IDE
The Navigator also makes it possible to view and scroll through the whole
class hierarchy. We used Eclipse Galileo, version 3.5.1, to develop the JAVA programs.
6.1.4. Tortoise SVN
During the development of large and complex projects we constantly need to keep
track of the changes made to every part of the project. Because the resources used to
develop the project may change constantly, we also need a means of going back to
previous versions of source code, documents or experimental results. In the past we
had to maintain an archive of all the changes made to a project ourselves, but today we
can rely on version control systems.
Tortoise SVN is a revision control application for Microsoft Windows and
one of the best standalone Apache Subversion clients. While experimenting with different
data sets we needed to keep a clear record of the results, and Tortoise SVN also proved
very useful for tracking the numerous code changes required. Because it is implemented
as a Windows shell extension, it integrates with Windows Explorer. Furthermore, because
it is not tied to a specific IDE, we could use it with the tools we required without any
dependency problems.
The figure below shows, in Windows Explorer, some of the experimental results
managed with Tortoise SVN. The files marked in green are at the latest revision from
the repository, and the file marked in red has modifications not yet committed to the repository.
Fig.6.1 – Visualization of resources administered with Tortoise SVN
The real power of Tortoise SVN was the possibility to see and analyze all the
changes ever made to the repository files. As the figure below shows, we could
browse any old version of any file and compare versions with each other.
Fig.6.2 – Visualization of changes logged with Tortoise SVN
7. Experimental Results
8. Contributions
9. Conclusions and Future Work
10. Bibliography
[1] Allen, E., Horvath, S., Kraft, P., Tong, F., Spiteri, E., Riggs, A., & Marahrens, Y. “High
Concentrations of LINE Sequence Distinguish Monoallelically-Expressed Genes",
Proceedings of the National Academy of Sciences, 100(17), pp. 9940-9945, 2003.
[2] Barnett V. and Lewis T. “Outliers in statistical data.” John Wiley, 1994.
[3] Breiman, L., “Random forests", Machine Learning, 2001, 45:5-3
[4] Dokas, P, Ertoz, L, Kumar, V, Lazarevic, A., Srivastava, J and Tan, P-N “Data Mining for
Network Intrusion Detection.” Proc. NSF Workshop on Next Generation Data Mining,
Baltimore, MD, 2002
[5] Erman J., Arlitt M., Mahanti A. “Traffic Classification Using Clustering Algorithms“.
Proceedings of the 2006 SIGCOMM workshop on Mining network data, University of
Calgary, 2006.
[6] Fan W. “Cost-sensitive, Scalable and Adaptive Learning Using Ensemble-based
Methods.”, PhD thesis, Columbia University, Feb 2001.
http://www.cs.columbia.edu/~wfan/research.htm#CostSensitiveLearning
[7] Garcia-Teodoro P., Diaz-Verdejo J., Macia-Fernandez G., Vazquez E. “Anomaly-based
network intrusion detection: Techniques, systems and challenges.” Science Direct,
2009
[8] Ilgun K., Kemmerer R. A., and Porras P. A., “State transition analysis: A rule-based
intrusion detection approach.” IEEE Transactions on Software Engineering, 21(3):181–
199, March 1995.
[9] Kaplan, E. L. and Meier, P. “Nonparametric estimation from incomplete observations.”
Journal of the American Statistical Association, 53, 45748, 1958
[10] Knorr E. and Ng R. “Finding intensional knowledge of distance-based outliers.” Proc.
25th Int. Conf. on Very Large Data Bases, Edinburgh, Scotland, pp. 211-222, 1999.
[11] Kumar S. and Spafford E. H. “A software architecture to support misuse intrusion
detection.” In Proceedings of the 18th National Information Security Conference, pages
194–204, 1995.
[12] Liu F.T., Ting K.M., and Zhou Z. “Isolation forest.” In ICDM’08, 2008
[13] Lunt T., Tamaru A., Gilham F., Jagannathan R., Neumann P., Javitz H., Valdes A.,
and Garvey T. “A real-time intrusion detection expert system (IDES) - final technical
report.” Technical report, Computer Science Laboratory, SRI International, Menlo Park,
California, February 1992.
[14] Shi T. and Horvath S. “Unsupervised learning with random forest predictors.” In J.
Computational and Graphical Statistics, 2006.
[15] Shi, T., Seligson, D., Belldegrun, A. S., Palotie, A., Horvath, S, “Tumor classification
by tissue microarray profiling: random forest clustering applied to renal cell carcinoma."
Modern Pathology, Oct 29 2004
[16] Wenke Lee, Salvatore J. Stolfo, Philip K. Chan, Eleazar Eskin, Wei Fan, Matthew
Miller, Shlomo Hershkop and Junxin Zhang, “Real Time Data Mining-based Intrusion
Detection”, Conference paper of the North Carolina State University at Raleigh
Department of Computer Science, Jan 2008
[17] Xiao Yu, Lu An Tang, Jiawei Han. “Filtering and Refinement: A Two-Stage
Approach for Efficient and Effective Anomaly Detection.” In Ninth IEEE International
Conference on Data Mining, 2009
11. List of figures
12. List of tables
13. List of acronyms
14. Annexes