2016 1st International Conference on Innovation and Challenges in Cyber Security (ICICCS 2016)
Data-Mining a Mechanism against Cyber
Threats: A Review
Shipra Ravi Kumar
Prof. J.S.Jassi
Suman Avdhesh Yadav
Ravi Sharma
Computer Science & Engineering, Mechanical Engineering
Information Technology
Information Tech.
shipra.chaudhary85@gmail.com
sumanavdheshyadav@gmail.com
rsharma@gn.amity.edu
jsjassi@gn.amity.edu
According to Vladimir Estivill-Castro, there are
several types of clusters that cannot be generalized, so
different machine learning algorithms are used
depending on the type of cluster [1]. Researchers have
proposed different models for different types of
datasets, supported by various machine learning
algorithms, and the resulting clusters vary
significantly in their properties.
Abstract - Data mining is the process of analyzing
data from different perspectives and summarizing it
into useful information that can be used, for example,
to increase revenue or cut costs. In data mining,
cluster formation plays a vital role: the data is
divided into different groups. Clustering is the
technique of grouping similar data according to their
attributes. WEKA is one of the most important data
mining tools; it is used to allocate and cluster data
with various machine learning algorithms. The purpose
of this paper is to compare different machine learning
algorithms with respect to the type of dataset, its
size, the number of clusters, and the cyber privacy
platform. We also discuss different types of cyber
threats in the computing world.
This paper considers three clustering algorithms in
data mining that preserve security: the hierarchical
clustering algorithm, Expectation Maximization (EM),
and the K-means algorithm.
Keywords – MAP, WEKA, Cyber Threats, Intrusion Detection
II. K-MEANS CLUSTERING ALGORITHM
K-means is a method of cluster analysis in which
n observations are partitioned into K clusters, with
each observation belonging to the cluster with the
nearest mean. This partitions the data space into
Voronoi cells.
I. INTRODUCTION
Clustering can be defined as the grouping of
objects according to the similarity of their
attributes. Each such group is called a cluster.
Clustering can be used for pattern recognition,
bioinformatics, information retrieval, and
statistical data analysis.
K-means is considered one of the simplest learning
algorithms that solve the well-known clustering
problem. It classifies a dataset into a number of
clusters that is fixed a priori. After defining the
centroids, they must be placed carefully, because
different positions lead to different results. The
first priority is therefore to place them as far from
one another as possible. In the next step, every point
in the given dataset is associated with its nearest
centroid. When no point is left unassigned, the first
step is over and an initial grouping is complete.
Different clustering algorithms have different
properties and suit different circumstances. Some of
the attributes on which clustering algorithms are
based are the distance between data points, dense
areas, and distributions. Suitable parameters, such as
the distance function (for example Euclidean distance
or Minkowski distance) or the threshold, depend on
the dataset. Clustering is an automatic process, but
sometimes the preprocessing needs to be adjusted.
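As a minimal sketch of the distance functions named above, the following Python snippet computes the Minkowski distance, of which the Euclidean distance is the special case p = 2. The function names and the sample points are illustrative assumptions and do not come from the paper.

```python
def minkowski_distance(a, b, p=2.0):
    """Minkowski distance of order p between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of attributes")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean_distance(a, b):
    """Euclidean distance, i.e. the Minkowski distance with p = 2."""
    return minkowski_distance(a, b, p=2.0)


def manhattan_distance(a, b):
    """Manhattan distance, i.e. the Minkowski distance with p = 1."""
    return minkowski_distance(a, b, p=1.0)


# Illustrative points (not from the paper).
d_euclid = euclidean_distance((0.0, 0.0), (3.0, 4.0))
d_manhattan = manhattan_distance((0.0, 0.0), (3.0, 4.0))
```

Choosing p changes which points count as "near", so the same dataset can cluster differently under different distance functions.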
Once the k new centroids have been computed, each
data point must be reassigned to its nearest new
centroid, which generates a loop.
978-1-5090-2084-3/16/$/31.00©2016 IEEE
Hierarchical clustering is performed in two
categories: divisive and agglomerative.
Agglomerative (bottom up) - Agglomerative clustering
applies a bottom-up approach, in which clusters are
paired up one by one, hierarchically.
1. We start with each point in its own cluster, which
is known as a singleton.
2. In the second step, two or more clusters are merged
recursively as one moves up the hierarchy.
The method terminates when K clusters remain after
merging.
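The bottom-up procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not say how the distance between two clusters is measured, so single linkage (the closest pair of points) is assumed here, and the function names and sample points are hypothetical.

```python
import math


def single_link(c1, c2):
    """Single-linkage distance: the closest pair of points across two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)


def agglomerative(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Step 1: every point starts as its own singleton cluster.
    clusters = [[p] for p in points]
    # Step 2: repeatedly merge the closest pair of clusters.
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Illustrative data (not from the paper).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = agglomerative(points, k=3)
```

Every merge rescans all cluster pairs, which illustrates why this naive form becomes slow on large datasets.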
As a result of this loop, the positions of the k
centroids change step by step until no further changes
occur, that is, until the centroids are static. The
algorithm can be described in the following steps:
1. Place K points into the space represented by the
objects to be clustered. These points represent the
initial group centroids.
2. Assign each object to the group with the nearest
centroid.
3. Once all objects have been assigned, recalculate
the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer
move.
This divides the objects into groups from which the
metric to be minimized can be calculated. The problem
is NP-hard and therefore difficult to solve exactly,
so heuristic algorithms that converge quickly to a
local optimum are commonly used. These heuristics and
the expectation maximization algorithm for mixtures of
Gaussian distributions have one thing in common: both
use an iterative refinement approach. Both techniques
also use cluster centres to model the data; however,
K-means clustering tends to find clusters of
comparable spatial extent, whereas expectation
maximization allows clusters to have different shapes.
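The four steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the farthest-first initialization is an assumed way of placing the initial centroids far from one another, and all function names and sample points are hypothetical.

```python
import math


def init_centroids(points, k):
    """Farthest-first initialization: spread the initial centroids apart."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest chosen centroid is farthest away.
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def kmeans(points, k, max_iter=100):
    """Heuristic K-means; returns the final centroids and the groups."""
    centroids = init_centroids(points, k)  # step 1
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[idx].append(p)
        # Step 3: recompute each centroid as the mean of its group.
        new_centroids = []
        for g, old in zip(groups, centroids):
            if g:
                new_centroids.append(tuple(sum(x) / len(g) for x in zip(*g)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty group
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups


# Illustrative data (not from the paper).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, groups = kmeans(points, k=2)
```

Because the procedure only reaches a local optimum, a different initialization can give a different grouping on less clearly separated data.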
Divisive (top down) - Divisive clustering applies a
top-down approach: starting from a single cluster,
clusters are split recursively, one by one, down the
hierarchy. In general, both splits and merges are
determined in a greedy manner. Agglomerative
clustering becomes slow for large datasets.
1. The first step starts with one big cluster.
2. In the second step, large clusters are partitioned
into smaller ones, one by one. The process terminates
when K clusters have been obtained.
In hierarchical clustering, a dataset of N items to
be clustered is given, and an N*N distance matrix is
prepared from the distances between the data points.
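A minimal sketch of the top-down procedure, built on the N*N distance matrix, is given below. The paper does not specify how a cluster is chosen or split, so two assumptions are made here: the cluster with the largest diameter is split next, and its two most distant members act as seeds to which the remaining members are assigned. All names and sample points are hypothetical.

```python
import math


def distance_matrix(points):
    """N x N matrix of pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in points] for p in points]


def farthest_pair(cluster, dist):
    """Return (diameter, i, j) for the two most distant members of a cluster."""
    best = (0.0, cluster[0], cluster[0])
    for a in cluster:
        for b in cluster:
            if dist[a][b] > best[0]:
                best = (dist[a][b], a, b)
    return best


def divisive(points, k):
    """Split clusters of point indices top-down until k clusters exist."""
    dist = distance_matrix(points)
    clusters = [list(range(len(points)))]  # step 1: one big cluster
    while len(clusters) < k:
        # Step 2: greedily split the cluster with the largest diameter.
        diameter, a, b = max(farthest_pair(c, dist) for c in clusters)
        if diameter == 0.0:
            break  # nothing left to split
        target = next(c for c in clusters if a in c)
        left = [i for i in target if dist[i][a] <= dist[i][b]]
        right = [i for i in target if dist[i][a] > dist[i][b]]
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters


# Illustrative data (not from the paper); clusters hold indices into points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = divisive(points, k=3)
```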
III. EM ALGORITHM FOR PRIVACY
The EM algorithm is an iterative method for
finding maximum likelihood or maximum a
posteriori (MAP) estimates of parameters in
statistical models, where the model depends on
unobserved hidden variables. EM can be applied once
K-means has produced a satisfactory result. Each
iteration of the Expectation-Maximization algorithm
alternates between an expectation (E) step, which
evaluates the expectation of the log-likelihood using
the current parameter estimates, and a maximization
(M) step, which computes the parameters that maximize
that expected log-likelihood. The EM algorithm assigns
each instance a probability distribution that
indicates the probability of it belonging to each
cluster. EM can decide how many clusters to generate
by cross-validation, or the number of clusters can be
specified a priori.
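To make the alternation between the E-step and the M-step concrete, the sketch below runs EM on a one-dimensional mixture of Gaussians. The choice of a Gaussian mixture, the variance floor, the starting values, and the sample data are all assumptions made for illustration; the paper does not give these details.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns parameters and log-likelihoods."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each cluster.
        resp = []
        ll = 0.0
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([v / total for v in joint])
        log_likelihoods.append(ll)
        # M-step: re-estimate the parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # assumed floor to avoid a zero variance
            weights[j] = nj / len(data)
    return means, variances, weights, log_likelihoods


# Illustrative data and starting values (not from the paper).
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, lls = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
```

The starting means could be taken from a K-means run, which matches the idea of applying EM after K-means has produced a satisfactory result.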
IV. HIERARCHICAL CLUSTERING
In this method, clusters are built up in a hierarchy
and can be analyzed sequentially.
Table 1: Comparison of Algorithms for Security Applications in Data Mining
Fig 1: Security Model in Data Mining
VI. MALICIOUS CODE AND INTRUSION DETECTION
An intrusion can be explained as an unauthorized
attack on the availability of resources, integrity,
and data confidentiality. These attacks fall into
two types: network-based attacks and host-based
attacks. In a host-based attack, a specific machine
is targeted and the attacker tries to gain
unauthorized access to it. Detection schemes for this
type of attack primarily use simple routines that
obtain system call data from an audit process, which
tracks the system calls performed by every user.
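One simple way to use audited system calls for host-based detection is to compare short windows of calls against a profile of normal behaviour, in the spirit of the sequence-based approach of reference [5]. The sketch below is an assumption-laden illustration: the window length, the scoring rule, and the example traces are hypothetical and are not taken from the paper.

```python
def ngrams(trace, n):
    """All contiguous windows of n system calls in a trace."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def build_profile(normal_traces, n=3):
    """Collect every n-gram seen in traces of normal behaviour."""
    profile = set()
    for trace in normal_traces:
        profile.update(ngrams(trace, n))
    return profile


def anomaly_score(trace, profile, n=3):
    """Fraction of n-grams in the trace that never occurred in normal data."""
    windows = ngrams(trace, n)
    if not windows:
        return 0.0
    unseen = sum(1 for w in windows if w not in profile)
    return unseen / len(windows)


# Hypothetical audit traces, one list of system calls per process.
normal = [
    ["open", "read", "mmap", "read", "close"],
    ["open", "mmap", "read", "close"],
]
profile = build_profile(normal)
benign_score = anomaly_score(["open", "read", "mmap", "read", "close"], profile)
suspect_score = anomaly_score(["open", "read", "execve", "socket", "close"], profile)
```

A trace would be flagged when its score exceeds a threshold chosen from validation data; that threshold is left open here.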
V. ISSUES IN CYBER SECURITY
Cyber terrorism can be discussed here in relation to
the spoofing of confidential information, which can
happen through a security breach and access by
unauthorized users. Malicious software and viruses
such as Trojan horses are the cause of security
violations, which can lead to antisocial activities in
the world of cyber crime.
The other type of attack is the network-based
attack, which prevents authorized users from using
existing network services in a meaningful way. Such
attacks can be detected by using network traffic data
and continuously monitoring the traffic addresses of
the system nodes. Detection systems can be
categorized into two groups: misuse detection systems
and anomaly detection systems.
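The difference between the two groups can be illustrated with a short sketch: misuse detection matches traffic against signatures of known attacks, while anomaly detection flags traffic that deviates from a learned baseline. The signatures, packet fields, baseline rates, and threshold below are hypothetical values chosen for illustration only.

```python
import statistics

# Hypothetical signatures of known attacks: (destination port, payload marker).
SIGNATURES = {(23, "root login"), (80, "../../etc/passwd")}


def misuse_detect(packet):
    """Misuse detection: flag traffic that matches a known attack signature."""
    return any(
        packet["port"] == port and marker in packet["payload"]
        for port, marker in SIGNATURES
    )


def anomaly_detect(baseline_rates, observed_rate, threshold=3.0):
    """Anomaly detection: flag a rate far from the learned normal baseline."""
    mean = statistics.mean(baseline_rates)
    stdev = statistics.stdev(baseline_rates)
    if stdev == 0:
        return observed_rate != mean
    # Flag rates more than `threshold` standard deviations from the mean.
    return abs(observed_rate - mean) / stdev > threshold


# Hypothetical packets-per-second baseline for one node.
baseline = [100, 110, 95, 105, 90]
known_attack = misuse_detect({"port": 80, "payload": "GET /../../etc/passwd"})
normal_request = misuse_detect({"port": 80, "payload": "GET /index.html"})
traffic_spike = anomaly_detect(baseline, 500)
usual_traffic = anomaly_detect(baseline, 104)
```

Misuse detection cannot see attacks that have no signature yet, whereas anomaly detection can, at the cost of false alarms when normal behaviour shifts.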
Cyber security also includes applications that
analyze data for auditing computer applications. We
can build a data warehouse that contains the audit
data and then use existing data mining tools to
analyze whether potential anomalies are present.
By using data mining techniques, we can restrict
confidential information or data to legitimate users
and stop unauthorized access. Data mining can be used
effectively to detect and prevent cyber attacks;
however, it can also aggravate security issues such as
privacy and inference. The security model is shown in
Fig 1.
VII. MALICIOUS INTRUSIONS IN DATA MINING
Malicious intrusions can target servers, web clients,
operating systems, networks, and databases. Most
cyber attacks and cyber terrorism happen because of
malicious intrusion. In a malicious intrusion, someone
without any authorization tries to attack a secure
network and obtain confidential information. The
intruder might be malicious automated software, a
robot made by a human, or a human
intruder. When studying cyber attacks or malicious
intrusions, it is often useful to draw analogies from
the non-cyber world, that is, from non-cyber
terrorism, and then translate those attacks to
computers and networks. Cyber terror is increasing day
by day worldwide, as shown in Fig 2.
Data mining for cyber security and national security
is a very wide and active area of research. Other data
mining techniques, such as association rule mining and
link analysis, can also be used to identify abnormal
patterns. Data mining helps users make all kinds of
correlations, which also raises privacy concerns.
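As a small illustration of association rule mining for spotting patterns, the sketch below computes the support and confidence of a rule over a set of event records. The event names and records are hypothetical and only show how the two measures are calculated.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical security events grouped per host and time window.
events = [
    {"failed_login", "port_scan", "alert"},
    {"failed_login", "port_scan", "alert"},
    {"failed_login"},
    {"port_scan", "alert"},
]
rule_support = support(events, {"failed_login", "port_scan"})
rule_confidence = confidence(events, {"failed_login", "port_scan"}, {"alert"})
weak_confidence = confidence(events, {"failed_login"}, {"alert"})
```

The same correlations that expose attack patterns can also link records about individuals, which is where the privacy concern arises.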
REFERENCES
[1] Masud, M. M., Khan, L., Thuraisingham, B., Wang, X., Liu, P.,
and Zhu, S., “A Data Mining Technique to Detect Remote
Exploits”, In Proc. IFIP WG 11.9 International Conference
on Digital Forensics, Japan, Jan 27-30, 2008.
[2] Khan, L., Awad, M. and Thuraisingham, B., “A New Intrusion
Detection System using Support Vector Machines and
Hierarchical Clustering”, The VLDB Journal:
ACM/Springer-Verlag, 16(1), page 507-521, 2007.
[3] Lazarevic, A., et al., “Data Mining for Computer Security
Applications”, Tutorial Proc. IEEE Data Mining Conference,
2003.
[4] Abedin, M., Nessa, S., Khan, L., Thuraisingham, B.,
“Detection and Resolution of Anomalies in Firewall Policy
Rules”, In Proc. 20th IFIP WG 11.3 Working Conference on
Data and Applications Security (DBSec 2006),
Springer-Verlag, July 2006, Sophia Antipolis, France, page
15-29.
[5] S. Hofmeyr, S. Forrest, and A. Somayaji, “Intrusion Detection
Using Sequences of System Calls”, Journal of Computer
Security, Vol. 6, pp. 151-180 (1998).
[6] Thuraisingham, B., “Database and Applications Security”, CRC
Press, 2005.
[7] R. Agrawal and R. Srikant, “Privacy-Preserving Data
Mining”, Proc. of the ACM SIGMOD Conference on
Management of Data, Dallas, May 2000.
[8] M. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim,
and V. S. Verykios, “Disclosure Limitation of
Sensitive Rules”, In Proceedings of 1999 IEEE Knowledge
and Data Engineering Exchange Workshop (KDEX'99), pp.
45-52, November 1999, Chicago, IL.
[9] Dakshi Agrawal and Charu C. Aggarwal, “On the design and
quantification of privacy preserving data mining algorithms”,
in Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART
Symposium on Principles of Database Systems, 2001.
[10] S. Rath, D. Jones, J. Hale, S. Shenoi, “A Tool for Inference
Detection and Knowledge Discovery in Databases”,
in Proceedings of the 9th IFIP WG11.3 Workshop on
Database Security.
Fig 2: Graph of Increasing Cyber Terror Worldwide.
VIII. EXTERNAL ATTACKS, INSIDER THREATS AND CYBER-TERRORISM
Cyber attacks are a major concern today. As we are
all aware, this threat is increasing day by day,
helped by the information available on the Internet.
Cyber threats and cyber attacks on existing
networks and computing infrastructure can disrupt
business. It is estimated that cyber terrorism could
cause losses of millions of dollars.
Cyber threats can originate from inside or outside
the organization. An attack on the organization's
computers by someone from outside the organization is
known as an outside cyber attack. In such an attack,
hackers break into the system and cause chaos in the
organization.
CONCLUSION:
We can conclude that data mining used for
potentially intrusive purposes makes it possible to
identify confidential data. The clustering techniques
described above can be used to preserve cyber security
against malicious software. The EM algorithm ensures
privacy without compromising the accuracy of the
results or the computation and communication cost. For
real-world data, EM generally gives appropriate
results.