2016 1st International Conference on Innovation and Challenges in Cyber Security (ICICCS 2016)

Data-Mining a Mechanism against Cyber Threats: A Review

Shipra Ravi Kumar, Prof. J. S. Jassi, Suman Avdhesh Yadav, Ravi Sharma
Computer Science & Engineering, Mechanical Engineering, Information Technology, Information Technology
shipra.chaudhary85@gmail.com, sumanavdheshyadav@gmail.com, rsharma@gn.amity.edu, jsjassi@gn.amity.edu

Abstract - Data mining is the process of analyzing data from different perspectives and summarizing it into useful information that can be used, for example, to increase revenue or cut costs. Cluster formation plays a vital role in data mining: the data are divided into groups, and the grouping is based on the similarity of the data with respect to different attributes. WEKA is an important data mining tool that is used for allocating and clustering data with various machine learning algorithms. The purpose of this paper is to compare different machine learning algorithms with respect to the type of data set, its size, the number of clusters and the cyber privacy platform. We also discuss different types of cyber threats in the computing world. Three clustering algorithms in data mining that preserve security are considered: the hierarchical clustering algorithm, Expectation Maximization (EM) and the K-means algorithm.

Keywords – MAP, WEKA, Cyber Threats, Intrusion Detection

978-1-5090-2084-3/16/$31.00 ©2016 IEEE

I. INTRODUCTION

Clustering can be defined as the collection of objects according to the similarities of their attributes. Each such collection is called a cluster. Clustering can be used for pattern recognition, bioinformatics, information retrieval and statistical data analysis.

Different clustering algorithms have different properties and are suitable for different circumstances. Some of the notions on which clustering algorithms are based are the distance between data points, dense areas and distribution. Suitable parameters, such as the distance function (e.g. Euclidean distance or Minkowski distance) or the threshold, depend on the data set. Clustering is an automatic process but sometimes needs changes in preprocessing.

According to Vladimir Estivill-Castro, there are several types of clusters, which cannot be generalized, so various machine learning algorithms are used depending on the type of cluster [1]. Different researchers have proposed different models for different types of data sets, supported by various machine learning algorithms. Clusters therefore vary significantly in their properties.

II. K-MEANS CLUSTERING ALGORITHM

This algorithm can be defined as a cluster analysis in which a number of observations is divided into K clusters, so that each observation belongs to the cluster with the nearest mean. The outcome is a partitioning of the data space into Voronoi cells.

K-means is considered one of the simplest learning algorithms that solve the well-known clustering problem. It works on a data set for a number of clusters K that is fixed a priori. Once the centroids are defined, they should be placed with care, because different positions lead to different results. The first priority is therefore to place them far from one another. In the next step, every point of the given data set is associated with its nearest centroid. When no point is left unassigned, the first step is over and an initial grouping is complete.

Once the k new centroids have been obtained, the same data points must be reassigned to the nearest new centroid, which gives rise to a loop. As a result of this loop, the k centroids change their positions step by step until no further changes occur, that is, until the centroids no longer move. The procedure can be described in the following steps:

1. Place K points into the space represented by the objects that are to be clustered. These points represent the initial group centroids.
2. Assign each object to the group with the nearest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move.

This produces a separation of the objects into groups from which the metric to be minimized can be calculated. The K-means problem is NP-hard and therefore difficult to solve exactly; the commonly used heuristic algorithms, however, converge quickly to a local optimum. These heuristics resemble the expectation maximization algorithm for mixtures of Gaussian distributions in that both use an iterative refinement approach. Both techniques use cluster centres to model the data; K-means clustering, however, tends to find clusters of comparable spatial extent, whereas expectation maximization allows clusters to have different shapes.

III. EM ALGORITHM FOR PRIVACY

An EM algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved hidden variables. The EM algorithm is applied once a satisfactory K-means result has been obtained. Each iteration of the Expectation Maximization algorithm alternates between an expectation (E) step, which evaluates the expectation of the log-likelihood using the latest estimate of the parameters, and a maximization (M) step, which computes the parameters that maximize the expected log-likelihood found in the E step. EM assigns to each instance a probability distribution that gives the probability of its belonging to each cluster. The algorithm can decide how many clusters to generate by cross-validation, or the number can be specified a priori.

IV. HIERARCHICAL CLUSTERING

In this method clusters are built up in a hierarchy and can be analyzed sequentially. Hierarchical clustering is performed in two categories: divisive and agglomerative.

Agglomerative (bottom up): In agglomerative clustering a bottom-up approach is applied. Clusters are paired up one by one, hierarchically.

1. We start with every point in a cluster of its own, which is known as a singleton.
2. In the second step, two or more clusters are merged recursively as one moves up the hierarchy.

The method terminates when K clusters have been obtained by merging.

Divisive (top down): In this approach clustering starts from the other end: clusters are split recursively, one by one, as one moves down the hierarchy.

1. The first step starts with one big cluster.
2. In the second step, large clusters are partitioned into smaller ones, one by one.

The process terminates when K clusters have been obtained.

In general, splits and merges are determined in a greedy manner. Agglomerative clustering becomes slow on large data sets. In hierarchical clustering, a data set of N items to be clustered is given, and an N×N distance matrix is prepared from the distances between the data points.

Table 1: Comparison of Algorithms for Security Applications in Data Mining

V. MALICIOUS CODE AND INTRUSION DETECTION

An intrusion can be described as an unauthorized attack on the availability of resources, integrity or data confidentiality. These attacks can be categorized into two types: network-based attacks and host-based attacks.

In a host-based attack, a particular machine is targeted and the attacker tries to gain unauthorized access to it. Detection schemes for such attacks primarily use simple routines that obtain system-call data from an audit process, which tracks the system calls made by every user.

The other type is the network-based attack, which prevents authorized users from using existing network services in a meaningful way. Such attacks can be detected by using network traffic data and continuously monitoring the traffic of the system nodes. Intrusion detection systems can be categorized into two groups: misuse detection systems and anomaly detection systems.

Cyber security includes a few more applications, such as analyzing data for auditing computer applications. We can build a data warehouse that contains the data to be audited and then use existing data mining tools to determine whether potential anomalies are present.

Using data mining techniques, confidential information can be restricted to legitimate users and unauthorized access can be stopped. Data mining can thus be used effectively to detect and prevent cyber attacks, but it also aggravates security issues such as privacy and inference. The security model is shown in Fig. 1.

Fig 1: Security Model in Data Mining

VI. ISSUES IN CYBER SECURITY

Here we discuss cyber terrorism related to the spoofing of confidential information. This can happen through a security breach and access by an unauthorized user. Malicious software, such as viruses and Trojan horses, is the cause of security violations that can lead to antisocial activities in the world of cyber crime.

VII. MALICIOUS INTRUSIONS IN DATA MINING

The targets of malicious intrusion include servers, web clients, operating systems, networks and databases. Most cyber attacks and acts of cyber terrorism happen through malicious intrusion, in which someone without any authorization attacks a protected network to obtain confidential information. The intruder may be a human or malicious automated software (a robot) made by a human. When discussing cyber attacks or malicious intrusions, it is often useful to draw analogies with the non-cyber world, that is, with cases relevant to cyber terrorism, and to translate them to attacks on computers and networks. Cyber terror is increasing day by day worldwide, as shown in Fig. 2.

Fig 2: Graph of Increasing Cyber Terror Worldwide

VIII. EXTERNAL ATTACKS, INSIDER THREATS AND CYBER-TERRORISM

Cyber attacks are a major concern today. The cyber threat is increasing day by day, helped by the information available on the Internet. Cyber threats and attacks on existing networks and computer infrastructure can disrupt business, and it is estimated that cyber terrorism could cause losses of millions of dollars. Cyber threats can come from inside or outside the organization. An attack on the organization's computers by someone from outside is known as an external cyber attack. In such an attack, hackers break into the system and cause chaos in the organization.

CONCLUSION

We conclude that when data mining is used for potentially intrusive purposes, it is possible to identify confidential data. To protect cyber security against malicious software, the clustering techniques described above can be used. The EM algorithm ensures privacy without compromising the accuracy of results with respect to computation and communication cost. For real-world data, EM generally gives appropriate results. Data mining for cyber security and national security is a wide and active area of research. Various other data mining techniques, such as association rule mining and link analysis, can be used to identify abnormal patterns. Data mining helps users to make all kinds of correlations, which also raises privacy concerns.

REFERENCES

[1] M. M. Masud, L. Khan, B. Thuraisingham, X. Wang, P. Liu, and S. Zhu, "A Data Mining Technique to Detect Remote Exploits", in Proc. IFIP WG 11.9 International Conference on Digital Forensics, Japan, Jan. 27-30, 2008.
[2] L. Khan, M. Awad, and B. Thuraisingham, "A New Intrusion Detection System using Support Vector Machines and Hierarchical Clustering", The VLDB Journal, ACM/Springer-Verlag, 16(1), pp. 507-521, 2007.
[3] A. Lazarevic, et al., "Data Mining for Computer Security Applications", Tutorial, Proc. IEEE Data Mining Conference, 2003.
[4] M. Abedin, S. Nessa, L. Khan, and B. Thuraisingham, "Detection and Resolution of Anomalies in Firewall Policy Rules", in Proc. 20th IFIP WG 11.3 Working Conference on Data and Applications Security (DBSec 2006), Springer-Verlag, Sophia Antipolis, France, July 2006, pp. 15-29.
[5] S. Hofmeyr, S. Forrest, and A. Somayaji, "Intrusion Detection Using Sequences of System Calls", Journal of Computer Security, vol. 6, pp. 151-180, 1998.
[6] B. Thuraisingham, "Database and Applications Security", CRC Press, 2005.
[7] R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining", in Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, May 2000.
[8] M. Atallah, E. Bertino, A. K. Elmagarmid, M. Ibrahim, and V. S. Verykios, "Disclosure Limitation of Sensitive Rules", in Proceedings of the 1999 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX'99), Chicago, IL, November 1999, pp. 45-52.
[9] D. Agrawal and C. C. Aggarwal, "On the Design and Quantification of Privacy Preserving Data Mining Algorithms", in Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2001.
[10] S. Rath, D. Jones, J. Hale, and S. Shenoi, "A Tool for Inference Detection and Knowledge Discovery in Databases", in Proceedings of the 9th IFIP WG11.3 Workshop on Database Security.
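
As an illustration of the four K-means steps listed in Section II, the following is a minimal sketch in Python. It is not the implementation used in WEKA. The function name kmeans, the parameters max_iter and seed, the use of randomly chosen input points as initial centroids, and the rule that an empty cluster keeps its previous centroid are all assumptions made for this example; the paper itself recommends placing the initial centroids far from one another.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    Hypothetical helper for illustration; returns (centroids, labels).
    """
    rng = random.Random(seed)
    # Step 1: choose k input points as the initial centroids (an assumption;
    # the paper suggests placing them far apart instead).
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid (Euclidean).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

Calling kmeans(points, 2) on a list of 2-D tuples returns two centroids and one cluster index per point. Because the result depends on the initial centroids, different seeds may lead to different local optima on less clearly separated data.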
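
The agglomerative (bottom-up) procedure of Section IV can be sketched in the same way. This is a simplified illustration under stated assumptions, not the authors' method: the function name agglomerative is hypothetical, Euclidean distance is assumed, and single linkage (the distance between the closest members of two clusters) is chosen as the merge criterion because the paper does not name one. The sketch builds the N×N distance matrix, starts from singletons and greedily merges the two closest clusters until K clusters remain.

```python
import math


def agglomerative(points, k):
    """Merge singleton clusters bottom-up until only k clusters remain.

    Hypothetical helper for illustration; returns lists of point indices.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(points)
    # N x N matrix of pairwise Euclidean distances between the data points.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Step 1: every point starts as its own singleton cluster (by index).
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        # Step 2: greedily find the two clusters that are closest together.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage (assumed): closest pair of members.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a and drop b from the list.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Each pass rescans all cluster pairs, so the running time grows quickly with N, which is consistent with the remark that agglomerative clustering becomes slow on large data sets.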