2018 IEEE International Parallel and Distributed Processing Symposium Workshops

Integrating Cyber Security and Data Science for Social Media: A Position Paper

Bhavani Thuraisingham, Murat Kantarcioglu, Latifur Khan
Erik Jonsson School of Engineering and Computer Science
The University of Texas at Dallas
Richardson, TX, USA
Emails: {bhavani.thuraisingham, muratk, lkhan}@utdallas.edu

978-1-5386-5555-9/18/$31.00 ©2018 IEEE
DOI 10.1109/IPDPSW.2018.00178

Abstract: Cyber security and data science are two of the fastest growing fields in computer science, and more recently they are being integrated for various applications. This position paper reviews the developments in applying data science to cyber security and cyber security to data science, and then discusses the applications in social media.

Keywords: Cyber Security; Data Science; Social Media; Fake News; Adversarial Machine Learning; Malware Analysis.

I. INTRODUCTION

Cyber security and data science are two of the rapidly growing fields in computer science as well as in related areas such as statistics and social science. Data science integrates areas such as data mining, data management, statistical reasoning, machine learning and high-performance computing. The goal is to analyze large volumes of heterogeneous data and uncover hidden dependencies as well as make predictions. Cyber security is about controlling access to the data and ensuring that the data is not maliciously corrupted. Data science is being applied to cyber security in areas such as intrusion detection, malicious code detection, and insider threat detection, among others [1], [2]. Cyber security is also being applied to data science in areas like adversarial machine learning [3], [4]. Some applications have been discussed in social media [5], [6], [7]. This position paper examines the developments in applying one area to the other (i.e., data science and cyber security) and discusses the applications in social media.

The organization of the paper is as follows. Section II describes Data Science for Cyber Security. Section III discusses Cyber Security for Data Science. Some of the applications in Social Media are discussed in Section IV. The paper is concluded in Section V.

II. DATA SCIENCE FOR CYBER SECURITY

Data science techniques have been applied to cyber security problems since the 1990s. For example, data mining techniques have been applied to areas such as intrusion detection, malware analysis, and insider threat detection. The idea here is to train the model with the training data (e.g., past experiences) and then test the model with the test data. The goal is to reduce false positives and false negatives as well as improve the accuracy of the prediction. More recently, with the ability to handle large volumes of data, there is a lot of interest in applying data science to cyber security. The goal is to first develop a model using training data and then apply this model to test data to determine, say, whether a piece of code is malicious or whether an employee is a threat to the organization.

One of the major challenges that we have to address is that the malware may change patterns. Therefore, we need solutions to handle such zero-day attacks. We have made many contributions to applying data science to cyber security and in particular developed a technique called novel class detection [8]. The idea is to train the model using classification techniques to detect pre-defined classes as well as new (called novel) classes. We have applied our technique to streaming data for many applications, including detecting new malware and insiders [7]. As progress is made with the various data science techniques, we believe that detecting unusual patterns in large amounts of streaming data will be vastly improved, with fewer false positives and false negatives and higher accuracy.

While this area has a lot of promise, it is very difficult to obtain datasets containing various malware. To address this problem, we are collecting datasets and hosting them at the University of Arizona (on a collaborative NSF project), and some preliminary work is discussed in [9]. We need to come together as a community to share the various datasets at the unclassified level so that researchers can carry out experimentation in, say, malware analysis and insider threat detection.

III. CYBER SECURITY FOR DATA SCIENCE

While much progress has been made on applying data science to cyber security, many of the techniques do not take into consideration the attacker's behavior in developing the models. We were one of the early teams to focus on this area and develop techniques for what has come to be called adversarial machine learning. In this technique we take into consideration the potential attacks by the attacker, such as injecting good data from time to time as well as, say, varying the packet lengths with respect to network traffic. We train the model taking into consideration the attacker's behavior during deployment time. We designed and developed a technique called Adversarial Support Vector Machine (AD-SVM), and we have shown that AD-SVM detects many of the attacks carried out by the attacker [3]. More recently there is a lot of research on adversarial machine learning, and there are conferences devoted to cyber security analytics [10].

The challenge we are faced with is modeling all types of attacks. For example, in our work we considered a limited number of features such as varying packet lengths. But in reality, there could be hundreds of attacks, each resulting in numerous features. Therefore, how do we take into consideration the majority of the attacks? How do we get datasets to test our results? We have many challenging problems to work on in this area.

Another related area that is showing promise is trustworthy analytics. That is, how do we ensure that the data analytics techniques are secure and trustworthy [11]? The developments with the Intel SGX technology are providing progress in this area. The challenges are to explore ways to use such technologies to prevent attackers from modifying the data analytics techniques. A third area that has received attention for over a decade is privacy-preserving data analytics [12]. The key idea is how we carry out data analysis and at the same time ensure privacy. A promising area of research is secure multiparty computation [13].

IV. SOCIAL MEDIA APPLICATION

One application area that has shown a lot of promise for integrating cyber security and data science is social media. Social media applications have to be secure. Social media data has to be analyzed to extract nuggets for social good, and numerous data science techniques have been applied to social media applications for over a decade. At the same time, it is important to ensure the privacy of the individuals. Our early work focused on designing access control techniques for social media applications [6]. Next, we designed various social media analytics techniques. For example, our research has focused on analyzing tweets for security as well as marketing applications [7]. It is important for social media analytics to preserve the privacy of the individuals. Therefore, our work has also focused on privacy-enhanced social media analytics [5]. We have shown that even though social media users do not give out private information, it is possible to extract private information from the information posted on their social media sites. Attacks on social media are also an active area of research [14].

One of the major challenges we are faced with today is what has come to be known as Fake News. How do we prevent false and damaging information from being spread on social media? How do we ensure that the social media data is accurate and has high integrity? While the idea is the same for both, there are some serious differences. With the first, malicious users are posting false information about others that could damage their reputation. For example, a person may post a photograph of themselves with a black eye that they claim was inflicted by their spouse. The end result could be especially damaging to the spouse, who could be fired from their job right away, well before the legal process runs its course. This is very distressing. But, on the other hand, if the photo is genuine and it is the spouse who inflicted the injury, then appropriate actions have to be taken.

In the second case, some users may post false information about themselves, such as traveling to exotic cities and having certain prestigious degrees. Then there are those who may pretend to work for a company even though they have likely been laid off or at least are no longer with the company. This is a form of lying that can affect others. For example, when two people apply for a job, the candidate who posts false information may get it. That is, it is likely that some users may post false information and, as a result, serious decisions (e.g., job offers) may be made based on such information. How do we prevent such situations from occurring? Some preliminary research is reported in [15].

Also, it is now well known that social media companies are giving their data to researchers for experimentation, and this data is being misused as a result of the researchers sharing the data with organizations without authorization. This is one of the major challenges facing social media companies. The question is also whether we need regulations as to what information can be shared by the social media companies. We need to focus on assured information sharing for social media applications [16]. Some are saying that once a person posts personal information on social media then there is nothing the company can do. However, just as in the case of an automobile company, the social media company should discuss the policies with the user and give the user sufficient warning of all the consequences of posting the data. There is also the question of whether a social media user should need the permission of those in a photo before posting it. Also, should there be a policy that no photos should be posted of, say, children under a certain age? Much of our research on the inference problem is relevant to detecting violations that can result from the inference and aggregation of the data [17].

V. CONCLUSION

This paper has discussed the application of data science to cyber security and cyber security to data science. We also discussed the applications in social media. There are many areas for future research. First, we need improved data science techniques that can handle massive amounts of data rapidly. We also need better models for adversarial machine learning that take into consideration a wider range of attacks. Finally, we need to continue working on integrating cyber security and data science for social media applications. This includes addressing the challenging problem of the spread of Fake News. Just as we have done in data privacy during the past decade, we need technologists, policy makers, social and political scientists, and legal experts to work together to develop viable solutions to this problem.

REFERENCES

[1] M. Masud, L. Khan, and B. Thuraisingham, Data Mining Applications in Malware Detection. CRC Press, 2011.
[2] B. Thuraisingham, L. Khan, P. Parveen, and M. M. Masud, Big Data Analytics with Applications in Insider Threat Detection. Auerbach Publications, 2017.
[3] Y. Zhou, M. Kantarcioglu, B. Thuraisingham, and B. Xi, “Adversarial support vector machine learning,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2012, pp. 1059–1067.
[4] Y. Zhou and M. Kantarcioglu, “Modeling adversarial learning as nested Stackelberg games,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2016, pp. 350–362.
[5] R. Heatherly, M. Kantarcioglu, and B. Thuraisingham, “Preventing private information inference attacks on social networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 8, pp. 1849–1862, 2013.
[6] B. Carminati, E. Ferrari, R. Heatherly, M. Kantarcioglu, and B. Thuraisingham, “Semantic web-based social network access control,” Computers & Security, vol. 30, no. 2-3, pp. 108–115, 2011.
[7] B. Thuraisingham, S. Abrol, L. Khan, R. Heatherly, M. Kantarcioglu, and V. Khadilkar, Analyzing and Securing Social Networks. Auerbach Publications, 2016.
[8] M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham, “Classification and novel class detection in concept-drifting data streams under time constraints,” IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6, pp. 859–874, 2011.
[9] R. Paranthaman and B. Thuraisingham, “Malware collection and analysis,” in Information Reuse and Integration (IRI), 2017 IEEE International Conference on. IEEE, 2017, pp. 26–31.
[10] R. Verma and B. Thuraisingham, “Privacy-preserving data mining,” in International Workshop on Security And Privacy Analytics, IWSPA@CODASPY. ACM, 2017.
[11] S. Chandra, V. Karande, Z. Lin, L. Khan, M. Kantarcioglu, and B. Thuraisingham, “Securing data analytics on SGX with randomization,” in European Symposium on Research in Computer Security. Springer, 2017, pp. 352–369.
[12] R. Agrawal and R. Srikant, “Privacy-preserving data mining,” in ACM SIGMOD Record, vol. 29, no. 2. ACM, 2000, pp. 439–450.
[13] M. Kantarcioglu and J. Vaidya, “Secure multiparty computation methods,” in Encyclopedia of Database Systems. Springer, 2009, pp. 2535–2539.
[14] Y. Alufaisan, Y. Zhou, M. Kantarcioglu, and B. Thuraisingham, “Hacking social network data mining,” in Intelligence and Security Informatics (ISI), 2017 IEEE International Conference on. IEEE, 2017, pp. 54–59.
[15] L. Fan, Z. Lu, W. Wu, B. Thuraisingham, H. Ma, and Y. Bi, “Least cost rumor blocking in social networks,” in Distributed Computing Systems (ICDCS), 2013 IEEE 33rd International Conference on. IEEE, 2013, pp. 540–549.
[16] T. Cadenhead, M. Kantarcioglu, V. Khadilkar, and B. Thuraisingham, “Design and implementation of a cloud-based assured information sharing system,” in International Conference on Mathematical Methods, Models, and Architectures for Computer Network Security. Springer, 2012, pp. 36–50.
[17] B. Thuraisingham, T. Cadenhead, M. Kantarcioglu, and T. Cadenhead, “Access control and inference with semantic web,” in CRC Press, 2014.
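
As a supplementary illustration of the novel class detection idea from Section II (train on pre-defined classes, then flag test instances that fit none of them), a minimal Python sketch follows. It is not the technique of [8], which targets concept-drifting data streams; it swaps in a simple nearest-centroid classifier with a distance threshold. The two-dimensional feature values, the class names, the `slack` parameter, and the `NOVEL` label are all assumptions made only for this example.

```python
import math
from collections import defaultdict

NOVEL = "novel"  # hypothetical label for instances outside every known class


def fit_centroids(samples, labels):
    """Learn {label: (centroid, radius)} from labeled training vectors."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    model = {}
    for y, points in groups.items():
        dim = len(points[0])
        centroid = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
        # Radius = distance from the centroid to the farthest training point.
        # A class with a single training point gets radius 0.
        radius = max(math.dist(p, centroid) for p in points)
        model[y] = (centroid, radius)
    return model


def predict(model, x, slack=1.5):
    """Return the nearest known class whose enlarged radius covers x,
    or NOVEL when x falls outside every known class."""
    inside = [
        (math.dist(x, centroid), label)
        for label, (centroid, radius) in model.items()
        if math.dist(x, centroid) <= slack * radius
    ]
    return min(inside)[1] if inside else NOVEL


# Toy training data: two assumed features per instance, two known classes.
train = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["benign", "benign", "benign", "malware_a", "malware_a", "malware_a"]
model = fit_centroids(train, train_labels)

# Two instances near known classes and one far from both.
results = [predict(model, x) for x in [(1.1, 0.9), (5.1, 5.0), (9.0, 1.0)]]
print(results)
```

The threshold is the main design choice: a larger `slack` reduces false alarms on known classes but makes it easier for a genuinely new pattern, such as a zero-day sample, to be absorbed into an existing class.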