Profiling and Identifying Individual Users by Their Command Line Usage and Writing Style
Darusalam (100111555)
Supervisor: Associate Prof Helen Ashman

Overview
• Introduction
• Motivation
• Literature Review
• Research Question
• Methodology
• Result
• Contribution
• Future Work

Introduction
• Profiling groups things or individuals into categories based on their characteristics (N.P.Dau et al., 2000). For example:
  – Profiling -> patterns in how a user uses a computer
  – Profiling -> user identification
• This research aims to identify a user from natural language (the writing of Jane Austen and William Shakespeare) and from formal language (command line histories), based on an investigation of psychometric user characteristics.

Motivation
• Previous research focused on biometric characteristics.
• This minor thesis extends that work by focusing on a psychometric user characteristic.
• The research considers users' writing in two scenarios (natural and formal language), which can be analysed with n-grams in order to identify the user.

Literature Review
• Computer science -> profiling in online social networks
  – Ashman and Holland (in draft) examined identifying users through anomaly detection over a user model.
  – Ahmed & Traore (2005), of the Department of Electrical and Computer Engineering, University of Victoria, Canada, outline the use of behavioral biometrics for intrusion detection applications.
• N-gram based analysis
  – Luo et al. (2010) present an n-gram-based malicious code feature extraction algorithm with a statistical language model.
  – N-gram analysis based on author profiles is also applied in authorship attribution (Kešelj et al. 2003).

Research Question
The research will answer the following questions:
• Q1: Does the use of n-gram analysis to profile users' writing styles in social network situations allow accurate user identification?
  a. If so, does it allow both positive and negative identification?
• Q2: Does the use of n-gram analysis to profile users' command usage in their command line histories allow accurate user identification?
  a. If so, does it allow both positive and negative identification?
• Q3: If profiling of both writing styles and command usage allows accurate user profiling, which is the more accurate?

Research Question (cont.)
[Figure: positive identification (Machine A, Machine B) and negative identification (Machine A, Machine B)]

Methodology
• What is n-gram analysis? An n-gram is a language model based on collinear relation (Luo et al., 2010), and 'N-gram is a subset of overlapping n-sized portion of a series of letters, words, syllables, phonemes or base pairs' (Ashman and Holland (in draft)).
• 3-grams, 5-grams, 11-grams and 15-grams are used for the analysis.
• Normalization: the data are normalized as percentage, max-min and z score.
• T-test: a method to compare the styles of two paired samples.

N-gram (3, 5, 11 & 15) -> Normalization (percentage, max-min & z score) -> T-test (t-Test: paired two sample for means) -> Result

Formal language comparison
• Positive Identification: User1-history1, User1-history2, User1-history3, User1-history4
• Negative Identification: User2-history1, User3-history1, User4-history1, User5-history1

Natural language comparison
[Figure: Jane Austen (positive identification), William Shakespeare (positive identification), and negative identification between the two]

Result of Formal Language Comparison
Positive Identification (User1 command line history for different machines)

N-gram    Normalization  Positive result   Negative result   Total    Rate
          type           (correct ident.)  (false ident.)    correct  percentage
3-gram    Percentage     6/6               0                 6        100%
3-gram    Max-Min        1/6               5                 1        16.6%
3-gram    Z Score        6/6               0                 6        100%
5-gram    Percentage     6/6               0                 6        100%
5-gram    Max-Min        1/6               5                 1        16.6%
5-gram    Z Score        6/6               0                 6        100%
11-gram   Percentage     6/6               0                 6        100%
11-gram   Max-Min        1/6               5                 1        16.6%
11-gram   Z Score        6/6               0                 6        100%
15-gram   Percentage     6/6               0                 6        100%
15-gram   Max-Min        1/6               5                 1        16.6%
15-gram   Z Score        6/6               0                 6        100%

Result of Formal Language Comparison (cont.)
Negative Identification: User1 vs (User2, User3, User4, User5)

N-gram    Normalization  Positive result   Negative result   Total    Rate
          type           (correct ident.)  (false ident.)    correct  percentage
3-gram    Percentage     23/30             7                 23       76.7%
3-gram    Max-Min        20/30             8                 20       66.7%
3-gram    Z Score        19/30             11                19       63.3%
5-gram    Percentage     13/30             17                13       43.3%
5-gram    Max-Min        17/30             8                 17       56.7%
5-gram    Z Score        14/30             11                14       46.7%
11-gram   Percentage     16/30             14                16       53.3%
11-gram   Max-Min        26/30             4                 26       86.7%
11-gram   Z Score        16/30             14                16       53.3%
15-gram   Percentage     23/30             17                23       76.7%
15-gram   Max-Min        24/30             16                24       80.0%
15-gram   Z Score        24/30             16                24       80.0%

Result of Natural Language Comparison
Positive Identification: Jane Austen writing style

N-gram    Normalization  Positive result   Negative result   Total    Rate
          type           (correct ident.)  (false ident.)    correct  percentage
3-gram    Percentage     18/18             0                 18       100%
3-gram    Max-Min        2/18              16                2        11.11%
3-gram    Z Score        18/18             0                 18       100%
5-gram    Percentage     18/18             0                 18       100%
5-gram    Max-Min        2/18              5                 2        11.11%
5-gram    Z Score        18/18             0                 18       100%
11-gram   Percentage     18/18             0                 18       100%
11-gram   Max-Min        6/18              3                 6        33.33%
11-gram   Z Score        18/18             0                 18       100%
15-gram   Percentage     18/18             0                 18       100%
15-gram   Max-Min        9/18              3                 9        50%
15-gram   Z Score        18/18             0                 18       100%

Result of Natural Language Comparison (cont.)
Negative Identification: Jane Austen vs William Shakespeare

N-gram    Normalization  Positive result   Negative result   Total    Rate
          type           (correct ident.)  (false ident.)    correct  percentage
3-gram    Percentage     0/16              16                0        0%
3-gram    Max-Min        16/16             0                 16       100%
3-gram    Z Score        0/16              16                0        0%
5-gram    Percentage     0/16              16                0        0%
5-gram    Max-Min        16/16             0                 16       100%
5-gram    Z Score        0/16              16                0        0%
11-gram   Percentage     0/16              16                0        0%
11-gram   Max-Min        16/16             0                 16       100%
11-gram   Z Score        0/16              16                0        0%
15-gram   Percentage     0/16              16                0        0%
15-gram   Max-Min        2/16              14                2        12.5%
15-gram   Z Score        0/16              16                0        0%

Average rate for 15-gram: 4.17%

Result Summary
• Formal Language
  1. Positive Identification: successful user identification
     Normalization type   Success total
     Percentage           100%
     Max-min              16.66%
     Z score              100%
  2. Negative Identification: successful user identification
     Normalization type   Success total
     Percentage           62.50%
     Max-min              72.50%
     Z score              60.83%

Result Summary (cont.)
• Natural Language
  1. Positive Identification: successful user identification
     Normalization type   Success total
     Percentage           100%
     Max-min              26.38%
     Z score              100%
  2. Negative Identification: failed to identify user
     Normalization type   Success total
     Percentage           0%
     Max-min              100%
     Z score              0%
• Which is the most accurate? Formal language.

Contribution
• New methods for user identification in formal language and natural language.
• It could enable intrusion detection where intruders masquerade as real users.

Future Work
• For formal language, compare histories from a single machine divided into periods of time.
• Use other n-gram sizes, e.g. 2, 4, 6, 7, 8, 9, 10, 12, 13, since each size gives a different result.
• Account for users who may have more than one writing style.
• Compare both participants in all possible scenarios.

Any Questions?
Thank you

References
• ALMASSIAN, N., AZMI, R. & BERENJI, S. 2009. AIDSLK: An Anomaly Based Intrusion Detection System in Linux Kernel. Information Systems, Technology and Management, 232-243.
• ASHMAN, H. & HOLLAND, S. Profiling and identifying users with n-gram analysis on their command line histories.
• BALDUZZI, M., PLATZER, C., HOLZ, T., KIRDA, E., BALZAROTTI, D. & KRUEGEL, C. 2010. Abusing Social Networks for Automated User Profiling. In: JHA, S., SOMMER, R. & KREIBICH, C. (eds.) Recent Advances in Intrusion Detection. Springer Berlin / Heidelberg.
• BHATTACHARYYA, P., GARG, A. & WU, S. F. Social Network Model Based on Keyword Categorization. Social Network Analysis and Mining, 2009. ASONAM '09. International Conference on Advances in, 20-22 July 2009. 170-175.
• BOYD, D. M. & ELLISON, N. B. 2008. Social network sites: Definition, history, and scholarship. Journal of Computer Mediated Communication, 13, 210-230.
• CHA, B. 2005. Host Anomaly Detection Performance Analysis Based on System Call of Neuro-Fuzzy Using Soundex Algorithm and N-gram Technique. Proceedings of the 2005 Systems Communications. IEEE Computer Society.
• DWYER, C., HILTZ, S. R. & PASSERINI, K.
Trust and privacy concern within social networking sites: A comparison of Facebook and MySpace. 2007. Citeseer.
• HUBBALLI, N., BISWAS, S. & NANDI, S. Sequencegram: n-gram modeling of system calls for program based anomaly detection. Communication Systems and Networks (COMSNETS), 2011 Third International Conference on, 4-8 Jan. 2011. 1-10.
• KEŠELJ, V., PENG, F., CERCONE, N. & THOMAS, C. N-gram-based author profiles for authorship attribution. 2003. Citeseer.
• PENG, F., SCHUURMANS, D., KEŠELJ, V. & WANG, S. Language Independent Authorship Attribution using Character Level Language Models.
• MAIA, M., ALMEIDA, J. & ALMEIDA, V. 2008. Identifying user behavior in online social networks. Proceedings of the 1st Workshop on Social Network Systems. Glasgow, Scotland: ACM.
• MCKINNEY, S. & REEVES, D. S. 2009. User identification via process profiling: extended abstract. Proceedings of the 5th Annual Workshop on Cyber Security and Information Intelligence Research: Cyber Security and Information Intelligence Challenges and Strategies. Oak Ridge, Tennessee: ACM.
• N.P.DAU, V., RAU, V. & J.TEMPLETON, S. 2000. Profiling users in the UNIX OS Environment.
• PANNELL, G. & ASHMAN, H. 2010. User Modelling for Exclusion and Anomaly Detection: A Behavioural Intrusion Detection System. In: DE BRA, P., KOBSA, A. & CHIN, D. (eds.) User Modeling, Adaptation, and Personalization. Springer Berlin / Heidelberg.
• RAAD, E., CHBEIR, R. & DIPANDA, A. User Profile Matching in Social Networks. Network-Based Information Systems (NBiS), 2010 13th International Conference on, 14-16 Sept. 2010. 297-304.
• REDDY, D. K. S. & PUJARI, A. K. 2006. N-gram analysis for computer virus detection. Journal in Computer Virology, 2, 231-239.
• VOSECKY, J., DAN, H. & SHEN, V. Y. User identification across multiple social networks. Networked Digital Technologies, 2009. NDT '09. First International Conference on, 28-31 July 2009. 360-365.
• WEI, W., XIAOHONG, G. & XIANGLIANG, Z. Profiling program and user behaviors for anomaly intrusion detection based on non-negative matrix factorization. Decision and Control, 2004. CDC. 43rd IEEE Conference on, 14-17 Dec. 2004. 99-104 Vol.1.
• ZHANG, B., YIN, J., HAO, J., WANG, S. & ZHANG, D. 2007. New Malicious Code Detection Based on N-Gram Analysis and Rough Set Theory. In: WANG, Y., CHEUNG, Y.-M. & LIU, H. (eds.) Computational Intelligence and Security. Springer Berlin / Heidelberg.
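
The pipeline on the Methodology slide (n-gram extraction -> normalization -> paired t-test) can be sketched roughly as follows. This is a minimal illustration, not the code used in the thesis. The slides do not say how the steps were implemented, so several choices here are assumptions: character-level n-grams, normalization applied to raw n-gram counts, n-grams missing from one profile counted as 0, and the function names and sample histories, which are hypothetical. Only the t statistic and degrees of freedom are computed; the significance threshold used to accept or reject an identification is not given in the slides and is left out.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def char_ngrams(text, n):
    """Count the overlapping n-character substrings of text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def percentage(counts):
    """Express each n-gram count as a percentage of all n-grams."""
    total = sum(counts.values())
    return {g: 100.0 * c / total for g, c in counts.items()}


def max_min(counts):
    """Rescale counts to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # all counts equal: avoid dividing by zero
    return {g: (c - lo) / span for g, c in counts.items()}


def z_score(counts):
    """Standardize counts to zero mean and unit sample standard deviation."""
    values = list(counts.values())
    mu = mean(values)
    sd = (stdev(values) if len(values) > 1 else 0.0) or 1.0
    return {g: (c - mu) / sd for g, c in counts.items()}


def paired_t(profile_a, profile_b):
    """Paired two-sample t statistic over the union of both profiles' n-grams.

    Assumption: an n-gram absent from a profile contributes 0.
    Returns (t, degrees_of_freedom).
    """
    grams = sorted(set(profile_a) | set(profile_b))
    diffs = [profile_a.get(g, 0.0) - profile_b.get(g, 0.0) for g in grams]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two n-grams to compute a t statistic")
    sd = stdev(diffs)
    if sd == 0:
        # Every difference is identical; t is 0 only if they are all zero.
        return (0.0 if mean(diffs) == 0 else float("inf")), n - 1
    return mean(diffs) / (sd / sqrt(n)), n - 1


if __name__ == "__main__":
    # Hypothetical command line histories, for illustration only.
    history_a = "ls -la\ncd src\ngit status\ngit commit -m fix\n"
    history_b = "ls -la\ncd src\ngit status\ngit push\n"
    for n in (3, 5):
        a = max_min(char_ngrams(history_a, n))
        b = max_min(char_ngrams(history_b, n))
        t, df = paired_t(a, b)
        print(f"{n}-gram: t = {t:.3f}, df = {df}")
```

Swapping `max_min` for `percentage` or `z_score` in the loop gives the other two normalization variants compared in the result tables.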