School of Computer and Information Science
CIS Minor Thesis
Profiling and Identifying Individual Users by Their Command Line Usage and Writing Style
Darusalam
Date:
Supervisor: AsPr Helen Ashman

Abstract

Profiling and identifying individual users is an approach that helps recognize intrusions in a computer system. User profiles are important in many applications because they record highly user-specific information: profiles are built to record information about users or to let users share experiences with each other. User profiles are therefore used to aggregate relevant information about a user's activities and to identify patterns in their behavior. In a computer security setting, anomaly detection can be performed over these user profiles to reauthenticate the user as they interact with the computer.

This thesis extends previous research on reauthenticating users with their user profiles. Reauthentication involves continually checking that a user is who they claim to be, and not someone masquerading as a legitimate user. This thesis focuses on the potential to add psychometric user characteristics to the user model so as to detect unauthorized users who may be masquerading as a genuine user. The specific characteristics investigated here are a user's habitual command line usage, which is a formal language style, and a user's prose writing style, which is a natural language style. This thesis analyses these two writing styles to determine whether they can be used to identify a user as part of a reauthentication process. It also aims to determine whether either is a more accurate method for identifying users than the other.

Five participants are involved in the investigation of formal language user identification. We analyze their formal language usage style in the form of their command line habits, comparing sequences of commands; where these are similar enough we can positively identify the user.
However, if the result is clearly dissimilar we can negatively identify the user, that is, show that they are different users. Additionally, we analyze the natural language of two famous writers, Jane Austen and William Shakespeare, in their written works to determine whether the same principles can be applied to natural language use.

This thesis uses the n-gram analysis method to characterise each user's style, which can potentially provide accurate user identification. As a result, n-gram analysis of a user's typed inputs offers another method for intrusion detection, as it may be able to both positively and negatively identify users. The contribution of this research is to assess the use of a user's writing styles in both formal language and natural language as a user profile characteristic that could enable intrusion detection where intruders masquerade as real users.

Contents

Abstract
Introduction
1.1 Scope and Limits
1.2 Motivation
1.3 Field of Thesis
1.4 Research Questions
1.5 Contribution
Chapter 2
Literature Review
2.1 Computer science profiling in online social networks
2.2 n-gram analysis method
2.3 Anomaly Detection
Chapter 3
Methodology
3.1 Experimental Set Up
3.2 n-gram analysis
3.3 T-Test
3.4 Stage 1 profiling using natural language
3.5 Stage 2 profiling Using formal language
3.6 Stage the accurate comparison result
3.7 User Samples
Chapter 4
Results
4.1 Positive Identification using Formal Language
4.1.1 Positive Identification Result of 3 Gram
4.1.2 Positive Identification Result of 5 Gram
4.1.3 Positive Identification Result of 11 Gram
4.1.4 Positive Identification Result of 15 Gram
4.2 Negative Identification using Formal Language
4.2.1 Negative Identification Result of 3 Gram
4.2.2 Negative Identification Result of 5 Gram
4.2.3 Negative Identification Result of 11 Gram
4.2.4 Negative Identification Result of 15 Gram
4.2.5 User1 and others comparison for 3-gram
4.2.6 User1 and others comparison for 5-gram
4.2.7 User1 and others comparison for 11-gram
4.2.8 User1 and others comparison for 15-gram
4.3 Summary of Formal Language
4.3.1 Positive Identification
4.3.2 Negative Identification
Chapter 5
Results
5.1.1 Positive Identification Result of 3 Gram
5.1.2 Positive Identification Result of 5 Gram
5.1.3 Positive Identification Result of 11 Gram
5.1.4 Positive Identification Result of 15 Gram
5.2 Negative Identification Using Natural Language
5.2.1 Negative Identification Result of 3 Gram
5.2.2 Negative Identification Result of 5 Gram
5.2.3 Negative Identification Result of 11 Gram
5.2.4 Negative Identification Result of 15 Gram
5.3 Summary of Natural Language
5.3.1 Positive Identification
5.3.2 Negative Identification
Chapter 6
Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
Appendix A

Chapter I
Introduction

Profiling is a way of grouping things or individuals into categories based on characteristics such as situation, appearance and traits (N.P.Dau, Rau & J.Templeton 2000). In this thesis, profiling means gathering information about a user's activities; anomaly detection can then be performed over the resulting user profile to allow user identification. This thesis investigates psychometric user characteristics so as to be able to identify a user.
There are many examples of profiling in the area of information technology (N.P.Dau, Rau & J.Templeton 2000), such as profiling users to learn their computer usage patterns. Such profiles allow system resources to be used and distributed more efficiently and better services to be offered. In other areas, profiling is used for user identification in Internet-based commerce. Inter-social network functionalities and operations are necessary for several activities such as profiling and identifying user behavior (Raad, Chbeir & Dipanda 2010). User profiles are used to collect relevant information about a user's activities, and anomaly detection is performed over these user profiles. Furthermore, user profiles are important in a range of applications because they make it possible to record, and compare against, user-specific information (Ashman & Holland). According to Ashman & Holland, user profiles are a good method for reauthentication and intrusion detection.

This thesis is organized as follows. Chapter I gives a brief introduction, motivating the work and stating the research questions to be answered. Chapter 2 comprises a literature review (background and related work). Chapter 3 describes the methods used in this thesis. Chapters 4 and 5 discuss the results of the investigation, with Chapter 4 looking at the results from formal language analysis and Chapter 5 at the results from natural language analysis. Finally, Chapter 6 concludes and discusses future work.

1.1 Scope and Limits
This thesis will focus on evaluating the potential of two psychometric user characteristics, namely writing style in natural language (Jane Austen and William Shakespeare) and in formal language (command line histories). To evaluate these characteristics, this research will use the n-gram analysis method and will aim to identify users in two ways: positive user identification and negative user identification.
The work will not, however, implement these user characteristics in an intrusion detection system; it establishes whether they can be used in such a system.

1.2 Motivation
Profiling and identifying users can help recognize intrusions. According to Pannell and Ashman (2010), user profiling is already a necessary part of the personalization of information delivery, and they propose it as an approach to intrusion detection. One way of identifying attacks on a computer system is by profiling program and user behaviors (Wei, Xiaohong & Xiangliang 2004). Anomaly detection over a user profile can detect when an intruder is masquerading as a genuine user.

This thesis extends previous research that implemented an intrusion detection system based on biometric characteristics, such as keystroke analysis and mouse use, and psychometric characteristics, such as user prose style and favorite web pages (Pannell & Ashman 2010). However, this thesis will focus on one potential psychometric user characteristic and will consider whether a user's writing styles in two different scenarios can be assessed with n-gram analysis in order to identify the user. Users' writings may take the form of text in novels, books, blogs, tweets and emails; this is a form of natural language. Users' writings also occur in the way they interact with computers, issuing commands through a command line interface; this is a form of formal language. This research will perform exactly the same analysis on data of both types. It will determine firstly whether either form can be used successfully for user identification and, if so, which is the more effective of the two.

This research will analyze the two forms of data in two ways. The first is to check whether it can detect when the current user does not match the user profile and is hence an intruder: this is negative identification.
The second way is to detect whether the current user can unquestionably be verified as the true user: this is positive identification. Most intrusion detection systems assume that the user is genuine until anomalies or broken rules show otherwise; that is, they only make use of negative identification. However, it might be useful to constrain a user's activities until they positively identify themselves, perhaps not allowing the user to make significant changes until their current login session has been positively identified. In this research, the default position will be that the system has no evidence about the user's identity, other than the fact that the user managed to log in. Analyzing their activity after logging in should either give positive information that correlates strongly with the user's profile and confirms their identity, or it should mismatch the profile, in which case the user would be rejected from the system.

1.3 Field of Thesis
Computer science; Data Mining; n-gram Analysis

1.4 Research Questions
The research will answer the following questions:
Q1: Does the use of n-gram analysis to profile users' command usage in their command line histories allow accurate user identification?
a) If so, does it allow both positive and negative identification?
Q2: Does the use of n-gram analysis to profile users' writing styles in their natural language allow accurate user identification?
a) If so, does it allow both positive and negative identification?
Q3: If the profiling of both natural language writing styles and command usage allows accurate user profiling, which is the more accurate?
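To make the questions above concrete, the following is a minimal sketch of how an n-gram profile of a command history might be compared with a new session to give a positive, negative or inconclusive result. It is an illustration only and not the method of this thesis, whose comparison procedure is described in Chapter 3. The function names, the simple overlap score and the two thresholds (0.8 and 0.2) are all assumptions introduced for the example.

```python
from collections import Counter
from typing import Sequence


def ngram_profile(tokens: Sequence[str], n: int) -> Counter:
    """Count every run of n consecutive tokens (commands, words or characters)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(profile: Counter, sample: Counter) -> float:
    """Fraction of the sample's n-gram occurrences that also occur in the profile."""
    total = sum(sample.values())
    if total == 0:
        return 0.0
    shared = sum(count for gram, count in sample.items() if gram in profile)
    return shared / total


def identify(profile: Counter, sample: Counter,
             accept: float = 0.8, reject: float = 0.2) -> str:
    """Hypothetical decision rule: thresholds are illustrative, not from the thesis."""
    if not sample:
        return "inconclusive"  # session too short to contain any n-gram
    score = overlap(profile, sample)
    if score >= accept:
        return "positive"      # strong match: user confirmed
    if score <= reject:
        return "negative"      # clear mismatch: likely a different user
    return "inconclusive"


# Made-up command histories, used only to exercise the functions.
history = "cd src ls vim main.c make ls cd src ls vim main.c make".split()
profile = ngram_profile(history, 3)

same_user = ngram_profile("cd src ls vim main.c make".split(), 3)
other_user = ngram_profile("git status git diff git commit git push".split(), 3)
mixed = ngram_profile("cd src ls vim git status".split(), 3)

print(identify(profile, same_user))
print(identify(profile, other_user))
print(identify(profile, mixed))
```

The same functions accept a list of words or characters, so a prose sample could be profiled in the same way, for example with `ngram_profile(list(text), 3)` for character 3-grams. Keeping a separate accept and reject threshold leaves room for sessions that neither confirm nor rule out the user.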
1.5 Contribution
This research will contribute the following knowledge:
Proposing and assessing the usefulness of two psychometric characteristics for a user profile in an intrusion detection system (IDS), specifically comparing formal and natural language psychometric characteristics using n-gram analysis.
Distinguishing between positive and negative identification, and showing whether this distinction is practical.

Chapter 2
Literature Review

This chapter discusses previous research into several core aspects of this minor thesis. It reviews similar research involving n-gram analysis, together with additional literature that supports features of the project. n-gram analysis is one of the methods that will be used for the project. The papers reviewed here demonstrate how n-gram analysis has successfully profiled both features and users in other application areas.

This chapter is organized as follows. Section 2.1 reviews the literature on profiling users, for example in social networks. Section 2.2 looks at the n-gram analysis method and how it is implemented in many applications. Section 2.3 investigates anomaly detection in some applications.

2.1 Computer science profiling in online social networks
Much computer science research uses social networks for user profiling. Social networking is one of the applications that encourage users to be more active and permit them to create and maintain their own web pages (Maia et al. 2008). According to Vosecky et al. (2009), different social networks display and store user profile information in different ways on a user's web profile. Social networks have thus become one source for identifying user profiles.

This research continues previous research by Ashman and Holland (in draft), who examined anomaly detection over a user model as a means of identifying users. They classified user model characteristics into two classes.
Firstly, there are behavioral biometrics, which represent a user's physical characteristics, for example ability to use a mouse, habitual miskeying errors and typing speed. Secondly, there are psychometrics, which represent a user's personal preferences or decision-based characteristics, for example prose style and favored web pages. In addition, n-gram analysis is used to identify users over their command line histories. This research will focus on psychometric user characteristics.

Other work involves social networking for identifying user behavior (Maia et al. 2008). Online social networking lets users interact with other users. For example, they can upload and view content, rank favorites, choose friends, subscribe to other users and carry out other activities. The authors propose a methodology for characterising and identifying users' behaviors. YouTube is one of the social networking sites they studied, and they used a clustering algorithm to identify relevant user behaviors. The authors note that, in current Web 2.0 services, users have different ways of interacting with other users. As future work, they outline investigating user behavior in order to show appropriate, personalized advertisements, defining the different classes of user behavior, and providing valid performance models for their services. Their research takes a similar approach to identifying user behaviors in online social networks, but it aims to classify users rather than to identify them.

Pannell and Ashman (2010a) evaluate an intrusion detection system (IDS) that analyses a user's activities, and propose it as a way to help the administrator identify and respond quickly to an intrusion. The authors note that anomaly detection may be host-based, so that an individual user can be profiled by their normal usage patterns.
The authors planned to examine further characteristics such as writing style analysis; the research in this thesis takes up this future work and aims to enable intrusion detection by author identification.

Other work discusses the use of behavioral biometrics for intrusion detection applications (Ahmed & Traore 2005). In this paper the authors outline a new method for user profiling based on biometrics. According to the authors, the new method gives more accurate information than traditional statistical profiling methods because it is based on a user's biological characteristics. They also distinguish two types of biometrics: behavioral biometrics and physiological biometrics. The aim of the research is to detect the misuse of user identification.

Another work investigates the use of biometrics on smart cards (Alexandre 1997). The work is motivated by the weakness of passwords for access control and authentication, and the researcher aims to solve these problems with biometrics for user identification, such as fingerprints, handwritten signatures and voice recognition. The paper proposes a new method of biometric identification based on keyboard signature. The reliability and simplicity of the approach make it suitable for smart card applications. A neural network is applied, using supervised and self-organizing methods, to estimate efficiency and performance. However, the paper does not report results for the approach, owing to a lack of investigation, and it does not mention future work.

Another interesting study, by Pepyne, Hu & Gong (2004), analyses user profiling for computer security for particular kinds of users, such as insurance adjusters and bank tellers. They use logistic regression modeling and queuing theory to profile user behavior in computer usage. The aim of their work is to profile users who habitually use the computer in the same, regular way.
The paper also demonstrates the use of logistic regression modeling, queuing theory and profiling for misuse and intrusion detection for such users. The result is that profiling computer use could be effective for misuse and intrusion detection for this kind of user. A limitation is that the approach does not detect in real time, since the system can only make a decision once it has finished processing. The paper does not discuss future work to address this problem.

Another work investigates tag-based user profiling for social media recommendation (Hung et al. 2008). The authors note that tagging is now in regular use on some social media web pages. The paper describes a new approach to profiling users based on the tags linked to their personal content. The experiment involved 42,463 users, whose bookmarks and associated tags had already been collected and were used for comparison across different views. The paper does not clearly discuss future work to extend the research.

A further work states that the purpose of a user profile is to gather related information based on user interests (Reformat & Golmohammadi 2009). An ontology-based semantic similarity method is used to extend and maintain a user profile based on the user's web access behaviour in the music domain. However, there is a lack of supporting data to justify whether the method is effective, and no future work is discussed.

Other researchers evaluate users in social networks to identify related information, using a quantitative method based on principal component analysis. Interest in social networking has led millions of users to register with sites such as YouTube. As a result, social networks are becoming important places for business and marketing advertisement. To use social networking for advertising, it is important to classify users and to identify related users.
Researchers at the School of Computing and Information Technology, University of Wolverhampton, UK, describe the demographics of MySpace member profiles, using two samples of sizes 15,043 and 7,627. The analysis focuses on findings and conjectures about social networking; for instance, female members are more interested in friendship, whereas male members are more interested in dating. The authors conclude that MySpace users are typically 21, single and have a public profile, tend to be more interested in friendship, and log on once a week to keep in touch with friends.

One study aims to identify individual users by process profiling (McKinney & Reeves 2009). The authors investigated intrusion detection systems by collecting data from employees in a small organization over three weeks. They examined each employee's computer usage to build a user profile using a Naïve Bayes classifier. The authors report that this method successfully identified 98% of users, with an error rate of 4%. That research uses a different method to find individual user profiles, whereas our research will use n-gram analysis for author identification.

Another study (N.P.Dau, Rau & J.Templeton 2000) examined the UNIX operating system to identify the user based on the login host, the login time, the command set and the command set execution time of the profiled user. The authors outline two essential points. First, profiles are host-dependent, meaning that one user could have different profiles on different hosts. Second, profile drift occurs over time and takes two forms: forced profile drift and natural profile drift. That research differs from ours, which focuses on psychometric user characteristics. However, the issues it raises regarding profile drift will be relevant to future work after this project.

Another piece of related research on identifying users in social networks concerns trust and privacy (Dwyer, Hiltz & Passerini 2007).
The authors study privacy and trust on social networking sites, in particular MySpace and Facebook. They report that both MySpace and Facebook have the same privacy issues. For example, Facebook members are as willing as MySpace members to share their information with other members. The results show that in online interaction trust is not important for creating new relationships, because the interaction is not face-to-face. In conclusion, privacy and trust in social networking are not yet a concern for behaviour and activity. This paper shows another interesting type of research that uses social networks for identifying users.

Another interesting study focuses on the connection between network topology and the semantic similarity of user keywords (Bhattacharyya, Garg & Wu 2009). A forest model is used to represent categories of keywords and a notion of distance between keywords across multiple category trees. The authors then use the keyword distance to define a similarity function between a pair of users and show how a social network topology could be modelled based on similarity. Finally, using a simulated social graph, they validated the social network topology model by contrasting it with a real social graph dataset. In conclusion, keywords offer an effective way of analysing and modelling online social networks. In future work they will explore other methods, focused on machine learning techniques, to compare against the forest structure. Their research uses the semantic similarity of user keywords, whereas our research will analyse prose text and command lines.

Takeda et al. (2000) address characteristic expressions in literary works. Their problem takes the works of one writer as positive examples and works by another writer as negative examples, applied in particular to Japanese literature (Waka poems and prose texts). The method is to create a list of substrings ordered by goodness for the writer, and to have a human expert examine the first part of the list.
They propose to assist the human expert in two ways: by restricting the list to prime substrings, and by providing a way of browsing that focuses on the strings and their content. They were successful for Waka poems but not for prose texts; improving identification for prose texts is left as their future work. Our research takes a similar approach in applying characteristic expressions to prose text.

There is also research on the misuse of social networks for automated user profiling (Balduzzi et al. 2010). The authors analyse a weakness that arises when users register with social networking sites such as Facebook using their email address. Starting from a collection of 10.4 million email addresses, they automatically identified 1.2 million user profiles associated with these addresses. The paper presents a thorough analysis covering several popular social networks, such as Facebook, MySpace, XING and Twitter.

Raad et al. (2010) investigate inter-social network solutions, in particular functionalities and operations. The researchers also concentrate on user profile matching. In this paper they use a framework to match user profiles across social networks. The framework is able to match the largest number of user profiles that refer to the same person, including ones that current approaches are unable to detect.

Other research investigates how to identify a user based on similar profiles (Vosecky, Dan & Shen 2009). The authors collected user profiles from social networks such as MSN and Facebook. The profiles were used to create tools for profile comparison, to decide whether similar profiles belong to the same person. A vector-based comparison algorithm is used to compare user profiles. The researchers also evaluate the results of the profile comparison algorithm in two phases, training and testing.

2.2 n-gram analysis method
Many papers use the n-gram analysis method, and it is implemented in many applications.
n-gram analysis is used for many purposes, including computer virus detection, author profiling and language-independent authorship attribution. n-gram analysis is one of the methods that will be used for this project. According to Luo et al. (2010), the n-gram method is a ‘language model based on collinear relation’. This section explores some of these uses of n-gram analysis and evaluates them for this project.

Luo et al. (2010) outline an n-gram-based malicious code feature extraction algorithm that uses a statistical language model. Using a trigram (3-gram) model, they can characterise malicious code features and hence detect the virus. As a result, they reduce the time and space required compared with detecting viruses by heuristic schemes, which are costly and ineffective. The use of n-gram analysis offers efficiency and correctness in the analysis of malicious code.

Another approach also uses n-gram analysis for virus detection (Reddy & Pujari 2006). The authors combined several classifiers, such as SVM, IBK and Decision Tree, using Dempster-Shafer theory to obtain a more accurate classifier than any single one. However, n-gram analysis for virus detection lacks semantic awareness, and as a result they had difficulty analysing the relevant n-grams they found.

n-gram analysis has also been used to detect malicious code automatically (Zhang et al. 2007). Experiments were carried out on 201 different Windows executable files (109 benign and 92 malicious). The results showed that the n-gram method could successfully distinguish between malicious code and benign code.

Other research in the area of anomaly detection uses n-gram modelling to create a normal profile (Hubballi, Biswas & Nandi 2011). The authors describe an investigation into building program-based anomaly detection using an Occurrence Frequency model. The model is effective for short system call sequences.
In addition, the method builds a normal program model that can be applied even with some level of infection in the training dataset. The authors also mention that an advantage of the method is that detection becomes immune to accidental ‘infection’ in the training dataset. The n-gram tree method is an effective way to create a profile of normal behaviour. The benefits of n-gram tree analysis are that it is easy to use and fast in operation.

Abou-Assaleh et al. (2004) also use n-gram analysis to detect malicious code. They use an n-gram analysis method to produce automatic signatures from benign software and malicious code, because n-gram analysis is able to classify hidden malicious code and benign code. The n-gram analysis method achieves 90% accuracy on the training data for malicious and benign code and 91% accuracy under five-fold cross-validation.

n-gram analysis based on author profiles is also applied to authorship attribution (Kešelj et al. 2003). The researchers outline an automated authorship attribution approach that creates author profiles as vectors of language-model features, similarities and weights. They use n-gram analysis to obtain author profiles in a language-independent way. The experiments use data in several languages, for instance English, Greek and Chinese, to demonstrate the effectiveness of the approach in terms of language independence. For the Greek data sets they obtained uniformly better results than previously reported.

Other research uses character-level language models for authorship attribution (Keselj & Wang). The authors examined authorship attribution with character-level n-gram ranking methods. They explain language independence and the theoretical principles in a simple way. To show the effectiveness of both approaches they report experimental results on English, Greek and Chinese data. Their approach achieved state-of-the-art performance in every case.
For example, there was an 18% accuracy improvement for the Greek data set while using a simpler method than previous investigations.

In summary, n-gram analysis has many kinds of application, for instance cryptanalysis, malicious executable detection, language classification and randomness. In other literature, n-gram analysis is used for many purposes, including computer virus detection, author profiling and language-independent authorship attribution. n-gram analysis is one of the methods that will be used for this project.

2.3 Anomaly Detection

The purpose of the project is to apply anomaly detection over user profiles. According to Grzech (2006), anomaly detection is a fundamental part of an intrusion detection system. We also look at other work to compare how different methods are used in profiling and identification for anomaly detection.

Grzech (2006) examines different architectures for anomaly detection, such as multiagent systems that support a classification system in determining whether activity is normal or abnormal. The author also provides a simple illustration of a hierarchical architecture for a distributed anomaly detection system, which can be implemented within the structure of a multiagent decision support system. The author explains anomaly detection in detail, dividing activity into two categories, normal and abnormal, and gives a brief example of a hierarchical anomaly detection system.

Other research investigates anomaly-based intrusion detection in the Linux kernel (Almassian, Azmi & Berenji 2009). A sufficient feature list was assembled to differentiate between normal and abnormal behaviour. The model introduces a new tool into the Linux kernel as a protection module, which logs initial data to prepare the feature list. A support vector machine (SVM) was used to recognise and classify the input vectors.
The research was evaluated in three experiments, covering one-class SVM, binary classification and sequences of delayed samples; however, future work is not discussed.

Wang et al. (2004) outline the use of non-negative matrix factorization (NMF) to profile user behaviour and normal program behaviour for anomaly detection. The approach uses audit data from system calls and UNIX commands as its information sources. Models of normal program and user behaviour are built from features of this data, and any deviation from the normal model above a predetermined threshold is regarded as an anomaly. The authors tested the method with system call data from the University of New Mexico and Unix command data from the AT&T research lab. The method aims to improve computational expense and detection accuracy, and to be usable for real-time intrusion detection. The advantages and disadvantages of NMF are also discussed for comparison.

Okazaki et al. (2002) study two models for intrusion detection: anomaly intrusion detection (AID) and misuse intrusion detection (MID). Both models analyse statistics of processes and of user behaviour under normal conditions, and then verify whether the system is being operated in a dissimilar manner. Methods based on anomaly intrusion detection are able to identify new intrusion methods. On the other hand, they require the statistics of normal use and the data describing users' behaviour to be kept up to date. MID needs some system resources to identify intrusions.

Other research uses system calls for anomaly detection, focusing on Neuro-Fuzzy learning and the Soundex algorithm (Cha 2005). It is used to convert the variable-length data and selected features into a fixed-length form for the learning system.
Two methods, Neuro-Fuzzy and n-gram, are used for anomaly intrusion detection. To detect intrusions, the authors classified sessions and generated host behaviour terms by changing the variable-length data into fixed-length patterns.

Chapter 3 Methodology

This research aims to identify a user from their natural language (the writing styles of Jane Austen and William Shakespeare) and their formal language (command line history). This thesis uses the n-gram analysis method for author identification; this is an established authorship attribution method (as discussed in 2.2 above). Furthermore, the research also evaluates how quickly the system learns this characteristic of the user model.

The structure of the method is as follows. Section 3.1, the experimental set up, describes how the software counts n-grams for natural language and formal language. Section 3.2 gives a short introduction to n-gram analysis, with examples of 3-grams over a binary string and a sentence. Section 3.3 describes the types of t-test and which type we use. Section 3.4 shows how the n-gram analysis method will be used to identify users by their use of natural language. Section 3.5 shows how the n-gram analysis method will be used to identify users by their use of formal language, such as commands issued in a command line history. Section 3.6 shows how the accuracy of the two methods will be compared. Section 3.7 provides information about the user study.

3.1 Experimental Set Up

This section explains how the application produces the n-gram frequencies. The software was written in the Java programming language and has two classes, “n-gram.java” and “Data.java”. The program is run with the command “java n-gram [n]”, where n is the size of the n-gram.
The software produces the n-gram frequencies, which are placed in the Comma Separated Value (“csv”) folder and then loaded into Microsoft Excel. The ‘csv’ folder also contains the history data, which is in txt format. We use this software to count the n-grams in the history data of each user's writing style. The software was provided by previous research (Ashman and Holland). We use it to perform four types of n-gram analysis, namely 3-gram, 5-gram, 11-gram and 15-gram.

3.2 n-gram analysis

An n-gram is a contiguous sequence of n letters, words or phonemes. An n-gram of size 1 is called a unigram, size 2 a bigram, size 3 a trigram and size 4 a four-gram; the general case is called an n-gram. An n-gram analysis counts the frequency of n-grams in a given file. For example, the binary string 1110010000101010010000 has the following character-level trigrams:

111, 110, 100, 001, 010, 100, 000, 000, 001, 010, 101, 010, ..., 000

The sentence “in this work we aim to get the certain knowledge” has the following word-level 3-grams:

in this work
this work we
work we aim
we aim to
aim to get
to get the
get the certain
the certain knowledge

And the phrase “in this work” has the following character-level 3-grams:

“in ”, “n t”, “ th”, “thi”, “his”, “is ”, “s w”, “ wo”, “wor”, “ork”

This project uses varying sizes of n-gram, namely 3-gram, 5-gram, 11-gram and 15-gram. Firstly, we will evaluate the use of n-gram analysis of user-generated formal language, in the form of command line histories, to profile users' command usage. Secondly, we will evaluate the use of n-gram analysis of natural language to profile users and to check whether it gives accurate user identification. After that we will compare the writing styles of the users and see how different, or how significant, their patterns are in terms of natural language and formal language.
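The 3-gram examples in Section 3.2 can be reproduced with a short sketch. This is an illustration in Python only, not the Java software used in this research. The function names `char_ngrams`, `word_ngrams` and `ngram_counts` are hypothetical and chosen for this example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every contiguous run of n characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(sentence, n):
    """Return every contiguous run of n whitespace-separated words."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts(text, n):
    """Count how often each character-level n-gram occurs in text."""
    return Counter(char_ngrams(text, n))
```

For instance, `char_ngrams("in this work", 3)` yields the ten character-level 3-grams listed in Section 3.2, and `ngram_counts("1110010000101010010000", 3)` gives the frequency of each trigram in the binary string.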
Next, we will visualise their n-gram patterns graphically to view their frequency patterns.

3.3 T-Test

The t-test is a method for deciding whether two data sets (samples) are similar or dissimilar, and hence whether they could have come from the same population. It assesses whether the means of two groups are statistically different from each other. We will use t-tests to assess both natural language and formal language samples, between two samples from the same user (for positive identification purposes) and between two samples from different users (for negative identification purposes). We next consider which form of t-test is appropriate to this research.

Type of t-test

One sample t-test
The one-sample t-test is used to decide whether a specific sample comes from a specific population, for example, whether a specific sample of university students is similar to or different from university students in general. In the current research we are comparing series of words or commands, and while it may later be feasible to identify a user from a single n-gram value, at this early stage it is more appropriate to decide whether individuals can be identified from larger quantities of their writing.

Independent t-test
The independent t-test, or two-sample t-test, is used to determine whether the means of two unrelated groups are statistically similar or different. For example, it can show whether female and male university students differ on some psychological characteristic. In this research, the samples may not be unrelated, especially when comparing two samples from the same user.

Dependent t-test
The dependent t-test is also called the paired-group t-test, correlated-group t-test, matched-groups t-test or dependent-group t-test.
This t-test is used to compare two related samples (matched or related in some way) that are each measured once, or the same sample measured on two separate occasions. For example, to find out the effect of a particular drug for insomnia, we would compare the same patients before and after they take the drug. This is highly suitable to this research, as we need to positively identify a user by comparing a current sample of the user’s writing to an older sample of their writing.

From the explanation above we conclude that the dependent t-test, or paired-group t-test, is the most suitable method for our investigation. We use the t-test by proposing the following competing hypotheses:

The test hypothesis is that the population means behind the two samples are different.
The null hypothesis is that the population means behind the two samples are the same.

A probability value p is output by the t-test. This value is compared with the chosen level of significance α to conclude the test result. A common default is α = 0.05:

If the probability value is equal to or less than the level of significance, we reject the null hypothesis and conclude that the two samples of writing are different.
If the probability value is more than the level of significance, we accept the null hypothesis and conclude that the two samples of writing style are the same.

Normalization of samples

Before performing any t-test we examine the distribution of the data collected for each n-gram size to see whether it is normal or non-normal. If the data is non-normal we transform it to normal data. This is because the samples we are analyzing are of radically different sizes.
While it would be possible to choose subsamples from each sample so that each subsample is an identical size, we elected to normalize each whole sample instead, as command line users in particular may have different tasks at different times, and a subsample may not accurately reflect the user’s command line habits. By normalizing the samples, we most accurately preserve each user’s writing style, but at the same time cast the samples into the same numerical range so that different sample sizes do not confound the results.

We use three types of normalization to obtain a normal distribution: percentage normalization, max-min normalization and Z score normalization. Firstly, percentage normalization takes the count of each n-gram, divides it by the total count of all n-grams and multiplies by one hundred. Secondly, max-min normalization takes the n-gram count and divides it by the difference between the maximum and minimum n-gram counts. Lastly, Z score normalization subtracts the average of all the n-gram counts from each n-gram count and divides by the standard deviation. We will assess all three normalization methods in this research to determine which is most appropriate for the task of identifying users.

3.4 Stage 1 profiling using natural language

This natural language stage will evaluate the use of n-gram analysis to profile users according to their use of natural language.

Research question 1: Does the use of n-gram analysis to profile users’ writing styles in a social network situation allow accurate user identification?

Study: the first study will analyse whether the use of n-gram analysis to profile users leads to accurate user identification, that is, whether it allows positive and negative identification of users. We need to create n-gram spectra of known users to populate their profiles. So for each user, we will create many n-gram spectra, each for a different value of n.
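Before spectra are compared, their counts are normalized as described in Section 3.3. The three normalizations can be sketched as follows. This is a minimal Python illustration and not the procedure actually carried out in Excel for this research. The function names are hypothetical. The max-min version assumes the textbook form, which subtracts the minimum before dividing by the range, and the Z score version assumes the population standard deviation; the thesis text specifies neither detail.

```python
from statistics import mean, pstdev


def percentage_normalise(counts):
    """Each count as a percentage of the total of all counts."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]


def min_max_normalise(counts):
    """Assumed textbook form: (count - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        raise ValueError("all counts are equal, so the range is zero")
    return [(c - lo) / (hi - lo) for c in counts]


def z_score_normalise(counts):
    """(count - mean) / standard deviation; population deviation is an assumption."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        raise ValueError("all counts are equal, so the deviation is zero")
    return [(c - mu) / sigma for c in counts]
```

All three functions take a list of n-gram counts for one sample and return a list of the same length, so samples of very different sizes end up in comparable numerical ranges.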
Each n-gram spectrum will consist of an indexed list of values, where the index runs from 0 to 26^n - 1; that is, there is an entry in the list for every possible combination of letters of the alphabet. For example:

A = 0, B = 1, C = 2, ..., Z = 25

For 3-grams, taking 3 letters at a time, all possible values lie between:

AAA BAA CAA ... ZAA
AAB
...
AAZ

Each letter is assigned a value 0...25, so each 3-gram is a number in base 26. For example:

ABC = 0 x 26^2 + 1 x 26^1 + 2 x 26^0 = 0 + 26 + 2 = 28

The index of an n-gram is this calculated value, e.g. ABC = 28 is the index in the list for that 3-gram.

Note that we will only be doing this task for the letters a..z initially (ignoring case), but if the method is found to be promising, it will be extended to include all ASCII characters.

For each sample, we calculate an n-gram spectrum of the text. We then compare n-gram spectra generated by the same author from different samples (in this case, different books), as well as comparing n-gram spectra generated by different authors. As the authorship is ‘known’, the comparison will determine whether the method is accurate enough to identify when the authors are the same or are different. That comparison will be done via a t-test.

For a given value of n, we calculate the n-gram spectrum of the current user’s input text and compare it with the spectrum for the same value of n in the user’s profile. If the t-test shows they have little difference, then the user is positively identified (i.e. the same author), but if the differences are significant, the user is negatively identified (i.e. different authors).

Part of the research is to decide the most useful values of n in the n-gram analysis, for example whether shorter or longer n-grams give the most accurate identification.

3.5 Stage 2 profiling Using formal language

This formal language stage will evaluate the use of n-gram analysis to profile users’ command usage in their command line histories.
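The base-26 indexing and the comparison step described in Section 3.4 can be sketched as follows. This is a Python illustration only, not the Java software used in this research. All function names are hypothetical. Skipping any window that contains a non-letter is an assumption, since the thesis only states that letters a..z are used and case is ignored. The sketch computes the paired t statistic but leaves the p-value lookup to a statistics package.

```python
import math
from statistics import mean, stdev

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def ngram_index(gram):
    """Read a letter n-gram as a base-26 number (a=0, ..., z=25), ignoring case."""
    index = 0
    for ch in gram.lower():
        index = index * 26 + ALPHABET.index(ch)
    return index


def spectrum(text, n):
    """Return a list of 26**n counts, one slot per possible letter n-gram."""
    text = text.lower()
    counts = [0] * (26 ** n)
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # Assumption: windows containing any non-letter are skipped.
        if all(ch in ALPHABET for ch in gram):
            counts[ngram_index(gram)] += 1
    return counts


def paired_t_statistic(a, b):
    """t statistic of the paired (dependent) t-test on two equal-length lists.

    Raises ZeroDivisionError if every pairwise difference is identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def identify(p_value, alpha=0.05):
    """Decision rule from Section 3.3: p <= alpha means the samples differ."""
    return "negative" if p_value <= alpha else "positive"
```

For example, `ngram_index("ABC")` returns 28, matching the worked calculation in Section 3.4. A dense list is practical for 3-grams (17,576 slots) but grows as 26^n, so larger n would need a sparse structure such as a dictionary.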
Research question 2: Does the use of n-gram analysis to profile users’ command usage in their command line histories allow accurate user identification?

Study: this study will work in exactly the same way as Stage 1. The only difference is that we use data from command line histories. For one user, we have four distinct samples, which allow us to test for positive identification, while we have samples from four other users, which allow us to test for negative identification.

3.6 Stage 3 accuracy comparison

This stage will compare the two methods from the previous sections and identify which one is the most accurate and under which circumstances.

Research question 3: If the profiling of both writing styles and command usage allows accurate user profiling, which is the most accurate?

Study: this part will assess the method for both formal and natural language, and decide if either is feasible for user identification. If so, the method will be useful for intrusion detection, as it would be able to both positively and negatively identify users.

3.7 User Samples

In this part we analyze two forms of writing style: natural language and formal language.

a. Natural Language

Two famous authors were used in the natural language identification. Firstly, from William Shakespeare we take three famous plays and one collection of poems: Romeo and Juliet, Julius Caesar, Hamlet and the Sonnets. Secondly, from Jane Austen we take Emma, Pride and Prejudice, Sense and Sensibility and Mansfield Park. The figure below shows how we compare the samples:

[Figure 1: method for comparison of natural language samples. Jane Austen: Pride and Prejudice, Sense and Sensibility, Mansfield Park, Emma. William Shakespeare: Romeo and Juliet, Julius Caesar, Sonnets, Hamlet.]

Figure 1 shows how we compare both authors’ writing styles. Firstly, we look at the results for one author. We compare each of Jane Austen’s writings to each other, using 3-gram, 5-gram, 11-gram and 15-gram analyses.
We then use the t-test to measure their similarity, and if the t-test for the pair in each comparison shows they are from the same author, we have successfully performed a positive identification. Secondly, we follow the same procedure for William Shakespeare’s writings. We then compare the writing style of each of Jane Austen’s works with each of Shakespeare’s, and if the t-tests indicate they are different, we have successfully performed a negative identification.

b. Formal Language

Five users were involved in this experiment. One example of formal language is command line history, where the user interacts with the computer through a command line. By using n-gram analysis we will identify each user’s ‘writing style’, namely their command line usage habits. The figure below shows how we compare each of our formal language samples:

[Figure 2: method for comparison of formal language samples. User1: User1-history1, User1-history2, User1-history3, User1-history4. Other users: User2-history1, User3-history1, User4-history1, User5-history1.]

We will follow the same procedure for formal language as for natural language, namely we will analyze each sample and compare samples by the same user to see if we can achieve positive identification. We will then compare the samples from different users to see if we can achieve negative identification.

Chapter 4 Results

4.1 Positive Identification using Formal Language

4.1.1 Positive Identification Result of 3 Gram

a.
Success and failure of positive identification by t-test with the same participant (1 person)

Sample 1        Sample 2        Percentage normalization  Max-min normalization  Z score normalization
User1-history1  User1-history3  1                         2.27E-08               1
User1-history1  User1-history4  1                         1.86E-11               1
User1-history1  User1-history2  1                         0.194325               1
User1-history3  User1-history4  1                         9.42E-11               1
User1-history3  User1-history2  1                         7.23E-09               1
User1-history4  User1-history2  1                         1.3E-08                1

Table 1: Positive Identification result of 3-gram

As shown in the table above, percentage normalization of the n-gram counts and Z score normalization indicate in 100% of the comparisons that the samples are from the same person. However, Max-min normalization mistakenly suggests that only User1-history1 and User1-history2 are the same and that the others are different.

4.1.2 Positive Identification Result of 5 Gram

a. Success and failure of positive identification by t-test with the same participant (1 person)

Sample 1        Sample 2        Percentage normalization  Max-min normalization  Z score normalization
User1-history1  User1-history3  1                         0.00915                1
User1-history1  User1-history4  1                         1.49E-16               1
User1-history1  User1-history2  1                         0.287977               1
User1-history3  User1-history4  1                         1.17E-05               1
User1-history3  User1-history2  1                         0.004408               1
User1-history4  User1-history2  1                         2.34E-10               1

Table 2: Positive Identification result of 5-gram

As shown in the table above, percentage normalization of the n-gram counts and Z score normalization indicate in 100% of the comparisons that the samples are from the same person. However, Max-min normalization shows that only User1-history1 and User1-history2 are the same and that the others are different.

4.1.3 Positive Identification Result of 11 Gram

a.
Success and failure of positive identification by t-test with the same participant (1 person)

Sample 1        Sample 2        Percentage normalization  Max-min normalization  Z score normalization
User1-history1  User1-history3  1                         3.31E-38               1
User1-history1  User1-history4  1                         3.32E-14               1
User1-history1  User1-history2  1                         4.74E-05               1
User1-history3  User1-history4  1                         0.227777               1
User1-history3  User1-history2  1                         2.68E-06               1
User1-history4  User1-history2  1                         0.001243               1

Table 3: Positive Identification result of 11-gram

As shown in the table above, percentage normalization of the n-gram counts and Z score normalization indicate in 100% of the comparisons that the samples are from the same person. However, Max-min normalization shows that only User1-history3 and User1-history4 are the same and that the others are different.

4.1.4 Positive Identification Result of 15 Gram

a. Success and failure of positive identification by t-test with the same participant (1 person)

Sample 1        Sample 2        Percentage normalization  Max-min normalization  Z score normalization
User1-history1  User1-history3  1                         2.1E-168               1
User1-history1  User1-history4  1                         4.1E-118               1
User1-history1  User1-history2  1                         2.8E-151               1
User1-history3  User1-history4  1                         0.304123               1
User1-history3  User1-history2  1                         0.000854               1
User1-history4  User1-history2  1                         0.041853               1

Table 4: Positive Identification result of 15-gram

As shown in the table above, percentage normalization of the n-gram counts and Z score normalization indicate in 100% of the comparisons that the samples are from the same person. However, Max-min normalization shows that only User1-history3 and User1-history4 are the same and that the others are different.

4.2 Negative Identification using Formal Language

4.2.1 Negative Identification Result of 3 Gram

a.
Success and failure of negative identification by t-test with different participants

Sample 1        Sample 2        Percentage normalization  Max-min normalization  Z score normalization
User2-history1  User3-history1  1.43E-87                  0.328289               0.998179
User2-history1  User4-history1  1.2E-183                  1                      0.830453
User2-history1  User5-history1  5.3E-207                  1                      0.893169
User3-history1  User4-history1  1.19E-23                  1.3E-41                0.838947
User3-history1  User5-history1  8.04E-34                  0.862822               0.896617
User4-history1  User5-history1  0.053893                  1                      0.93775

Table 5: Negative Identification result of 3-gram

As shown in the table above, with percentage normalization of the n-gram counts all the participants are different, except for User4-history1 and User5-history1, which the t-test says are the same person. Max-min normalization claims that only User3-history1 and User4-history1 are different and that the rest are the same person. However, Z score normalization claims that all participants are the same person.

4.2.2 Negative Identification Result of 5 Gram

a. Success and failure of negative identification by t-test with different participants

Sample 1        Sample 2        Percentage normalization  Max-min normalization  Z score normalization
User2-history1  User3-history1  1                         0                      1
User2-history1  User4-history1  1                         0                      1
User2-history1  User5-history1  1                         0                      1
User3-history1  User4-history1  1                         0.003366               1
User3-history1  User5-history1  1                         7.2E-06                1
User4-history1  User5-history1  1                         0.274439               1

Table 6: Negative Identification result of 5-gram

As shown in the table above, percentage normalization of the n-gram counts and Z score normalization claim in 100% of the comparisons that the samples are from the same person. However, Max-min normalization claims that only User4-history1 and User5-history1 are the same person and that the others are different.

4.2.3 Negative Identification Result of 11 Gram

a.
Success and failure of negative identification by t-test with different participants

Sample 1        Sample 2        Percentage normalization  Max-min normalization  Z score normalization
User2-history1  User3-history1  1                         0                      1
User2-history1  User4-history1  NA                        NA                     NA
User2-history1  User5-history1  1                         0                      1
User3-history1  User4-history1  NA                        NA                     NA
User3-history1  User5-history1  1                         1.07E-08               1
User4-history1  User5-history1  NA                        NA                     NA

Table 7: Negative Identification result of 11-gram

As shown in the table above, percentage normalization of the n-gram counts and Z score normalization show that User2-history1 vs User3-history1, User2-history1 vs User5-history1 and User3-history1 vs User5-history1 are the same person, whereas the others are different. However, Max-min normalization shows that only User2-history1 vs User3-history1 and User3-history1 vs User5-history1 are the same person and that the others are different.

4.2.4 Negative Identification Result of 15 Gram

a. Success and failure of negative identification by t-test with different participants

Sample 1        Sample 2        Percentage normalization  Max-min normalization  Z score normalization
User2-history1  User3-history1  NA                        NA                     NA
User2-history1  User4-history1  NA                        NA                     NA
User2-history1  User5-history1  1                         0                      1
User3-history1  User4-history1  NA                        NA                     NA
User3-history1  User5-history1  NA                        NA                     NA
User4-history1  User5-history1  NA                        NA                     NA

Table 8: Negative Identification result of 15-gram

As shown in the table above, percentage normalization of the n-gram counts, Max-min normalization and Z score normalization show that User2-history1 vs User5-history1 are the same person, whereas the others are different.
4.2.5 User1 and others comparison for 3-gram

Sample 1        Sample 2        Percentage normalization  Max-min normalization  Z score normalization
User1-history1  User1-history3  1                         2.1E-168               1
User1-history1  User1-history4  1                         4.1E-118               1
User1-history1  User1-history2  1                         2.8E-151               1
User1-history1  User2-history1  1                         1                      3.7E-192
User1-history1  User3-history1  NA                        3.6E-211               NA
User1-history1  User4-history1  NA                        NA                     NA
User1-history1  User5-history1  NA                        NA                     NA
User1-history3  User1-history4  1                         0.304123               1
User1-history3  User1-history2  1                         0.000854               1
User1-history3  User2-history1  1                         0.31365                1
User1-history3  User3-history1  NA                        NA                     NA
User1-history3  User4-history1  NA                        NA                     NA
User1-history3  User5-history1  NA                        NA                     NA
User1-history4  User1-history2  1                         0.041853               1
User1-history4  User2-history1  1                         0.085828               1
User1-history4  User3-history1  NA                        NA                     NA
User1-history4  User4-history1  NA                        NA                     NA
User1-history4  User5-history1  NA                        NA                     NA
User1-history2  User2-history1  1                         1.46E-05               1
User1-history2  User3-history1  NA                        NA                     NA
User1-history2  User4-history1  NA                        NA                     NA
User1-history2  User5-history1  NA                        NA                     NA
User2-history1  User3-history1  NA                        NA                     NA
User2-history1  User4-history1  NA                        NA                     NA
User2-history1  User5-history1  NA                        NA                     NA
User3-history1  User4-history1  NA                        NA                     NA
User3-history1  User5-history1  NA                        NA                     NA
User4-history1  User5-history1  NA                        NA                     NA

Table 9: User1 and others comparison for 3-gram

Percentage normalization

As shown in the table, percentage normalization gives some similarities, or positive identifications, between Sample 1 and Sample 2. For example, User1-history1 vs User1-history3, User1-history1 vs User1-history4, User1-history1 vs User1-history2, User1-history3 vs User1-history4, User1-history3 vs User1-history2 and User1-history4 vs User1-history2 belong to the same person, and the analysis successfully gave positive identifications for them.
However, User1-history1 vs User2-history1, User1-history3 vs User2-history1, User1-history4 vs User2-history1 and User1-history2 vs User2-history1 are reported as the same person although they are different authors. Furthermore, for the other machine comparisons the analysis successfully gives negative identification.

Max-min normalization

Max-min normalization shows some similarities, or positive identifications, such as User1-history1 vs User2-history1, User1-history3 vs User2-history1 and User1-history4 vs User2-history1, although these are different people; only User1-history3 vs User1-history4 are the same person. Negative identification also gives unsatisfactory results: for instance, for the comparisons between the same person’s histories (User1), this normalization says they are different people. Furthermore, the other machine comparisons (different people) show successful negative identification.

Z score normalization

Z score normalization shows significant results. For instance, User1-history1 vs User1-history3, User1-history1 vs User1-history4, User1-history1 vs User1-history2, User1-history3 vs User1-history4, User1-history3 vs User1-history2 and User1-history4 vs User1-history2 are the same person, so this normalization succeeds in giving positive identification. However, this method also gives false identifications: User1-history3 vs User2-history1, User1-history4 vs User2-history1 and User1-history2 vs User2-history1 are reported as the same person although they are different people. In conclusion, this normalization succeeds in giving negative identification of different people.

4.2.6 User1 and others comparison for 5-gram

a.
Success and failure of positive and negative identification by t-test with different participants

Sample 1 Sample 2 User1-history1 User1-history3 A percentage normalization Max-min normalization Z score normalization 0.00915 1 1.49E-16 1 0.287977 1 5.34E-13 8.67E-09 1 0.078527 NA NA 1 0.002453 1.17E-05 1 0.004408 1 0.002023 3.16E-05 0.000436 1 NA NA 2.5E-05 1 2.34E-10 1 1.24E-12 3.68E-05 0.362585 1 NA NA 0.832204 1 9.47E-10 1.28E-05 0.185036 1 NA NA 0.010267 1 0.003355 2.71E-10 0.054025 NA 1 9.35E-13 NA NA 1 0.337993 NA NA 1 User1-history1 User1-history4 1 User1-history1 User1-history2 1 User1-history1 User2-history1 5.34E-13 User1-history1 User3-history1 1 User1-history1 User4-history1 NA User1-history1 User5-history1 1 User1-history3 User1-history4 1 User1-history3 User1-history2 1 User1-history3 User2-history1 6.95E-09 User1-history3 User3-history1 1 User1-history3 User4-history1 NA User1-history3 User5-history1 1 User1-history4 User1-history2 1 User1-history4 User2-history1 0.00318 User1-history4 User3-history1 1 User1-history4 User4-history1 NA User1-history4 User5-history1 1 User1-history2 User2-history1 1.27E-09 User1-history2 User3-history1 1 User1-history2 User4-history1 NA User1-history2 User5-history1 1 User2-history1 User3-history1 NA User2-history1 User4-history1 NA User2-history1 User5-history1 1 User3-history1 User4-history1 NA User3-history1 User5-history1 NA User4-history1 User5-history1 NA

Table 10: User1 and others comparison for 5-gram

Percentage normalization

As shown in the table, percentage normalization gives some similarities, or positive identifications, between Sample 1 and Sample 2. For example, User1-history1 vs User1-history3, User1-history1 vs User1-history4, User1-history1 vs User1-history2, User1-history3 vs User1-history4, User1-history3 vs User1-history2 and User1-history4 vs User1-history2 belong to the same person, and the analysis successfully gave positive identifications for them.
However, User1-history1 vs User3-history1, User1-history1 vs User5-history1, User1-history3 vs User3-history1, User1-history3 vs User5-history1, User1-history4 vs User4-history1, User1-history4 vs User5-history1, User1-history2 vs User3-history1 and User1-history2 vs User5-history1 are reported as the same person although they come from different users. The remaining cross-machine comparisons are correctly given a negative identification.

Max-min normalization

Max-min normalization does not give adequate results for either positive or negative identification. It identifies some pairs correctly, but its results are too inconsistent to be trusted.

Z score normalization

Z score normalization gives a clear result for the same-user pairs: User1-history1 vs User1-history3, User1-history1 vs User1-history4, User1-history1 vs User1-history2, User1-history3 vs User1-history4, User1-history3 vs User1-history2 and User1-history4 vs User1-history2 are all reported as the same person, so positive identification succeeds. However, this method also gives false identifications: for User1-history1 vs User3-history1, User1-history3 vs User3-history1, User1-history3 vs User5-history1, User1-history4 vs User3-history1, User1-history4 vs User5-history1, User1-history2 vs User3-history1 and User3-history1 vs User5-history1, the t-test reports the same person although the users are different. The remaining different-person pairs are correctly given a negative identification.

4.2.7 User1 and others comparison for 11-gram

a.
Success and failure of negative identification by t-test with different participants

Sample 1        Sample 2        Percentage  Max-min    Z score
User1-history1  User1-history3  1           3.31E-38   1
User1-history1  User1-history4  1           3.32E-14   1
User1-history1  User1-history2  1           4.74E-05   1
User1-history1  User2-history1  1           9.99E-41   1
User1-history1  User3-history1  NA          NA         NA
User1-history1  User4-history1  NA          NA         NA
User1-history1  User5-history1  1           4.25E-57   1
User1-history3  User1-history4  1           0.227777   1
User1-history3  User1-history2  1           2.68E-06   1
User1-history3  User2-history1  1           0.175096   1
User1-history3  User3-history1  NA          NA         NA
User1-history3  User4-history1  NA          NA         NA
User1-history3  User5-history1  1           0.001546   1
User1-history4  User1-history2  1           0.001243   1
User1-history4  User2-history1  1           0.042027   1
User1-history4  User3-history1  NA          NA         NA
User1-history4  User4-history1  NA          NA         NA
User1-history4  User5-history1  1           0.000526   1
User1-history2  User2-history1  1           7.15E-10   1
User1-history2  User3-history1  NA          NA         NA
User1-history2  User4-history1  NA          NA         NA
User1-history2  User5-history1  1           2.01E-15   1
User2-history1  User3-history1  NA          NA         NA
User2-history1  User4-history1  NA          NA         NA
User2-history1  User5-history1  1           0.002685   1
User3-history1  User4-history1  NA          NA         NA
User3-history1  User5-history1  NA          NA         NA
User4-history1  User5-history1  NA          NA         NA

Table 11: User1 and others comparison for 11-gram

Table 11, which compares User1's histories with each other and with the other users, gives results similar to those for 3-gram and 5-gram, although some cross-machine comparisons still give false identifications.

4.2.8 User1 and others comparison for 15-gram

a.
Success and failure of negative identification by t-test with different participants

Sample 1        Sample 2        Percentage  Max-min    Z score
User1-history1  User1-history3  1           2.1E-168   1
User1-history1  User1-history4  1           4.1E-118   1
User1-history1  User1-history2  1           2.8E-151   1
User1-history1  User2-history1  1           1          3.7E-192
User1-history1  User3-history1  NA          3.6E-211   NA
User1-history1  User4-history1  NA          NA         NA
User1-history1  User5-history1  NA          NA         NA
User1-history3  User1-history4  1           0.304123   1
User1-history3  User1-history2  1           0.000854   1
User1-history3  User2-history1  1           0.31365    1
User1-history3  User3-history1  NA          NA         NA
User1-history3  User4-history1  NA          NA         NA
User1-history3  User5-history1  NA          NA         NA
User1-history4  User1-history2  1           0.041853   1
User1-history4  User2-history1  1           0.085828   1
User1-history4  User3-history1  NA          NA         NA
User1-history4  User4-history1  NA          NA         NA
User1-history4  User5-history1  NA          NA         NA
User1-history2  User2-history1  1           1.46E-05   1
User1-history2  User3-history1  NA          NA         NA
User1-history2  User4-history1  NA          NA         NA
User1-history2  User5-history1  NA          NA         NA
User2-history1  User3-history1  NA          NA         NA
User2-history1  User4-history1  NA          NA         NA
User2-history1  User5-history1  NA          NA         NA
User3-history1  User4-history1  NA          NA         NA
User3-history1  User5-history1  NA          NA         NA
User4-history1  User5-history1  NA          NA         NA

Table 12: User1 and others comparison for 15-gram

Table 12, which compares User1's histories with each other and with the other users, also gives results similar to those for 3-gram and 5-gram, although some cross-machine comparisons still give false identifications.
Max-min normalization still gives insufficient results for positive and negative identification.

4.3 Summary of Formal Language

4.3.1 Positive Identification

Positive Identification (User1 command line history for different samples)

n-gram    Normalization Type  Correct Identification  Incorrect Identification  Rate Percentage
3 Gram    Percentage          6/6                     0/6                       100%
          Max-Min             1/6                     5/6                       16.6%
          Z Score             6/6                     0/6                       100%
5 Gram    Percentage          6/6                     0/6                       100%
          Max-Min             1/6                     5/6                       16.6%
          Z Score             6/6                     0/6                       100%
11 Gram   Percentage          6/6                     0/6                       100%
          Max-Min             1/6                     5/6                       16.6%
          Z Score             6/6                     0/6                       100%
15 Gram   Percentage          6/6                     0/6                       100%
          Max-Min             1/6                     5/6                       16.6%
          Z Score             6/6                     0/6                       100%

Table 13: Positive Identification summary

The table above is the positive identification summary for formal language, covering all six possible pairings of User1's samples and the four n-gram lengths, calculated for each of the three normalisation methods.

For all four n-gram lengths, the Percentage and Z Score normalisation methods correctly identify that the user is the same in every case. The Max-Min normalisation method, however, fails to identify that the user is the same in all but one case for each n-gram length.

These results suggest that positive identification can be reliably achieved using the n-gram analysis method for formal language with either the Percentage or the Z Score normalisation method. They also indicate that the Max-Min normalisation method is not useful for positive identification in formal language samples.
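As a rough illustration of how the entries in Table 13 can be tallied from the t-test output, the sketch below counts how many same-user pairs are accepted as the same person. It assumes that a pair is treated as the same user when the p-value exceeds a significance level of 0.05; that threshold and the helper names are assumptions made for illustration and are not stated in this section. The p-values are the Max-min column for the six User1 pairs in Table 11.

```python
# Illustrative sketch only: ALPHA and the function names are assumptions,
# not taken from the thesis software.
ALPHA = 0.05  # assumed significance level for the paired t-test


def same_user(p_value, alpha=ALPHA):
    """Treat a pair as the same user when no significant difference is found."""
    return p_value > alpha


def positive_identification_rate(p_values, alpha=ALPHA):
    """Return (correct, total, percent) for pairs known to come from one user."""
    correct = sum(1 for p in p_values if same_user(p, alpha))
    total = len(p_values)
    return correct, total, 100.0 * correct / total


# Max-min p-values for the six User1 pairs, copied from Table 11 (11-gram).
max_min_11gram = [3.31e-38, 3.32e-14, 4.74e-05, 0.227777, 2.68e-06, 0.001243]
# Percentage and Z score both report p = 1 for every User1 pair in Table 11.
percentage_11gram = [1.0] * 6

for name, values in [("Max-min", max_min_11gram), ("Percentage", percentage_11gram)]:
    correct, total, rate = positive_identification_rate(values)
    print(f"{name}: {correct}/{total} = {rate:.1f}%")
```

Under the assumed threshold, only User1-history3 vs User1-history4 (p = 0.227777) is accepted for Max-min, which agrees with the 1/6 entry reported for 11-gram in Table 13, while Percentage gives 6/6.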
4.3.2 Negative Identification

Negative Identification: User1 vs (user2 vs (user3, user4, user5))

n-gram    Normalization Type  Correct Identification  Incorrect Identification  Rate Percentage
3 Gram    Percentage          23/28                   5/28                      82.14
          Max-Min             20/28                   8/28                      71.43
          Z Score             19/28                   9/28                      67.86
5 Gram    Percentage          13/28                   15/28                     46.43
          Max-Min             17/28                   11/28                     60.71
          Z Score             14/28                   14/28                     50.00
11 Gram   Percentage          16/28                   12/28                     57.14
          Max-Min             26/28                   2/28                      92.86
          Z Score             16/28                   12/28                     57.14
15 Gram   Percentage          23/28                   5/28                      82.14
          Max-Min             24/28                   4/28                      85.71
          Z Score             24/28                   4/28                      85.71

Table 14: Negative Identification summary

The table above is the negative identification summary for formal language, covering all possible pairings of samples and the four n-gram lengths, calculated for each of the three normalisation methods.

The results are less clear than those observed in the positive identification table. The Max-Min normalisation method is correct between 60.71% and 92.86% of the time, an improvement over its use in positive identification. The other two normalisation methods were not as reliable as in the positive identification tests.

Chapter 5 Results

5.1 Positive identification using natural language

5.1.1 Positive Identification Result of 3 Gram

a. Success and failure of positive identification by t-test with the same participant, Jane Austen (1 person)

Novel 1                Novel 2                Percentage  Max-min    Z score
Pride and Prejudice    Sense and Sensibility  1           2.52E-22   1
Pride and Prejudice    Mansfield Park         1           0.170681   1
Pride and Prejudice    Emma                   1           1E-18      1
Mansfield Park         Sense and Sensibility  1           0.533889   1
Mansfield Park         Mansfield Park         1           1.49E-24   1
Sense and Sensibility  Mansfield Park         1           1.04E-39   1

Table 15: Positive Identification result of 3-gram

As shown in the table above, percentage normalization (normalization by percentage of n-gram count) and Z score normalization identify the same person for 100% of the pairs.
However, Max-min normalisation reports only Pride and Prejudice vs Mansfield Park, Pride and Prejudice vs Emma and Emma vs Mansfield Park as the same person, and reports the other pairs as different.

5.1.2 Positive Identification Result of 5 Gram

a. Success and failure of positive identification by t-test with the same participant, Jane Austen (1 person)

Novel 1                Novel 2                Percentage  Max-min    Z score
Pride and Prejudice    Sense and Sensibility  1           0          1
Pride and Prejudice    Mansfield Park         1           4.4E-146   1
Pride and Prejudice    Emma                   1           2.8E-108   1
Mansfield Park         Sense and Sensibility  1           1E-175     1
Mansfield Park         Mansfield Park         1           8.24E-05   1
Sense and Sensibility  Mansfield Park         1           1.2E-164   1

Table 16: Positive Identification result of 5-gram

As shown in the table above, percentage normalization and Z score normalization identify the same person for 100% of the pairs. However, Max-min normalisation reports only Pride and Prejudice vs Sense and Sensibility as the same person; the other pairs are reported as different.

5.1.3 Positive Identification Result of 11 Gram

a. Success and failure of positive identification by t-test with the same participant, Jane Austen (1 person)

Novel 1                Novel 2                Percentage  Max-min    Z score
Pride and Prejudice    Sense and Sensibility  1           0          1
Pride and Prejudice    Mansfield Park         1           0          1
Pride and Prejudice    Emma                   1           0          1
Mansfield Park         Sense and Sensibility  1           9.55E-71   1
Mansfield Park         Mansfield Park         1           2E-99      1
Sense and Sensibility  Mansfield Park         1           0.035547   1

Table 17: Positive Identification result of 11-gram

As shown in the table above, percentage normalization and Z score normalization identify the same person for 100% of the pairs. However, Max-min normalisation reports Pride and Prejudice vs Sense and Sensibility, Pride and Prejudice vs Mansfield Park and Pride and Prejudice vs Emma as the same person, and reports the other pairs as different people.

5.1.4 Positive Identification Result of 15 Gram

a.
Success and failure of positive identification by t-test with the same participant, Jane Austen (1 person)

Novel 1                Novel 2                Percentage  Max-min    Z score
Pride and Prejudice    Sense and Sensibility  1           0          1
Pride and Prejudice    Mansfield Park         1           0          1
Pride and Prejudice    Emma                   1           0          1
Mansfield Park         Sense and Sensibility  1           5.16E-29   1
Mansfield Park         Mansfield Park         1           1.75E-20   1
Sense and Sensibility  Mansfield Park         1           1.58E-119  1

Table 18: Positive Identification result of 15-gram

As shown in the table above, percentage normalization and Z score normalization identify the same person for 100% of the pairs. However, Max-min normalisation reports Pride and Prejudice vs Sense and Sensibility, Pride and Prejudice vs Mansfield Park and Pride and Prejudice vs Emma as the same person, and reports the other pairs as different people.

5.2 Negative Identification Using Natural Language

5.2.1 Negative Identification Result of 3 Gram

a. Success and failure of negative identification by t-test with different participants, William Shakespeare and Jane Austen (2 persons)

Novel 1                Novel 2                Percentage  Max-min    Z score
Julius Caesar          Romeo and Juliet       1           4.21E-08   1
Julius Caesar          Hamlet                 1           1.5E-101   1
Julius Caesar          Sonnet                 1           7.5E-108   1
Julius Caesar          Sense and Sensibility  1           2.2E-142   1
Julius Caesar          Pride and Prejudice    1           2.1E-108   1
Julius Caesar          Mansfield Park         1           9.12E-62   1
Julius Caesar          Emma                   1           3E-104     1
Romeo and Juliet       Hamlet                 1           1.4E-120   1
Romeo and Juliet       Sonnet                 1           8.5E-147   1
Romeo and Juliet       Sense and Sensibility  1           3.2E-112   1
Romeo and Juliet       Pride and Prejudice    1           1.06E-63   1
Romeo and Juliet       Mansfield Park         1           8.4E-126   1
Romeo and Juliet       Emma                   1           1.95E-71   1
Hamlet                 Sonnet                 1           4.1E-17    1
Hamlet                 Sense and Sensibility  1           0.000987   1
Hamlet                 Pride and Prejudice    1           2.05E-26   1
Hamlet                 Mansfield Park         1           3.19E-09   1
Hamlet                 Emma                   1           3.8E-116   1
Sonnet                 Sense and Sensibility  1           1.6E-71    1
Sonnet                 Pride and Prejudice    1           5.77E-31   1
Sonnet                 Mansfield Park         1           2.98E-94   1
Sonnet                 Emma                   1           8.37E-48   1
Sense and Sensibility  Pride and Prejudice    1           6.07E-87   1
Sense and Sensibility  Mansfield Park         1           9.96E-05   1
Sense and Sensibility  Emma                   1           1.85E-22   1
Pride and Prejudice    Mansfield Park         1           4.71E-25   1
Pride and Prejudice    Emma                   1           1.17E-64   1
Mansfield Park         Emma                   1           1.17E-64   1

Table 19: Negative Identification result of 3-gram

As shown in the table above, percentage normalization and Z score normalization report every pair as the same person, including the pairs written by different authors. However, Max-min normalisation shows that the two participants are different.

5.2.2 Negative Identification Result of 5 Gram

a. Success and failure of negative identification by t-test with different participants, William Shakespeare and Jane Austen (2 persons)

Novel 1                Novel 2                Percentage  Max-min    Z score
Julius Caesar          Romeo and Juliet       1           2.37E-87   1
Julius Caesar          Hamlet                 1           3.2E-162   1
Julius Caesar          Sonnet                 1           1.5E-226   1
Julius Caesar          Sense and Sensibility  1           0.00591    1
Julius Caesar          Pride and Prejudice    1           2.34E-22   1
Julius Caesar          Mansfield Park         1           0.000294   1
Julius Caesar          Emma                   1           2.07E-22   1
Romeo and Juliet       Hamlet                 1           0.000195   1
Romeo and Juliet       Sonnet                 1           8.91E-47   1
Romeo and Juliet       Sense and Sensibility  1           2.82E-49   1
Romeo and Juliet       Pride and Prejudice    1           1.15E-18   1
Romeo and Juliet       Mansfield Park         1           2.16E-54   1
Romeo and Juliet       Emma                   1           2.52E-24   1
Hamlet                 Sonnet                 1           1.87E-31   1
Hamlet                 Sense and Sensibility  1           8.93E-88   1
Hamlet                 Pride and Prejudice    1           3.05E-36   1
Hamlet                 Mansfield Park         1           2.42E-88   1
Hamlet                 Emma                   1           3.26E-47   1
Sonnet                 Sense and Sensibility  1           1.2E-161   1
Sonnet                 Pride and Prejudice    1           9.8E-108   1
Sonnet                 Mansfield Park         1           2.7E-164   1
Sonnet                 Emma                   1           1.7E-122   1
Sense and Sensibility  Pride and Prejudice    1           1.49E-09   1
Sense and Sensibility  Mansfield Park         1           0.614744   1
Sense and Sensibility  Emma                   1           3.64E-08   1
Pride and Prejudice    Mansfield Park         1           3.37E-30   1
Pride and Prejudice    Emma                   1           0.045969   1
Mansfield Park         Emma                   1           8.66E-29   1

Table 20: Negative Identification result of 5-gram

As shown in the table above, percentage normalization and Z score normalization report every pair as the same person, including the pairs written by different authors. However, Max-min normalisation reports only Sense and Sensibility vs Mansfield Park as the same, and reports the others as different.

5.2.3 Negative Identification Result of 11 Gram

a. Success and failure of negative identification by t-test with different participants, William Shakespeare and Jane Austen (2 persons)

Novel 1                Novel 2                Percentage  Max-min      Z score
Julius Caesar          Romeo and Juliet       1           NA           1
Julius Caesar          Hamlet                 1           0            1
Julius Caesar          Sonnet                 1           3.30392E-39  1
Julius Caesar          Sense and Sensibility  1           1.18605E-32  1
Julius Caesar          Pride and Prejudice    1           7.98933E-07  1
Julius Caesar          Mansfield Park         1           9.62269E-21  1
Julius Caesar          Emma                   1           3.26764E-06  1
Romeo and Juliet       Hamlet                 1           0            1
Romeo and Juliet       Sonnet                 1           3.30392E-39  1
Romeo and Juliet       Sense and Sensibility  1           1.18605E-32  1
Romeo and Juliet       Pride and Prejudice    1           7.98933E-07  1
Romeo and Juliet       Mansfield Park         1           9.62269E-21  1
Romeo and Juliet       Emma                   1           3.26764E-06  1
Hamlet                 Sonnet                 1           1.0359E-21   1
Hamlet                 Sense and Sensibility  1           9.87798E-17  1
Hamlet                 Pride and Prejudice    1           0.236207615  1
Hamlet                 Mansfield Park         1           4.82439E-05  1
Hamlet                 Emma                   1           0.000530174  1
Sonnet                 Sense and Sensibility  1           0.289320395  1
Sonnet                 Pride and Prejudice    1           6.93376E-22  1
Sonnet                 Mansfield Park         1           3.68687E-10  1
Sonnet                 Emma                   1           2.30855E-28  1
Sense and Sensibility  Pride and Prejudice    1           3.93646E-16  1
Sense and Sensibility  Mansfield Park         1           9.64117E-07  1
Sense and Sensibility  Emma                   1           1.72025E-21  1
Pride and Prejudice    Mansfield Park         1           2.04396E-08  1
Pride and Prejudice    Emma                   1           0.088959355  1
Mansfield Park         Emma                   1           5.5606E-14   1

Table 21: Negative Identification result of 11-gram

As shown in the table above, percentage normalization and Z score normalization report every pair as the same person, including the pairs written by different authors.
However, Max-min normalisation reports five comparisons as the same: Pride and Prejudice vs Emma (same author), Sonnet vs Sense and Sensibility (different authors), Hamlet vs Pride and Prejudice (different authors), Romeo and Juliet vs Emma (different authors) and Julius Caesar vs Hamlet (same author). The others are reported as different.

5.2.4 Negative Identification Result of 15 Gram

a. Success and failure of negative identification by t-test with different participants, William Shakespeare and Jane Austen (2 persons)

Novel 2 Novel 2 Z score normalization Max-min normalization Julius Caesar Romeo and Juliet 1 3.7E-246 1 Julius Caesar Hamlet 1 4.02E-09 1 Julius Caesar Sonnet 0 6.35E-45 1 Julius Caesar Sense and Sensibility 1 1.84E-37 1 Julius Caesar Pride and Prejudice 1 1.36E-37 1 Julius Caesar Mansfield Park 1 1.88E-19 1 Julius Caesar Emma 1 8.79E-31 1 Romeo and Juliet Hamlet 1 4.7E-117 1 Romeo and Juliet Sonnet 0 2.4E-102 1 Romeo and Juliet Sense and Sensibility 1 2.1E-109 1 Romeo and Juliet Pride and Prejudice 1 2.4E-102 1 Romeo and Juliet Mansfield Park 1 2.1E-109 1 Romeo and Juliet Emma 1 9.85E-55 1 Hamlet Sonnet 1 NA 1 Hamlet Sense and Sensibility 1 0.040804 1 Hamlet Pride and Prejudice 1 0.084871 1 Hamlet Mansfield Park 1 0.263444 1 Hamlet Emma 1 0.145122 NA 1 Sonnet Sense and Sensibility 1.87E-11 Z score normalization 1 NA Sonnet Pride and Prejudice 1.46E-12 1 NA Sonnet Mansfield Park 5.69E-20 1 NA Sonnet Emma Sense and Sensibility Pride and Prejudice 2.03E-07 1 0.099914 1 1 Sense and Sensibility Mansfield Park 1 0.566338 1 Sense and Sensibility Emma 1 0.330546 1 Pride and Prejudice Mansfield Park 1 0.860759 1 Pride and Prejudice Emma 1 0.860759 1 Mansfield Park Emma 1 0.523756 1

Table 22: Negative Identification result of 15-gram

Percentage normalization

For 15-gram, percentage normalization shows that Sonnet vs Sense and Sensibility, Pride and Prejudice, Mansfield Park and Emma are different, and that the other pairs are the same.
Max-min normalization

For 15-gram, max-min normalization reports several pairs as similar: Hamlet vs Pride and Prejudice (different authors), Hamlet vs Mansfield Park (different authors), Hamlet vs Emma (different authors), Sense and Sensibility vs Pride and Prejudice (same author), Sense and Sensibility vs Mansfield Park (same author), Sense and Sensibility vs Emma (same author), Pride and Prejudice vs Mansfield Park (same author), Pride and Prejudice vs Emma (same author) and Mansfield Park vs Emma (same author). The other pairs are reported as different.

Z score normalization

For 15-gram, Z score normalisation reports some comparisons as similar and five comparisons as different.

5.3 Summary of Natural Language

5.3.1 Positive Identification

n-gram    Normalization Type  Positive Result            Negative Result           Rate Percentage
                              (Correct Identification)   (False Identification)
3 Gram    Percentage          18/18                      0/18                      100%
          Max-Min             2/18                       16/18                     11.11%
          Z Score             18/18                      0/18                      100%
5 Gram    Percentage          18/18                      0/18                      100%
          Max-Min             2/18                       16/18                     11.11%
          Z Score             18/18                      0/18                      100%
11 Gram   Percentage          18/18                      0/18                      100%
          Max-Min             6/18                       12/18                     33.33%
          Z Score             18/18                      0/18                      100%
15 Gram   Percentage          18/18                      0/18                      100%
          Max-Min             9/18                       9/18                      50%
          Z Score             18/18                      0/18                      100%

Table 23: Positive Identification summary

The table above is the positive identification summary for natural language, covering all possible pairings of same-author samples and the four n-gram lengths, calculated for each of the three normalisation methods.

First, for 3-gram, percentage normalization and Z score normalization achieve a 100% success rate. Max-min, however, achieves only 11.11% for the pairwise comparisons of works by the same author, which means that Max-min normalization fails at positive identification. Second, for 5-gram, 11-gram and 15-gram, percentage normalization and Z score normalization again achieve 100% success for positive identification. Max-min, on the other hand, gives a different result for each n-gram length.
For instance, 5-gram gives 11.11%, the same as 3-gram, while 11-gram gives 33.33% and 15-gram gives 50%.

5.3.2 Negative Identification

n-gram    Normalization Type  Positive Result            Negative Result           Rate Percentage
                              (Correct Identification)   (False Identification)
3 Gram    Percentage          0/16                       16/16                     0%
          Max-Min             16/16                      0/16                      100%
          Z Score             0/16                       16/16                     0%
5 Gram    Percentage          0/16                       16/16                     0%
          Max-Min             16/16                      0/16                      100%
          Z Score             0/16                       16/16                     0%
11 Gram   Percentage          0/16                       16/16                     0%
          Max-Min             16/16                      0/16                      100%
          Z Score             0/16                       16/16                     0%
15 Gram   Percentage          0/16                       16/16                     0%
          Max-Min             2/16                       14/16                     12.5%
          Z Score             0/16                       16/16                     0%

Table 24: Negative Identification summary

The table above is the negative identification summary for natural language, covering all possible pairings of different-author samples and the four n-gram lengths, calculated for each of the three normalisation methods.

Table 24 shows an unsatisfactory result for every n-gram length: with percentage and Z score normalization, negative identification for natural language fails for user identification. Max-min normalization, by contrast, achieves 100% correct negative identification for 3-gram, 5-gram and 11-gram. However, we cannot trust max-min normalization, since its results are inconsistent in both formal language and natural language.

Chapter 6 Conclusion and Future Work

6.1 Conclusion

In this thesis, we investigated users' writing styles with the aim of identifying users both positively and negatively. We investigated formal language and natural language using the n-gram methodology. There were five participants for formal language and two famous writers for natural language. We compared the results of the n-gram analyses for each participant and assessed how successful each comparison was using the t-test for paired two samples for means. The results show that formal language can identify users in terms of both positive and negative identification.
However, for natural language, the n-gram analysis is successful for positive identification but not for negative identification. Thus, formal language is shown to be more generally accurate.

Research question 1: Does the use of n-gram analysis to profile users' command usage in their command line histories allow accurate user identification?

a. Positive Identification of Formal Language

Normalization Type  Success Total
Percentage          100%
Max-min             16.66%
z Score             100%

Table 25: Success total

Positive identification succeeds in identifying the same user with percentage and z score normalization, both of which show 100% success, but not with max-min normalization, which shows only 16.66%.

b. Negative Identification of Formal Language

Normalization Type  Success Total
Percentage          66.96
Max-min             77.67
z Score             65.17

Table 26: Success total

Negative identification succeeds in identifying different users with every type of normalization: percentage gives 66.96%, max-min 77.67% and z score 65.17%. Negative identification therefore also shows a satisfactory result, since each type of normalization succeeds in identifying different users, even though the success rates differ.

Research question 2: Does the use of n-gram analysis to profile users' writing styles in a natural language situation allow accurate user identification?

a. Positive Identification of Natural Language

Normalization Type  Success Total
Percentage          100%
Max-min             26.38%
z Score             100%

Table 27: Success total

Positive identification succeeds in identifying the same user with percentage and z score normalization, both of which show 100% success, but not with max-min normalization, which shows only 26.38%.

b.
Negative Identification of Natural Language

Normalization Type  Success Total
Percentage          0%
Max-min             100.00%
z Score             0%

Table 28: Success total

Negative identification is unsuccessful with percentage and z score normalization; both show 0% success in identifying different users. Max-min normalization does succeed at negative identification, but we cannot trust it because this type of normalization gives unreliable answers.

Research question 3: If the profiling of both writing styles and command usage allows accurate user profiling, which is the most accurate?

According to the results of the investigation, formal language gives the more accurate user profiling, because it succeeds in both positive and negative user identification.

6.2 Future Work

The main objective of this research is to identify a user's writing style in terms of formal language and natural language, especially for user reauthentication. This research used n-gram methods to assess users in two writing styles. The investigation used software created by a previous researcher for counting n-grams. The software stores the n-gram counts from each participant's history file in an Excel file. We then use Excel to create an n-gram spectrum, and from the n-gram spectra we compare the participants to see the similarities and differences in their writing styles.

Further experiments are needed to continue the investigation, as follows. First, for formal language, we could divide the data by period of time, for instance per month or per week, rather than comparing different machines in different workplaces, because working in the office and working at home could differ psychologically.
Second, we should try other n-gram lengths, such as 1, 2, 4, 6, 7, 8, 9, 10, 12 and 13, since every n-gram length appears to give a different result, and another length could give a more accurate result for formal and natural language. Third, we can assume that one person may have more than one writing style; we could collect a user's writing styles and match them against each other user's writing styles. Finally, each pair of participants should be compared in both directions. For example, after comparing person A's writing style to person B's, we should swap the order and compare B to A as well, and then compare the two results.

6.3 Summary

In summary, this thesis has shown that the use of n-gram analysis for identifying users in a reauthentication situation is feasible and that further research in this area will deliver additional results.

Appendix A