School of Computer and Information Science
CIS Minor Thesis
Profiling and Identifying Individual Users by Their Command Line Usage and
Writing Style
Darusalam
Date:
Supervisor: AsPr Helen Ashman
Abstract
Profiling and identifying individual users is an approach to help recognize intrusions in
a computer system. User profiles are important in many applications since they record
highly user-specific information - profiles are basically built to record information about
users or for users to share experiences with each other. Thus, user profiles are used to
aggregate relevant information about a user’s activities, and to identify patterns in their
behavior. In a computer security situation, anomaly detection can be performed over
these user profiles to reauthenticate the user as they interact with the computer.
This thesis extends previous research on reauthenticating users with their user profiles.
Reauthentication involves continually checking that a user is who they claim to be, and
not someone masquerading as a legitimate user. This thesis focuses on the potential to
add psychometric user characteristics into the user model so as to be able to detect
unauthorized users who may be masquerading as a genuine user. The specific
characteristics investigated here are a user’s habitual command line usage, a formal
language style, and a user’s prose writing style, a natural language style. This thesis
analyses these two writing styles to determine if they can be used to identify a user as
part of a reauthentication process. It also aims to determine if either is a more accurate
method for identifying users than the other.
There are five participants involved in the investigation for formal language user
identification. We will analyze their formal language usage style in the form of their
command line habits, comparing sequences of commands, and where these are similar
enough we can positively identify the user. However if the result is clearly dissimilar we
can negatively identify the user, that is, show they are different users. Additionally, we
analyze the natural language of two famous writers, Jane Austen & William
Shakespeare, in their written works to determine if the same principles can be applied
to natural language use.
This thesis uses the n-gram analysis method for characterising each user’s style, and can
potentially provide accurate user identification. As a result, n-gram analysis of a user's
typed inputs offers another method for intrusion detection as it may be able to both
positively and negatively identify users. The contribution of this research is to assess
the use of a user’s writing styles in both formal language and natural language as a user
profile characteristic that could enable intrusion detection where intruders masquerade
as real users.
Contents
Abstract
Introduction
1.1 Scope and Limits
1.2 Motivation
1.3 Field of Thesis
1.4 Research Questions
1.5 Contribution
Chapter 2
Literature Review
2.1 Computer science profiling in online social networks
2.2 n-gram analysis method
2.3 Anomaly Detection
Chapter 3
Methodology
3.1 Experimental Set Up
3.2 n-gram analysis
3.3 T-Test
3.4 Stage 1 profiling using natural language
3.5 Stage 2 profiling Using formal language
3.6 Stage the accurate comparison result
3.7 User Samples
Chapter 4
Results
4.1 Positive Identification using Formal Language
4.1.1 Positive Identification Result of 3 Gram
4.1.2 Positive Identification Result of 5 Gram
4.1.3 Positive Identification Result of 11 Gram
4.1.4 Positive Identification Result of 15 Gram
4.2 Negative Identification using Formal Language
4.2.1 Negative Identification Result of 3 Gram
4.2.2 Negative Identification Result of 5 Gram
4.2.3 Negative Identification Result of 11 Gram
4.2.4 Negative Identification Result of 15 Gram
4.2.5 User1 and others comparison for 3-gram
4.2.6 User1 and others comparison for 5-gram
4.2.7 User1 and others comparison for 11-gram
4.2.8 User1 and others comparison for 15-gram
4.3 Summary of Formal Language
4.3.1 Positive Identification
4.3.2 Negative Identification
Chapter 5
Results
5.1.1 Positive Identification Result of 3 Gram
5.1.2 Positive Identification Result of 5 Gram
5.1.3 Positive Identification Result of 11 Gram
5.1.4 Positive Identification Result of 15 Gram
5.2 Negative Identification Using Natural Language
5.2.1 Negative Identification Result of 3 Gram
5.2.2 Negative Identification Result of 5 Gram
5.2.3 Negative Identification Result of 11 Gram
5.2.4 Negative Identification Result of 15 Gram
5.3 Summary of Natural Language
5.3.1 Positive Identification
5.3.2 Negative Identification
Chapter 6
Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
Appendix A
Chapter I
Introduction
Profiling is a way of grouping things or individuals into categories based on
characteristics such as situation, appearance and traits (N.P.Dau, Rau & J.Templeton 2000). In
this thesis, profiling means gathering information about a user’s activities; anomaly
detection can then be performed over the resulting user profile to allow user identification.
This thesis investigates psychometric user characteristics as a means of identifying a user.
There are many examples of profiling in the area of information technology (N.P.Dau, Rau &
J.Templeton 2000), such as profiling users to learn about their computer usage patterns. As
a result, system resources can be used and distributed more efficiently and better services
can be offered. In other areas, profiling is used for user identification in
Internet-based commerce.
Inter-social network functionalities and operations are necessary for several activities such
as profiling and identifying user behavior (Raad, Chbeir & Dipanda 2010). User profiles are
used to collect relevant information about a user’s activities and anomaly detection is
performed over these user profiles. Furthermore, user profiles are important in a range of
applications, due to being able to record and compare against user-specific information
(Ashman & Holland). According to Ashman and Holland, user profiles are a good method for
reauthentication and intrusion detection.
This thesis is organized as follows. Chapter I gives a brief introduction, motivating the
work and stating the research questions to be answered. Chapter 2 comprises a literature
review (background and related work). Chapter 3 describes the methods used in this
thesis. Chapters 4 and 5 discuss the results of the investigation, with Chapter 4 looking at the
results from formal language analysis while Chapter 5 looks at the results from natural
language analysis. Finally, Chapter 6 concludes and discusses future work.
1.1 Scope and Limits
This thesis focuses on evaluating the potential of two psychometric user characteristics,
namely writing style in both natural language (the works of Jane Austen and William
Shakespeare) and formal language (command line histories). To evaluate these characteristics,
this research uses the n-gram analysis method and aims to identify users in two ways:
positive user identification and negative user identification. The work does not
implement these user characteristics in an intrusion detection system; rather, it
establishes whether they can be used in such a system.
1.2 Motivation
Profiling and identifying users can help recognize intrusions. According to Pannell and
Ashman (2010), user profiling is already a necessary part of the personalization of
information delivery. One approach for identifying attacks on a computer system is
profiling program and user behaviors (Wei, Xiaohong & Xiangliang 2004). Anomaly detection
over a user profile can detect when an intruder is masquerading as a genuine user.
This thesis extends previous research that implemented an intrusion detection system based
on biometric characteristics, such as keystroke analysis and mouse use, and psychometric
characteristics, such as user prose style and favorite web pages (Pannell and Ashman 2010).
However, this thesis focuses on one potential psychometric user characteristic and
considers whether a user’s writing styles in two different scenarios can be assessed with
n-gram analysis in order to identify the user. Users’ writings may take the form of text in
novels, books, blogs, tweets and emails, and this is a form of natural language. On the other
hand, users’ writings also occur in the way they interact with computers, issuing commands
through a command line interface, and this is a form of formal language. This research
performs exactly the same analysis on data of both types, and determines firstly whether
either form can be used successfully for user identification and, if so, which is the more
effective of the two.
This research will analyze the two different forms of data in two ways, firstly to check
whether it can detect when the current user does not match the user profile and is hence an
intruder – this is negative identification. The second way is to detect whether the current
user can unquestionably be verified as the true user – this is positive identification. Most
intrusion detection systems assume that the user is genuine until anomalies or broken rules
show otherwise, that is, they only make use of negative identification. However it might be
useful to constrain a user’s activities until they positively identify themselves, perhaps not
allowing the user to make significant changes until their current login session has been
positively identified.
In this research, the default position will be that the system has no evidence about the
user’s identity, other than the fact that the user managed to log in. Analyzing their activity
after logging in should either give positive information that correlates strongly with the
user’s profile and confirms their identity, or it should mismatch the profile and the user
would then be rejected from the system.
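The default position described above can be illustrated with a minimal sketch. It assumes that some analysis has already produced a similarity score between 0 and 1 for the current session against the stored profile. The function name `classify_session` and both threshold values are hypothetical and are not taken from this thesis; they only show how positive identification, negative identification and the "no evidence yet" state could be kept separate.

```python
from enum import Enum


class Decision(Enum):
    POSITIVE = "positive identification"
    NEGATIVE = "negative identification"
    UNDECIDED = "no evidence yet"


def classify_session(similarity: float,
                     accept_at: float = 0.8,
                     reject_at: float = 0.3) -> Decision:
    """Map a profile-similarity score in [0, 1] to a three-way decision.

    Both thresholds are illustrative assumptions, not values from the thesis.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    if similarity >= accept_at:
        return Decision.POSITIVE   # strong match: identity confirmed
    if similarity <= reject_at:
        return Decision.NEGATIVE   # clear mismatch: treat as an intruder
    return Decision.UNDECIDED      # keep the session constrained for now
```

Under these assumed thresholds, a session would stay constrained while the score lies between the two bounds, which mirrors the idea of limiting a user’s activities until they have been positively identified.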
1.3 Field of Thesis
Computer science; Data Mining; n-gram Analysis;
1.4 Research Questions
The research will answer the following questions:
Q1: Does the use of n-gram analysis to profile users’ command usage in their command line
histories allow accurate user identification?
a) if so, does it allow both positive and negative identification?
Q2: Does the use of n-gram analysis to profile users’ writing styles in their natural language
allow accurate user identification?
a) if so, does it allow both positive and negative identification?
Q3: If the profiling of both natural language writing styles and command usage allows
accurate user profiling, which is the more accurate?
1.5 Contribution
This research will contribute the following knowledge:
• Proposing and assessing the usefulness of two psychometric characteristics for a
user profile in an intrusion detection system (IDS) - specifically, comparing formal
and natural language psychometric characteristics using n-gram analysis
• Distinguishing between positive and negative identification, and showing whether
this distinction is practical.
Chapter 2
Literature Review
This chapter discusses previous research into several core aspects of this minor
thesis. It reviews similar research involving n-gram analysis and additional literature
that supports features of the project. n-gram analysis is one of the methods used in the
project, and the papers reviewed here demonstrate how n-gram analysis has successfully
profiled both features and users in other application areas.
This chapter is organized as follows. Section 2.1 reviews the literature on profiling users,
such as in social networks. Section 2.2 looks at the n-gram analysis method and how it is
implemented in many applications. Section 2.3 investigates anomaly detection in some
applications.
2.1 Computer science profiling in online social networks
Much computer science research uses social networks for user profiling.
Social networking is one of the applications that engages users to be more active and
permits them to create and maintain their own web pages (Maia et al. 2008). According to
Vosecky et al. (2009), different social networks display and store user profile
information on a user’s web profile in different ways. Social networks have become one of
the applications used to identify user profiles.
This research continues previous research by Ashman and Holland (in draft). They examined
the use of anomaly detection over a user model to identify users. They classified user model
characteristics into two classes. Firstly, there are behavioral biometrics, which represent a
user’s physical characteristics, for example the ability to use a mouse, habitual miskeying
errors and typing speed. Secondly, there are psychometrics, which represent a user’s personal
preferences or decision-based characteristics, for example prose style and favored web
pages. In addition, n-gram analysis is used to identify users over their command line
histories. This research focuses on psychometric user characteristics.
Other work uses social networking to identify user behavior (Maia et al. 2008).
Online social networking lets users interact with other users. For example, they can
upload and view content, rank favorites, choose friends, subscribe to users and carry out
other activities. The authors propose a methodology for characterizing and identifying
users’ behaviors. YouTube is one of the social networking sites they use, and they apply a
clustering algorithm to identify relevant user behaviors. The authors discuss current
Web 2.0 services, in which users have different ways of interacting with other users.
As future work, they plan to investigate user behavior in order to show appropriate,
personalized advertisements, to define the different classes of user behavior and to
provide valid performance models for their services. The research takes a similar approach
to identifying user behaviors on online social networks, but it aims to classify users
rather than to identify them.
Pannell and Ashman (2010a) evaluate an intrusion detection system (IDS) that analyses a
user’s activities, with the aim of helping the administrator to identify and quickly respond
to an intrusion. The authors identify that anomaly detection may be host-based, so that an
individual user can be profiled by their normal usage patterns. The authors planned to
examine further characteristics, such as writing style analysis; the research in this thesis
takes up this future work and aims to enable intrusion detection by author identification.
Other work discusses the use of behavioral biometrics for intrusion detection applications
(Ahmed & Traore 2005). The authors outline a new method for user profiling based on
biometrics. According to the authors, this method gives more accurate information
than traditional statistical profiling methods, because it is based on a user’s
biological characteristics. They also describe two types of biometrics: behavioral
biometrics and physiological biometrics. The aim of the research is to detect the
misuse of user identification.
Another work investigates the use of biometrics on smart cards (Alexandre 1997). The
weakness of passwords for access control and authentication motivates the work, and the
researcher aims to address the problem using biometrics for user identification, such as
fingerprints, handwritten signatures and voice recognition. The paper proposes a new
method of biometric identification based on keyboard signature. The reliability and
simplicity of the approach make it suitable for smart card applications. A neural network
with supervised and self-organizing methods is applied to estimate the efficiency and
performance. However, the paper does not report the results of the approach due to a lack
of investigation, nor does it mention future work.
Pepyne, Hu & Gong (2004) analyse user profiling for computer security for particular
users such as insurance adjusters and bank tellers. They use logistic regression modeling
and queuing theory to profile user behavior in computer usage. The aim of their work is to
profile a particular user who habitually uses the computer in a regular and consistent way.
The paper also demonstrates the use of logistic regression modeling, queuing theory and
profiling for misuse and intrusion detection for such users. The result is that the way
these users use the computer could be effective for detecting misuse and intrusion. The
limitation of the system is that the approach does not support real-time detection, since
the system can only make a decision once it has finished. The paper does not discuss
future work to solve this problem.
Another work investigates tag-based user profiling for social media
recommendation (Hung et al. 2008). The authors note that tagging is now regularly used
on some social media web pages. The paper describes a new approach to profiling users
based on the tags linked to their personal content. The experiment involved
42,463 users whose bookmarks and associated tags had already been collected and were used
for comparison. The paper does not clearly discuss future work to
extend the research.
A further work states that the purpose of a user profile is to gather related information
based on user interests (Reformat & Golmohammadi 2009). An ontology-based semantic
similarity method is used to extend and maintain a user profile based on the web access
behaviour of a user in the music domain. However, the paper lacks data to justify whether
the method is effective, and no future work is discussed to develop the research.
Other researchers evaluate users in social networks to identify related information using a
quantitative method based on principal component analysis. Interest in social
networking has led millions of users to register with social networking sites such as
YouTube. As a result, social networks are becoming important places for business and
marketing advertisements. To use social networking for advertising, it is important to
classify users and to identify related users.
Researchers at the School of Computing and Information Technology, University of
Wolverhampton, UK, examine the demographics of MySpace member profiles, using two samples
(15,043 and 7,627). The analysis focuses on findings and conjectures about social
networking, for instance that female members are more interested in friendship whereas
male members are more interested in dating. The authors conclude that MySpace users are
typically 21, single and use a public profile, tend to be more interested in friendship and
log in once a week to keep in touch with friends.
Another piece of research identifies individual users by process profiling (McKinney &
Reeves 2009). The authors investigated intrusion detection systems by collecting data from
employees in a small organization over 3 weeks. They examined each employee’s computer
usage to build a user profile using a Naïve Bayes classifier. The authors report that with
this method they successfully identify 98% of users with an error rate of 4%. That research
uses a different method to build individual user profiles, whereas our research uses n-gram
analysis for author identification.
Other research (N.P.Dau, Rau & J.Templeton 2000) examined the UNIX operating
system to identify the user based on the login host, the login time, the command set and
the command set execution time of the profiled user. The authors outline two essential
points. Firstly, user profiles are host-dependent, which means one user could have different
profiles on different hosts. Secondly, profile drift occurs over time, and is divided into
two kinds: forced profile drift and natural profile drift. That research differs from ours,
which focuses on psychometric user characteristics. However, the issues raised in that work
regarding profile drift will be relevant to future work after this project.
Another piece of related research on identifying users in social networks
concerns trust and privacy (Dwyer, Hiltz & Passerini 2007). The authors study
privacy and trust on social networking sites, in particular MySpace and Facebook. They
report that both MySpace and Facebook have the same privacy issues. For
example, Facebook members are as willing to share their information with other members as
MySpace members are. The results show that in online interaction, trust is not as important
for creating new relationships as it is in face-to-face interaction. In conclusion, privacy
and trust in social networking are not yet a concern for behaviour and activity. This paper
shows another interesting type of research that uses social networks for identifying users.
Another interesting piece of research focuses on the connection between network topology
and the semantic similarity of user keywords (Bhattacharyya, Garg & Wu 2009). Keyword
categories and a notion of distance between keywords across multiple category trees are
captured in a forest model. The authors then use the keyword distance to define a similarity
function between a pair of users and show how social network topology could be modelled
based on similarity. Finally, using a simulated social graph, they validate their social
network topology model by comparing it with a real social graph dataset. They conclude that
keywords offer an effective way of analysing and modelling online social networks. In
future work they will explore other methods, focusing on machine learning techniques, to
compare against the forest structure. That research uses the semantic similarity of user
keywords, whereas our research analyses prose text and command line histories.
Work by Takeda et al (2000) addresses characteristic expressions in literary works. Their
problem takes the literary works of one writer as positive examples and works by
another writer as negative examples, in particular for Japanese literature (Waka poems and
prose texts). The method creates a list of substrings ordered by their goodness for the
writer, and the top of the list is examined by a human expert. They propose to assist the
human expert in two ways: restricting the candidates to prime substrings, and providing a
way of browsing that focuses on the string and its content. They successfully identified
characteristic expressions for Waka poems but not for prose texts. Better identification
for prose texts is left as their future work. Our research takes a similar approach in
applying characteristic expressions to prose text.
There is also research on the misuse of social networks for automated user profiling
(Balduzzi et al. 2010). The authors analyse a weakness that arises when users register with
social networking sites such as Facebook using their email address. Starting from a
collection of 10.4 million email addresses, they automatically identified 1.2 million user
profiles associated with these addresses. The paper presents a thorough analysis using
several popular social networking sites such as Facebook, MySpace, XING and Twitter.
Raad et al. (2010) investigate inter-social network functionalities and operations. The
researchers concentrate on user profile matching, and present a framework to match user
profiles across social networks. The framework is able to match the largest possible number
of user profiles that refer to the same person, including those that current approaches are
unable to detect.
Other research investigates how to identify a user based on similar profiles (Vosecky, Dan &
Shen 2009). The authors collected user profiles from social networks such as MSN and
Facebook. The profiles are used to build a profile comparison tool that decides whether
similar profiles belong to the same person. A vector-based comparison algorithm is used to
compare user profiles. The researchers also evaluate the results of the profile comparison
algorithm in two phases, training and testing.
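As a rough illustration of what a vector-based profile comparison might look like, the sketch below treats each profile as a set of named fields, scores each pair of fields with a string similarity, and combines the scores with per-field weights. This is an assumed, simplified reading and not the algorithm of Vosecky et al.; the field names, the weights and the use of Python's `difflib.SequenceMatcher` as the string measure are all choices made here for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict


def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a missing value counts as no match."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def profile_similarity(p: Dict[str, str],
                       q: Dict[str, str],
                       weights: Dict[str, float]) -> float:
    """Weighted average of per-field similarities between two profiles."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = 0.0
    for field, weight in weights.items():
        score += weight * field_similarity(p.get(field, ""), q.get(field, ""))
    return score / total


# Hypothetical profiles and weights, invented for this example.
weights = {"name": 3.0, "email": 2.0, "city": 1.0}
profile_a = {"name": "Jane Doe", "email": "jane@example.org", "city": "Adelaide"}
profile_b = {"name": "jane doe", "email": "jane@example.org", "city": "Adelaide"}
profile_c = {"name": "Bob Smith", "email": "bob@example.net"}

same_person_score = profile_similarity(profile_a, profile_b, weights)
other_person_score = profile_similarity(profile_a, profile_c, weights)
```

In a training phase, the weights and a decision threshold on the combined score would be tuned on labelled profile pairs; in a testing phase, the tuned values would be applied to unseen pairs.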
2.2 n-gram analysis method
Many papers use the n-gram analysis method, and it is implemented in many
applications. n-gram analysis is used for many purposes, including computer virus
detection, author profiling and language-independent authorship attribution. n-gram
analysis is one of the methods used for this project. According to Luo et al (2010), the
n-gram method is a ‘language model based on collinear relation’. This section explores
some of these uses of n-gram analysis and evaluates them for this project.
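To make the notion of an n-gram concrete, the following minimal Python sketch slides a window of n tokens over a sequence and counts each window. The same helper works for a command history (formal language) and for the words of a prose fragment (natural language). The helper names `ngrams` and `ngram_profile` and the sample command history are assumptions made for illustration, not part of any system reviewed here.

```python
from collections import Counter
from typing import Iterator, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def ngram_profile(tokens: Sequence[str], n: int) -> Counter:
    """Count how often each n-gram occurs in a token sequence."""
    return Counter(ngrams(tokens, n))


# Hypothetical command history: one token per command issued.
history = ["cd", "ls", "vim", "make", "ls", "vim", "make", "git"]
# Short prose fragment: one token per word.
prose = "it is a truth universally acknowledged".split()

command_profile = ngram_profile(history, 3)
prose_profile = ngram_profile(prose, 3)
```

In this example the 3-gram ("ls", "vim", "make") occurs twice in the command history, which is the kind of habitual sequence a profile is meant to capture.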
Luo et al (2010) outline an n-gram-based malicious code feature extraction
algorithm that uses a statistical language model. Using a trigram (3-gram) model, they can
characterise malicious code features and hence detect the virus. As a result, they reduce
the time and space required compared with detecting the virus using a heuristic scheme,
which is costly and ineffective. The use of n-gram analysis offers efficiency and
correctness in the analysis of malicious code.
Another approach uses n-gram analysis for virus detection (Reddy & Pujari 2006). The authors
combined several classifiers, such as SVM, IBK and Decision Tree, using Dempster-Shafer
theory to obtain more accurate classification rather than relying on a single classifier.
However, using n-gram analysis for virus detection lacks semantic awareness. As a result,
they had difficulty analysing the relevant n-grams they found.
There is also an n-gram analysis method used to automatically detect malicious code
(Zhang et al. 2007). Experiments were carried out by collecting 201 different Windows
executable files (109 benign and 92 malicious). The results showed that by using the
n-gram method they could successfully distinguish between malicious code and benign code.
Other research is in the area of anomaly detection, using n-gram modelling to create a
normal profile (Hubballi, Biswas & Nandi 2011). They outline an investigation to build a
program model for anomaly detection using an Occurrence Frequency model. The model is
effective for short system call sequences. In addition, the method can build a normal
program model even when there is some level of infection in the training dataset. The
authors also mention that an advantage of the method is that detection becomes
immune to accidental ‘infection’ in the training dataset. The n-gram tree method is an
effective way to create a profile of normal behaviour. The benefits of n-gram tree analysis
are that it is easy to use and fast in operation.
Abou-Assaleh et al. (2004) also use n-gram analysis to detect malicious code. They use an
n-gram analysis method to produce automatic signatures from benign software and malicious
code, because n-gram analysis is able to classify hidden malicious code and benign code.
The n-gram analysis method achieved 90% accuracy on training data for malicious code and
benign code, and 91% accuracy under five-fold cross validation.
n-gram analysis based on author profiles is also applied in authorship attribution (Kešelj
et al. 2003). The researchers outline an automated authorship attribution approach in which
author profiles are created as vectors of language model features, similarities and
weights. n-gram analysis is their approach for obtaining author profiles in a
language-independent way. The experiments use data in several languages, for instance
English, Greek and Chinese, to demonstrate the effectiveness of the approach in terms of
language independence. For the Greek data sets they obtain uniformly better results than
previously reported.
Other research uses character-level language models for authorship attribution (Keselj &
Wang). They examined authorship attribution with character-level n-gram analysis methods.
The authors explained language independence and the theoretical principles in a simple way.
To show the effectiveness of the approach they presented experimental results on English,
Greek and Chinese data. Their approach achieved state-of-the-art performance in every
case. For example, the simple method gave an 18% improvement in accuracy on the Greek data
set compared with previous investigations.
In summary, n-gram analysis has many kinds of application, for instance cryptanalysis,
malicious executable detection, language classification and randomness. In other literature,
n-gram analysis is used for many purposes, including computer virus detection, author
profiling and language-independent authorship attribution. n-gram analysis is one of the
methods that will be used for this project.
2.3 Anomaly Detection
The purpose of the project is to apply anomaly detection over user profiles. According to
Grzech (2006), anomaly detection is a fundamental part of an intrusion detection system. In
addition, we look at other work to compare how profiling and identification for anomaly
detection use different methods, which we explore here.
Grzech (2006) examines different architectures for anomaly detection; for example, it could
be implemented as a multiagent system that supports a classification system to determine
whether activity is normal or abnormal. The author also provides a simple illustration of
the hierarchical architecture of a distributed anomaly detection system, which can be
implemented in the structure of a multiagent decision-support system. The author explains in
detail how anomaly detection divides activity into two categories, normal and abnormal, and
gives a brief example of a hierarchical anomaly detection system.
There is research that investigates anomaly-based intrusion detection in the Linux kernel
(Almassian, Azmi & Berenji 2009). A sufficient feature list was assembled to differentiate
between normal and abnormal behaviour. The model introduces new tools into the Linux kernel
as a protection module that logs initial data to prepare the feature list. A support vector
machine (SVM) was used to recognise and classify the input vectors. The research was
evaluated using three experiments: one-class SVM, binary classification and sequences of
delayed samples. However, future work is not discussed.
Work by Wang et al (2004) outlines the use of non-negative matrix factorization (NMF) to
profile normal program and user behaviours for anomaly detection. This method uses audit
data from system calls and UNIX commands as information sources. A model of normal program
and user behaviour is built from features of this data, and any deviation from normal
program or user behaviour above a predetermined threshold is treated as an anomaly. The
authors tested the method with system call data from the University of New Mexico and Unix
command data from AT&T Research Lab. The aims of the method are lower computational
expense, better detection accuracy and operation as real-time intrusion detection. The
advantages and disadvantages of NMF are also compared in order to obtain an effective
result from the analysis.
Okazaki et al. (2002) investigate two models of intrusion detection: Anomaly Intrusion
Detection (AID) and Misuse Intrusion Detection (MID). Both models analyse statistics of a
process under normal use and of user behaviour, and then verify whether the system is
operating in a dissimilar manner. Methods based on anomaly intrusion detection are able to
identify new intrusion methods. On the other hand, it is necessary to update the statistics
of normal use and the data describing users’ behaviour. MID needs some system resources to
identify intrusions.
Other research outlines the use of system calls for anomaly detection, focusing on
Neuro-Fuzzy learning and the Soundex algorithm (Cha 2005). The Soundex algorithm is used to
convert variable-length data into fixed-length input for feature selection and the learning
system. The two methods, Neuro-Fuzzy and n-gram, are used for anomaly intrusion detection.
To detect intrusions, they classified sessions and generated host behaviour patterns by
changing the variable-length data to fixed-length patterns.
Chapter 3
Methodology
This research aims to identify users from their natural language (the writing styles of Jane
Austen and William Shakespeare) and their formal language (command line history). This
thesis uses the n-gram analysis method for author identification; this is an established
authorship attribution method (as discussed in 2.2 above). Furthermore, the research also
evaluates how quickly the system learns this characteristic of the user model. The structure
of the method is outlined below.
Section 3.1, the experimental set-up, describes how the software counts n-grams for natural
language and formal language. Section 3.2 gives a short introduction to n-gram analysis,
with examples of 3-grams in a binary string and in a sentence. Section 3.3 describes the
types of t-test and which type we use. Section 3.4 shows how the n-gram analysis method will
be used to identify users by their use of natural language. Section 3.5 shows how the n-gram
analysis method will be used to identify users by their use of formal language, such as
commands issued in a command line history. Section 3.6 shows how the accuracy of the two
methods will be compared. Section 3.7 provides information about the user samples.
3.1 Experimental Set Up
This section explains how the application produces the n-gram frequencies. The software
was written in the Java programming language. It contains two classes, “n-gram.java” and
“Data.java”. The program is run with the command “java n-gram [n]”, where n is the n-gram
size. The software produces the n-gram frequencies, which are written as Comma Separated
Value (CSV) files in the “csv” folder and then loaded into Microsoft Excel. The “csv” folder
also contains the history data, which is in txt format.
We will use this software to count the n-grams in the history data that represents each
user’s writing style. The software was provided from previous research (Ashman and Holland).
We use the software to perform four types of n-gram analysis, namely 3-gram, 5-gram, 11-gram
and 15-gram.
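The counting step can be illustrated with a short Python sketch. This is not the Java software described above; the function names, the sample history string and the CSV column names are assumptions made for illustration, and an in-memory buffer stands in for a file in the csv folder.

```python
import csv
import io
from collections import Counter


def ngram_frequencies(text, n):
    """Count every contiguous run of n characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def write_csv(frequencies, handle):
    """Write one 'ngram,count' row per n-gram, most frequent first."""
    writer = csv.writer(handle)
    writer.writerow(["ngram", "count"])
    ordered = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    for gram, count in ordered:
        writer.writerow([gram, count])


# Hypothetical one-line command history, used only for illustration.
history = "ls -la && cd src && ls -la"
freqs = ngram_frequencies(history, 3)

buffer = io.StringIO()  # stands in for a CSV file on disk
write_csv(freqs, buffer)
print(buffer.getvalue())
```

The same function can be called with n set to 3, 5, 11 or 15 to produce the four analyses used in this thesis.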
3.2 n-gram analysis
An n-gram is a contiguous sequence of n letters, words or phonemes. For example, an n-gram
of size 1 is referred to as a unigram, size 2 as a bigram, size 3 as a trigram and size 4 as
a four-gram; the general case is called an n-gram.
n-gram analysis counts the frequency of n-grams in a given file. For example, the binary
string 1110010000101010010000 has the following character-level 3-grams (trigrams):
111, 110, 100, 001, 010, 100, 000, 000, 001, 010, 101, 010, ..., 000
The sentence “in this work we aim to get the certain knowledge” has the following
word-level 3-grams:
in this work
this work we
work we aim
we aim to
aim to get
to get the
get the certain
the certain knowledge
The phrase “in this work” has the following character-level 3-grams, each shown in
quotation marks so that the spaces are visible:
“in ”, “n t”, “ th”, “thi”, “his”, “is ”, “s w”, “ wo”, “wor”, “ork”
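The three examples above can be reproduced with a minimal Python sketch. The function names are illustrative assumptions and are not taken from the software used in this thesis.

```python
def char_ngrams(text, n):
    """Return every contiguous run of n characters, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(sentence, n):
    """Return every contiguous run of n whitespace-separated words, in order."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


bits = "1110010000101010010000"
sentence = "in this work we aim to get the certain knowledge"

print(char_ngrams(bits, 3))
print(word_ngrams(sentence, 3))
print(char_ngrams("in this work", 3))
```

A text of length m yields m - n + 1 n-grams, so the 22-character binary string gives 20 trigrams and the 10-word sentence gives 8 word-level 3-grams.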
This project uses n-grams of varying sizes, namely 3-gram, 5-gram, 11-gram and 15-gram.
Firstly, we will evaluate the use of n-gram analysis of user-generated formal language, in
the form of command line histories, to profile users’ command usage. Secondly, we will
evaluate the use of n-gram analysis of natural language to profile users and to ensure
accurate user identification. After that we will compare the writing styles of each user and
see how different, or how significantly different, their patterns are in terms of natural
language and formal language. Next we will visualise their n-gram patterns graphically to
view their frequency patterns.
3.3 T-Test
The t-test is a method for deciding whether two data sets (samples) are similar or
dissimilar, and hence whether they could have come from the same population. It assesses
whether the means of two groups are statistically different from each other. We will use
t-tests to assess both natural language and formal language samples, between two samples
from the same user (for positive identification purposes) and between two samples from
different users (for negative identification purposes).
We next consider which form of t-test is appropriate to this research.
Types of t-test

One-sample t-test
The one-sample t-test is used to decide whether a specific sample comes from a
specific population, for example whether a specific sample of university students is
similar to or different from university students in general. In the current research
we are comparing series of words or commands, and while it may later be feasible to
identify a user from a single n-gram value, at this early stage it is more
appropriate to decide whether individuals can be identified from larger quantities
of their writing.

Independent t-test
The independent t-test, or two-sample t-test, is used to determine whether the means
of two unrelated groups are statistically similar or different, for example whether
female and male university students differ on some psychological characteristic. In
this research the samples may not be unrelated, especially when comparing two
samples from the same user.

Dependent t-test
The dependent t-test is also called the paired-group, correlated-group,
matched-groups or dependent-group t-test. It is used to compare two related samples
(matched or related in some way) that are each measured once, or the same sample
measured on two separate occasions, for example to find out whether patients’
insomnia is similar or different before and after taking a particular drug. This is
highly suitable for this research, as we need to positively identify a user by
comparing a current sample of the user’s writing to an older sample of their writing.
From the explanation above we conclude that the dependent t-test, or paired-group t-test,
is the most suitable method for our investigation.
We use the t-test by proposing the following competing hypotheses:

The test hypothesis is that the population means underlying the two samples are different.

The null hypothesis is that the population means underlying the two samples are the same.
The t-test outputs a probability value p. This value is compared with the chosen level of
significance α to determine the test result. A common default is α = 0.05:

If the probability value is less than or equal to the level of significance, we reject
the null hypothesis and conclude that the two samples of writing are different.

If the probability value is greater than the level of significance, we do not reject the
null hypothesis and conclude that the two samples of writing style are the same.
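The decision rule can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions: that the paired t-test is applied entry by entry to two equal-length lists of values, that the p-value is two-sided, and that α = 0.05. The p-value is obtained by numerical integration of the tail of Student's t distribution, which is adequate for illustration but is not a substitute for a statistics package. All function names and the sample values are hypothetical.

```python
import math
from statistics import mean, stdev


def two_sided_p(t, dof, steps=20000):
    """Two-sided p-value of Student's t, by Simpson's rule on the tail.

    The substitution x = sqrt(dof) * tan(theta) maps the tail [|t|, inf)
    onto the finite interval [theta0, pi/2].
    """
    theta0 = math.atan(abs(t) / math.sqrt(dof))
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(math.pi))

    def integrand(theta):
        return math.cos(theta) ** (dof - 1)

    h = (math.pi / 2 - theta0) / steps
    total = integrand(theta0) + integrand(math.pi / 2)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(theta0 + k * h)
    return min(1.0, 2 * math.exp(log_c) * total * h / 3)


def paired_t_test(sample_a, sample_b):
    """Return (t, p) for a paired t-test on two equal-length samples."""
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    spread = stdev(diffs)
    if spread == 0:
        # No variation in the differences: identical samples give p = 1.
        return (0.0, 1.0) if mean(diffs) == 0 else (math.inf, 0.0)
    t = mean(diffs) / (spread / math.sqrt(n))
    return t, two_sided_p(t, n - 1)


def same_author(p_value, alpha=0.05):
    """Apply the decision rule: p > alpha means 'same', otherwise 'different'."""
    return p_value > alpha


# Hypothetical values, for illustration only.
current = [11, 14, 10, 13, 14, 12]
earlier = [10, 12, 9, 11, 13, 10]
t_value, p_value = paired_t_test(current, earlier)
print(t_value, p_value, same_author(p_value))
```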

Normalization of samples
Before performing any t-test we examine the distribution of the data collected for each
n-gram size to see whether it is normal or non-normal. If the data is non-normal we
transform it to normalized data. This is because the samples we are analyzing are of
radically different sizes. While it would be possible to choose subsamples from each sample
so that each subsample is an identical size, we elected to normalize each whole sample
instead, as command line users in particular may have different tasks at different times,
and a subsample may not accurately reflect the user’s command line habits. By normalizing
the samples, we most accurately preserve each user’s writing style, but at the same time
cast the samples into the same numerical range so that different sample sizes do not
confound the results.
We use three types of normalization: percentage normalization, max-min normalization and
Z score normalization. Firstly, percentage normalization divides the count of each n-gram
by the total count of all n-grams and multiplies the result by one hundred. Secondly,
max-min normalization scales the n-gram counts by the difference between the maximum and
minimum n-gram counts. Lastly, Z score normalization subtracts the mean of all n-gram
counts from each n-gram count and divides the result by the standard deviation. We will
assess all three normalization methods in this research to determine which is most
appropriate for the task of identifying users.
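The three normalizations can be sketched in Python as follows. Two details are assumptions, because the description above does not fix them: max-min normalization is written in its common form, (count - minimum) / (maximum - minimum), and the Z score uses the population standard deviation. The function names and sample counts are illustrative.

```python
from statistics import mean, pstdev


def percentage_normalize(counts):
    """Each count as a percentage of the total count."""
    total = sum(counts)
    return [100 * c / total for c in counts]


def max_min_normalize(counts):
    """Common max-min form (an assumption): rescale counts to the range 0..1."""
    low, high = min(counts), max(counts)
    return [(c - low) / (high - low) for c in counts]


def z_score_normalize(counts):
    """Subtract the mean, then divide by the (population) standard deviation."""
    mu, sigma = mean(counts), pstdev(counts)
    return [(c - mu) / sigma for c in counts]


counts = [2, 0, 6, 2]  # hypothetical n-gram counts
print(percentage_normalize(counts))
print(max_min_normalize(counts))
print(z_score_normalize(counts))
```

Percentage-normalized values always sum to 100 and Z scores always have a mean of zero, whatever the size of the original sample.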
3.4 Stage 1 profiling using natural language
This natural language stage will evaluate the use of n-gram analysis to profile users
according to their use of natural language.
Research question 1: Does the use of n-gram analysis to profile users’ writing styles in a
social network situation allow accurate user identification?
Study: the first study will analyse whether using n-gram analysis to profile users leads to
accurate user identification, that is, whether it allows users to be positively and
negatively identified.
We need to create n-gram spectra of known users to populate their profiles. For each
user, we will create several n-gram spectra, one for each value of n. Each n-gram
spectrum will consist of an indexed list of values, where the index runs from 0 to
26^n - 1; that is, there is an entry in the list for every possible combination of n
letters of the alphabet. Each letter is assigned a value from 0 to 25:
A = 0
B = 1
C = 2
...
Z = 25
For 3-grams, which take 3 letters at a time, the possible values include
AAA, AAB, ..., AAZ, and likewise BAA, CAA, ..., ZAA
Each 3-gram is then read as a number in base 26, and that number is the index of the
3-gram in the list. For example:
ABC = 0 x 26^2 + 1 x 26^1 + 2 x 26^0 = 0 + 26 + 2 = 28
so the index in the list for the 3-gram AbC is 28.
Note that we will only be doing this task for the letters a..z initially (ignoring case) but if
the method is found to be promising, it will be extended to include all ASCII characters.
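The base-26 indexing can be sketched in Python. The sketch assumes that characters other than a..z are simply dropped before the n-grams are formed, which the description above does not specify; the function names are illustrative.

```python
def ngram_index(gram):
    """Read a string of letters as a base-26 number (A=0 ... Z=25), ignoring case."""
    index = 0
    for letter in gram.lower():
        index = index * 26 + (ord(letter) - ord("a"))
    return index


def spectrum(text, n):
    """Dense n-gram spectrum: a list of 26**n counts, indexed by ngram_index."""
    # Assumption: anything outside a..z is removed before n-grams are formed.
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    counts = [0] * (26 ** n)
    for i in range(len(letters) - n + 1):
        counts[ngram_index("".join(letters[i:i + n]))] += 1
    return counts


print(ngram_index("ABC"))
three_gram_spectrum = spectrum("abc abc", 3)
print(len(three_gram_spectrum), three_gram_spectrum[ngram_index("abc")])
```

A dense list is practical for 3-grams (26^3 = 17,576 entries) but not for 11-grams or 15-grams, where 26^n far exceeds available memory; for those sizes a dictionary keyed by index, storing only the n-grams that actually occur, would be needed.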
For each sample, we calculate an n-gram spectrum of the text. We then compare n-gram
spectra generated by the same author from different samples (in this case, different
books), as well as comparing n-gram spectra generated by different authors. As the
authorship is ‘known’, the comparison will determine whether the method is accurate
enough to identify when the authors are the same or are different.
That comparison will be done via a t-test. For a given value of n, we calculate the n-gram
spectrum of the current user’s input text, compare it with the spectrum with the same
value of n in the user’s profile, and if the t-test shows they have little difference, then the
user is positively identified (i.e. the same author), but if the differences are significant, the
user will be negatively identified (i.e. different authors).
Part of the research is to decide the most useful values of n in the n-gram analysis, for
example whether shorter or longer n-grams will give the most accurate identification.
3.5 Stage 2 profiling using formal language
This formal language stage will evaluate the use of n-gram analysis to profile users’
command usage in their command line histories.
Research question 2: Does the use of n-gram analysis to profile users’ command usage in
their command line histories allow accurate user identification?
Study: this study will work in exactly the same way as stage 1. The only difference is that
we use data from command line histories. For one user, we have four distinct samples,
which allow us to test for positive identification, and we have one sample from each
of four other users, which allows us to test for negative identification.
3.6 Stage 3 comparison of accuracy
This stage will compare the two methods from the previous sections and identify which one
is the most accurate and under which circumstances.
Research question 3: If the profiling of both writing styles and command usage allows
accurate user profiling, which is the most accurate?
Study: this part will assess the method for both formal and natural language, and decide if
either is feasible for user identification. If so, the method will be useful for intrusion
detection as it would be able to both positively and negatively identify users.
3.7 User Samples
In this part we analyze two forms of writing style: natural language and formal language.
a. Natural Language
Two famous authors were used for natural language identification. Firstly, from
William Shakespeare we take three famous plays and one collection of poems: Romeo
and Juliet, Julius Caesar, Hamlet and the Sonnets. Secondly, from Jane Austen we
take Emma, Pride and Prejudice, Sense and Sensibility and Mansfield Park. The figure
below shows how we compare the samples:
[Figure 1 diagram. Jane Austen: Pride and Prejudice, Sense and Sensibility, Mansfield Park,
Emma. William Shakespeare: Romeo and Juliet, Julius Caesar, Sonnet, Hamlet.]
Figure 1: method for comparison of natural language samples
Figure 1 shows how we compare both authors’ writing styles. Firstly, we will see the
result of one author. We compare each of Jane Austen’s writings to each other, using
3-gram, 5-gram, 11-gram and 15-gram analyses. We then use the t-test to measure
their similarity and if the t-test for both pairs in each comparison shows they are
from the same author, we have successfully performed a positive identification.
Secondly, we will do the same procedure for William Shakespeare’s writings.
We will then compare the writing styles of each of Jane Austen’s works with each of
Shakespeare’s and if the t-tests indicate they are different, then we will have
successfully performed a negative identification.
b. Formal Language
Five users were involved in this experiment. One example of formal language is
command line history, where the user interacts with the computer through a
command line. Using n-gram analysis we will identify each user’s ‘writing style’,
namely their command line usage habits. The figure below shows how we
compare each of our formal language samples:
[Figure 2 diagram. Formal language samples. User1: User1-history1, User1-history2,
User1-history3, User1-History4. Other users: User2-History1, User3-History1,
User4-History1, User5-History1.]
Figure 2: method for comparison of formal language samples
We will follow the same procedure for formal language as for natural language,
namely we will analyze each sample, and compare samples by the same user to see
if we can achieve positive identification. We will then compare the samples from
different users to see if we can achieve negative identification.
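The set of comparisons implied by this procedure can be enumerated with a short Python sketch. The mapping from sample name to user follows Figure 2; the variable names are illustrative assumptions.

```python
from itertools import combinations

# Sample name -> user who produced it (as listed in Figure 2).
samples = {
    "User1-history1": "User1",
    "User1-history2": "User1",
    "User1-history3": "User1",
    "User1-history4": "User1",
    "User2-history1": "User2",
    "User3-history1": "User3",
    "User4-history1": "User4",
    "User5-history1": "User5",
}

pairs = list(combinations(samples, 2))

# Same user on both sides: a test of positive identification.
positive = [(a, b) for a, b in pairs if samples[a] == samples[b]]
# Different users: a test of negative identification.
negative = [(a, b) for a, b in pairs if samples[a] != samples[b]]

print(len(pairs), len(positive), len(negative))
```

With four samples from User1 and one sample from each of the other four users, there are 28 pairs in total, of which 6 test positive identification and 22 test negative identification.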
Chapter 4
Results
4.1 Positive Identification using Formal Language
4.1.1 Positive Identification Result of 3 Gram
a. Success and fail for positive identification of T-Test with the same participant (1 person)
Sample 1        Sample 2        Percentage      Max-min         Z score
                                normalization   normalization   normalization
User1-history1  User1-history3  1               2.27E-08        1
User1-history1  User1-history4  1               1.86E-11        1
User1-history1  User1-history2  1               0.194325        1
User1-history3  User1-history4  1               9.42E-11        1
User1-history3  User1-history2  1               7.23E-09        1
User1-history4  User1-history2  1               1.3E-08         1
Table 1: Positive Identification result of 3-gram
As shown in the table above, percentage normalization of the n-gram counts and Z score
normalization indicate that all of the samples (100%) are from the same person. However,
max-min normalisation mistakenly suggests that only User1-history1 and User1-history2 are
the same and that the others are different.
4.1.2 Positive Identification Result of 5 Gram
a. Success and fail for positive identification of T-Test with the same participant (1 person)
Sample 1        Sample 2        Percentage      Max-min         Z score
                                normalization   normalization   normalization
User1-history1  User1-history3  1               0.00915         1
User1-history1  User1-history4  1               1.49E-16        1
User1-history1  User1-history2  1               0.287977        1
User1-history3  User1-history4  1               1.17E-05        1
User1-history3  User1-history2  1               0.004408        1
User1-history4  User1-history2  1               2.34E-10        1
Table 2: Positive Identification result of 5-gram
As shown in the table above, percentage normalization of the n-gram counts and Z score
normalization indicate that all of the samples (100%) are from the same person. However,
max-min normalisation shows that only User1-history1 and User1-history2 are the same and
that the others are different.
4.1.3 Positive Identification Result of 11 Gram
a. Success and fail for positive identification of T-Test with the same participant (1 person)
Sample 1        Sample 2        Percentage      Max-min         Z score
                                normalization   normalization   normalization
User1-history1  User1-history3  1               3.31E-38        1
User1-history1  User1-history4  1               3.32E-14        1
User1-history1  User1-history2  1               4.74E-05        1
User1-history3  User1-history4  1               0.227777        1
User1-history3  User1-history2  1               2.68E-06        1
User1-history4  User1-history2  1               0.001243        1
Table 3: Positive Identification result of 11-gram
As shown in the table above, percentage normalization of the n-gram counts and Z score
normalization indicate that all of the samples (100%) are from the same person. However,
max-min normalisation shows that only User1-history3 and User1-history4 are the same and
that the others are different.
4.1.4 Positive Identification Result of 15 Gram
a. Success and fail for positive identification of T-Test with the same participant (1 person)
Sample 1        Sample 2        Percentage      Max-min         Z score
                                normalization   normalization   normalization
User1-history1  User1-history3  1               2.1E-168        1
User1-history1  User1-history4  1               4.1E-118        1
User1-history1  User1-history2  1               2.8E-151        1
User1-history3  User1-history4  1               0.304123        1
User1-history3  User1-history2  1               0.000854        1
User1-history4  User1-history2  1               0.041853        1
Table 4: Positive Identification result of 15-gram
As shown in the table above, percentage normalization of the n-gram counts and Z score
normalization indicate that all of the samples (100%) are from the same person. However,
max-min normalisation shows that only User1-history3 and User1-history4 are the same and
that the others are different.
4.2 Negative Identification using Formal Language
4.2.1 Negative Identification Result of 3 Gram
a. Success and fail for negative identification of T-Test with different participant
Sample 1        Sample 2        Percentage      Max-min         Z score
                                normalization   normalization   normalization
User2-history1  User3-history1  1.43E-87        0.328289        0.998179
User2-history1  User4-history1  1.2E-183        1               0.830453
User2-history1  User5-history1  5.3E-207        1               0.893169
User3-history1  User4-history1  1.19E-23        1.3E-41         0.838947
User3-history1  User5-history1  8.04E-34        0.862822        0.896617
User4-history1  User5-history1  0.053893        1               0.93775
Table 5: Negative Identification result of 3-gram
As shown in the table above, with percentage normalization of the n-gram counts the t-test
indicates that all the participants are different, except for User4-history1 and
User5-history1, which it says are the same person. Max-min normalization claims that only
User3-history1 and User4-history1 are different and that the rest are the same person.
However, Z score normalization claims that all participants are the same person.
4.2.2 Negative Identification Result of 5 Gram
a. Success and fail for negative identification of T-Test with different participant
Sample 1        Sample 2        Percentage      Max-min         Z score
                                normalization   normalization   normalization
User2-history1  User3-history1  1               0               1
User2-history1  User4-history1  1               0               1
User2-history1  User5-history1  1               0               1
User3-history1  User4-history1  1               0.003366        1
User3-history1  User5-history1  1               7.2E-06         1
User4-history1  User5-history1  1               0.274439        1
Table 6: Negative Identification result of 5-gram
As shown in the table above, percentage normalization of the n-gram counts and Z score
normalization claim that all of the participants (100%) are the same person. However,
max-min normalisation claims that only User4-history1 and User5-history1 are the same person
and that the others are different.
4.2.3 Negative Identification Result of 11 Gram
a. Success and fail for negative identification of T-Test with different participant
Sample 1        Sample 2        Percentage      Max-min         Z score
                                normalization   normalization   normalization
User2-history1  User3-history1  1               0               1
User2-history1  User4-history1  NA              NA              NA
User2-history1  User5-history1  1               0               1
User3-history1  User4-history1  NA              NA              NA
User3-history1  User5-history1  1               1.07E-08        1
User4-history1  User5-history1  NA              NA              NA
Table 7: Negative Identification result of 11-gram
As shown in the table above, percentage normalization of the n-gram counts and Z score
normalization show that User2-history1 vs User3-history1, User2-history1 vs User5-history1
and User3-history1 vs User5-history1 are the same person, whereas the others are different.
However, max-min normalisation shows that only User3-history1 vs User3-history1 and
User3-history1 vs User5-history1 are the same person and that the others are different.
4.2.4 Negative Identification Result of 15 Gram
a. Success and fail for negative identification of T-Test with different participant
Sample 1        Sample 2        Percentage      Max-min         Z score
                                normalization   normalization   normalization
User2-history1  User3-history1  NA              NA              NA
User2-history1  User4-history1  NA              NA              NA
User2-history1  User5-history1  1               0               1
User3-history1  User4-history1  NA              NA              NA
User3-history1  User5-history1  NA              NA              NA
User4-history1  User5-history1  NA              NA              NA
Table 8: Negative Identification result of 15-gram
As shown in the table above, percentage normalization of the n-gram counts, max-min
normalization and Z score normalization show that User2-history1 vs User5-history1 are the
same person, whereas the others are different.
4.2.5 User1 and others comparison for 3-gram
Sample 1        Sample 2        Percentage      Max-min         Z score
                                normalization   normalization   normalization
User1-history1  User1-history3  1               2.1E-168        1
User1-history1  User1-history4  1               4.1E-118        1
User1-history1  User1-history2  1               2.8E-151        1
User1-history1  User2-history1  1               1               3.7E-192
User1-history1  User3-history1  NA              3.6E-211        NA
User1-history1  User4-history1  NA              NA              NA
User1-history1  User5-history1  NA              NA              NA
User1-history3  User1-history4  1               0.304123        1
User1-history3  User1-history2  1               0.000854        1
User1-history3  User2-history1  1               0.31365         1
User1-history3  User3-history1  NA              NA              NA
User1-history3  User4-history1  NA              NA              NA
User1-history3  User5-history1  NA              NA              NA
User1-history4  User1-history2  1               0.041853        1
User1-history4  User2-history1  1               0.085828        1
User1-history4  User3-history1  NA              NA              NA
User1-history4  User4-history1  NA              NA              NA
User1-history4  User5-history1  NA              NA              NA
User1-history2  User2-history1  1               1.46E-05        1
User1-history2  User3-history1  NA              NA              NA
User1-history2  User4-history1  NA              NA              NA
User1-history2  User5-history1  NA              NA              NA
User2-history1  User3-history1  NA              NA              NA
User2-history1  User4-history1  NA              NA              NA
User2-history1  User5-history1  NA              NA              NA
User3-history1  User4-history1  NA              NA              NA
User3-history1  User5-history1  NA              NA              NA
User4-history1  User5-history1  NA              NA              NA
Table 9: User1 and others comparison for 3-gram
Percentage normalisation
As shown in the table, percentage normalization gives some positive identifications
between Sample 1 and Sample 2. For example, User1-history1 vs User1-history3,
User1-history1 vs User1-history4, User1-history1 vs User1-history2, User1-history3 vs
User1-history4, User1-history3 vs User1-history2 and User1-history4 vs User1-history2 all
belong to the same person, and the analysis successfully identified them as positive
identifications.
However, User1-history1 vs User2-history1, User1-history3 vs User2-history1,
User1-history4 vs User2-history1 and User1-history2 vs User2-history1 are shown as the same
person although they are different authors. Furthermore, for the comparisons between the
other machines, the analysis successfully gives negative identifications.
Max-min normalization
Max-min normalization shows some positive identifications, such as User1-history1 vs
User2-history1, User1-history3 vs User2-history1 and User1-history4 vs User2-history1,
although these are different people; only User1-history3 vs User1-history4 are the same
person. Negative identification also gives unsatisfactory results: for instance, for
comparisons between the same person’s histories (User1), this normalization says they are
different people. Furthermore, the comparisons between the other machines (different
people) show successful negative identification.
Z score normalization
Z score normalization shows significant results: for instance, User1-history1 vs
User1-history3, User1-history1 vs User1-history4, User1-history1 vs User1-history2,
User1-history3 vs User1-history4, User1-history3 vs User1-history2 and User1-history4 vs
User1-history2 are the same person, so this normalisation successfully gives positive
identifications. However, this method also gives false identifications: User1-history3 vs
User2-history1, User1-history4 vs User2-history1 and User1-history2 vs User2-history1 are
shown as the same person although they are different people. In conclusion, this
normalization succeeds in giving negative identifications for different people.
4.2.6 User1 and others comparison for 5-gram
a. Success and fail positive and negative identification of T-Test with different participant
Sample 1        Sample 2        Percentage  Max-min     Z score
User1-history1  User1-history3  1           0.00915     1
User1-history1  User1-history4  1           1.49E-16    1
User1-history1  User1-history2  1           0.287977    1
User1-history1  User2-history1  5.34E-13    5.34E-13    8.67E-09
User1-history1  User3-history1  1           1           0.078527
User1-history1  User4-history1  NA          NA          NA
User1-history1  User5-history1  1           1           0.002453
User1-history3  User1-history4  1           1.17E-05    1
User1-history3  User1-history2  1           0.004408    1
User1-history3  User2-history1  6.95E-09    0.002023    3.16E-05
User1-history3  User3-history1  1           0.000436    1
User1-history3  User4-history1  NA          NA          NA
User1-history3  User5-history1  1           2.5E-05     1
User1-history4  User1-history2  1           2.34E-10    1
User1-history4  User2-history1  0.00318     1.24E-12    3.68E-05
User1-history4  User3-history1  1           0.362585    1
User1-history4  User4-history1  NA          NA          NA
User1-history4  User5-history1  1           0.832204    1
User1-history2  User2-history1  1.27E-09    9.47E-10    1.28E-05
User1-history2  User3-history1  1           0.185036    1
User1-history2  User4-history1  NA          NA          NA
User1-history2  User5-history1  1           0.010267    1
User2-history1  User3-history1  NA          0.003355    2.71E-10
User2-history1  User4-history1  NA          0.054025    NA
User2-history1  User5-history1  1           1           9.35E-13
User3-history1  User4-history1  NA          NA          NA
User3-history1  User5-history1  NA          1           0.337993
User4-history1  User5-history1  NA          NA          NA
Table 10: User1 and others comparison for 5-gram
Percentage normalization
As shown in the table, percentage normalization gives a positive identification for several pairs of samples. For example, User1-history1 vs User1-history3, User1-history1 vs User1-history4, User1-history1 vs User1-history2, User1-history3 vs User1-history4, User1-history3 vs User1-history2 and User1-history4 vs User1-history2 all belong to the same person, and the analysis correctly identified them as positive identifications.
However, User1-history1 vs User3-history1, User1-history1 vs User5-history1, User1-history3 vs User3-history1, User1-history3 vs User5-history1, User1-history4 vs User4-history1, User1-history4 vs User5-history1, User1-history2 vs User3-history1 and User1-history2 vs User5-history1 are reported as the same person although they come from different authors. For the remaining comparisons between different machines, the analysis correctly gives a negative identification.
Max-min normalization
Max-min normalization does not give adequate results for either positive or negative identification. Although it identifies some cases correctly, its results are inconsistent and cannot be trusted.
Z score normalization
Z score normalization gives a clear result for the same-user comparisons: User1-history1 vs User1-history3, User1-history1 vs User1-history4, User1-history1 vs User1-history2, User1-history3 vs User1-history4, User1-history3 vs User1-history2 and User1-history4 vs User1-history2 are all reported as the same person, so this normalization succeeds at positive identification. However, it also gives false identifications: for User1-history1 vs User3-history1, User1-history3 vs User3-history1, User1-history3 vs User5-history1, User1-history4 vs User3-history1, User1-history4 vs User5-history1, User1-history2 vs User3-history1 and User3-history1 vs User5-history1, the t-test reports the same person although they are different people. Apart from these cases, this normalization succeeds at negative identification of different people.
4.2.7 User1 and others comparison for 11-gram
a. Successful and failed negative identifications by t-test across different participants
Sample 1        Sample 2        Percentage  Max-min     Z score
User1-history1  User1-history3  1           3.31E-38    1
User1-history1  User1-history4  1           3.32E-14    1
User1-history1  User1-history2  1           4.74E-05    1
User1-history1  User2-history1  1           9.99E-41    1
User1-history1  User3-history1  NA          NA          NA
User1-history1  User4-history1  NA          NA          NA
User1-history1  User5-history1  1           4.25E-57    1
User1-history3  User1-history4  1           0.227777    1
User1-history3  User1-history2  1           2.68E-06    1
User1-history3  User2-history1  1           0.175096    1
User1-history3  User3-history1  NA          NA          NA
User1-history3  User4-history1  NA          NA          NA
User1-history3  User5-history1  1           0.001546    1
User1-history4  User1-history2  1           0.001243    1
User1-history4  User2-history1  1           0.042027    1
User1-history4  User3-history1  NA          NA          NA
User1-history4  User4-history1  NA          NA          NA
User1-history4  User5-history1  1           0.000526    1
User1-history2  User2-history1  1           7.15E-10    1
User1-history2  User3-history1  NA          NA          NA
User1-history2  User4-history1  NA          NA          NA
User1-history2  User5-history1  1           2.01E-15    1
User2-history1  User3-history1  NA          NA          NA
User2-history1  User4-history1  NA          NA          NA
User2-history1  User5-history1  1           0.002685    1
User3-history1  User4-history1  NA          NA          NA
User3-history1  User5-history1  NA          NA          NA
User4-history1  User5-history1  NA          NA          NA
Table 11: User1 and others comparison for 11-gram
Table 11 shows that the comparison of User1's histories with the others gives results similar to those for 3-grams and 5-grams, although some comparisons between different machines still give false identifications.
4.2.8 User1 and others comparison for 15-gram
a. Successful and failed negative identifications by t-test across different participants
Sample 1        Sample 2        Percentage  Max-min     Z score
User1-history1  User1-history3  1           2.1E-168    1
User1-history1  User1-history4  1           4.1E-118    1
User1-history1  User1-history2  1           2.8E-151    1
User1-history1  User2-history1  1           1           3.7E-192
User1-history1  User3-history1  NA          3.6E-211    NA
User1-history1  User4-history1  NA          NA          NA
User1-history1  User5-history1  NA          NA          NA
User1-history3  User1-history4  1           0.304123    1
User1-history3  User1-history2  1           0.000854    1
User1-history3  User2-history1  1           0.31365     1
User1-history3  User3-history1  NA          NA          NA
User1-history3  User4-history1  NA          NA          NA
User1-history3  User5-history1  NA          NA          NA
User1-history4  User1-history2  1           0.041853    1
User1-history4  User2-history1  1           0.085828    1
User1-history4  User3-history1  NA          NA          NA
User1-history4  User4-history1  NA          NA          NA
User1-history4  User5-history1  NA          NA          NA
User1-history2  User2-history1  1           1.46E-05    1
User1-history2  User3-history1  NA          NA          NA
User1-history2  User4-history1  NA          NA          NA
User1-history2  User5-history1  NA          NA          NA
User2-history1  User3-history1  NA          NA          NA
User2-history1  User4-history1  NA          NA          NA
User2-history1  User5-history1  NA          NA          NA
User3-history1  User4-history1  NA          NA          NA
User3-history1  User5-history1  NA          NA          NA
User4-history1  User5-history1  NA          NA          NA
Table 12: User1 and others comparison for 15-gram
Table 12 shows that the comparison of User1's histories with the others gives results similar to those for 3-grams and 5-grams, although some comparisons between different machines still give false identifications. Max-min normalization still gives inadequate results for both positive and negative identification.
4.3 Summary of Formal Language
4.3.1 Positive Identification
Positive Identification (User1 Command Line history for different samples)
n-gram   Normalization Type  Correct Identification  Incorrect Identification  Rate Percentage
3 Gram   Percentage          6/6                     0/6                       100%
3 Gram   Max-Min             1/6                     5/6                       16.6%
3 Gram   Z Score             6/6                     0/6                       100%
5 Gram   Percentage          6/6                     0/6                       100%
5 Gram   Max-Min             1/6                     5/6                       16.6%
5 Gram   Z Score             6/6                     0/6                       100%
11 Gram  Percentage          6/6                     0/6                       100%
11 Gram  Max-Min             1/6                     5/6                       16.6%
11 Gram  Z Score             6/6                     0/6                       100%
15 Gram  Percentage          6/6                     0/6                       100%
15 Gram  Max-Min             1/6                     5/6                       16.6%
15 Gram  Z Score             6/6                     0/6                       100%
Table 13: Positive Identification summary
The table above is a positive identification summary for formal language, for all six possible pairings of samples, and for the four different n-gram lengths, calculated for each of the three normalisation methods.
For all four n-gram lengths, we find that the Percentage and Z Score normalisation methods correctly identify that the user is the same in each case. However, the Max-Min normalisation method fails to identify that the user is the same in all but one case for each n-gram length.
These results suggest that positive identification can be reliably achieved using the n-gram analysis method for formal language, using either the Percentage or Z Score normalisation method. However, they also indicate that the Max-Min normalisation method is not useful for positive identification in formal language samples.
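How a column of t-test results turns into a "1/6 correct" entry can be sketched as follows. The sketch assumes that the table entries are p-values and that a pair is judged to be the same user when the p-value exceeds 0.05; neither the interpretation nor the threshold is stated in this section, so both are assumptions, as are the names `ALPHA` and `same_user`.

```python
ALPHA = 0.05  # assumed significance level; not stated in this section

def same_user(p_value, alpha=ALPHA):
    """Treat a pair as the same user when the t-test finds no significant difference."""
    return p_value > alpha

# Max-min values for the six same-user pairings in Table 12 (15-gram).
max_min_15gram = [2.1e-168, 4.1e-118, 2.8e-151, 0.304123, 0.000854, 0.041853]

correct = sum(same_user(p) for p in max_min_15gram)
rate = 100.0 * correct / len(max_min_15gram)
print(f"{correct}/{len(max_min_15gram)} correct ({rate:.1f}%)")
```

Under these assumptions only User1-history3 vs User1-history4 passes the threshold, which agrees with the single correct Max-Min identification reported for 15-grams.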
4.3.2 Negative Identification
Negative Identification User1 vs (user2 vs (user3, user4, user5))
n-gram   Normalization Type  Correct Identification  Incorrect Identification  Rate Percentage
3 Gram   Percentage          23/28                   5/28                      82.14
3 Gram   Max-Min             20/28                   8/28                      71.43
3 Gram   Z Score             19/28                   9/28                      67.86
5 Gram   Percentage          13/28                   15/28                     46.43
5 Gram   Max-Min             17/28                   11/28                     60.71
5 Gram   Z Score             14/28                   14/28                     50.00
11 Gram  Percentage          16/28                   12/28                     57.14
11 Gram  Max-Min             26/28                   2/28                      92.86
11 Gram  Z Score             16/28                   14/28                     57.14
15 Gram  Percentage          23/28                   5/28                      82.14
15 Gram  Max-Min             24/28                   4/28                      85.71
15 Gram  Z Score             24/28                   4/28                      85.71
Table 14: Negative Identification summary
The table above is a negative identification summary for formal language, for all possible pairings of samples, and for the four different n-gram lengths, calculated for each of the three normalisation methods.
The results are less clear than those observed in the positive identification table. The Max-Min normalisation method is correct between 60.71% and 92.86% of the time, an improvement over its use in positive identification. The other two normalisation methods were not as reliable as in the positive identification tests: Percentage is correct between 46.43% and 82.14% of the time, and Z Score between 50.00% and 85.71%.
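The Rate Percentage column follows directly from the counts of correct identifications. A minimal sketch for the 3-gram rows of Table 14, with a hypothetical helper name:

```python
def rate_percentage(correct, total):
    """Return the share of correct identifications as a percentage, to two decimals."""
    return round(100.0 * correct / total, 2)

# Correct identifications out of 28 pairings for the 3-gram rows of Table 14.
three_gram_correct = {"Percentage": 23, "Max-Min": 20, "Z Score": 19}
for method, correct in three_gram_correct.items():
    print(method, rate_percentage(correct, 28))
```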
Chapter 5
Results
5.1 Positive identification using natural language
5.1.1 Positive Identification Result of 3 Gram
a. Successful and failed positive identifications by t-test for the same participant, Jane Austen (1 person)
Novel 1                Novel 2                Percentage  Max-min       Z score
Pride and Prejudice    Sense and Sensibility  1           2.52E-22      1
Pride and Prejudice    Mansfield Park         1           0.170681      1
Pride and Prejudice    Emma                   1           1E-18         1
Mansfield Park         Sense and Sensibility  1           0.533889      1
Mansfield Park         Mansfield Park         1           1.49E-24      1
Sense and Sensibility  Mansfield Park         1           1.04E-39      1
Table 15: Positive Identification result of 3-gram
As shown in the table above, the percentage of n-gram count and Z score normalizations report that all pairs (100%) are by the same person. However, Max-min normalisation reports that only Pride and Prejudice vs Mansfield Park, Pride and Prejudice vs Emma and Emma vs Mansfield Park are by the same person, and that the others are different.
5.1.2 Positive Identification Result of 5 Gram
a. Successful and failed positive identifications by t-test for the same participant, Jane Austen (1 person)
Novel 1                Novel 2                Percentage  Max-min       Z score
Pride and Prejudice    Sense and Sensibility  1           0             1
Pride and Prejudice    Mansfield Park         1           4.4E-146      1
Pride and Prejudice    Emma                   1           2.8E-108      1
Mansfield Park         Sense and Sensibility  1           1E-175        1
Mansfield Park         Mansfield Park         1           8.24E-05      1
Sense and Sensibility  Mansfield Park         1           1.2E-164      1
Table 16: Positive Identification result of 5-gram
As shown in the table above, the percentage of n-gram count and Z score normalizations report that all pairs (100%) are by the same person. However, Max-min normalisation reports that only Pride and Prejudice vs Sense and Sensibility is by the same person, and that the others are different.
5.1.3 Positive Identification Result of 11 Gram
a. Successful and failed positive identifications by t-test for the same participant, Jane Austen (1 person)
Novel 1                Novel 2                Percentage  Max-min       Z score
Pride and Prejudice    Sense and Sensibility  1           0             1
Pride and Prejudice    Mansfield Park         1           0             1
Pride and Prejudice    Emma                   1           0             1
Mansfield Park         Sense and Sensibility  1           9.55E-71      1
Mansfield Park         Mansfield Park         1           2E-99         1
Sense and Sensibility  Mansfield Park         1           0.035547      1
Table 17: Positive Identification result of 11-gram
As shown in the table above, the percentage of n-gram count and Z score normalizations report that all pairs (100%) are by the same person. However, Max-min normalisation reports that Pride and Prejudice vs Sense and Sensibility, Pride and Prejudice vs Mansfield Park and Pride and Prejudice vs Emma are by the same person, and that the others are by different people.
5.1.4 Positive Identification Result of 15 Gram
a. Successful and failed positive identifications by t-test for the same participant, Jane Austen (1 person)
Novel 1                Novel 2                Percentage  Max-min       Z score
Pride and Prejudice    Sense and Sensibility  1           0             1
Pride and Prejudice    Mansfield Park         1           0             1
Pride and Prejudice    Emma                   1           0             1
Mansfield Park         Sense and Sensibility  1           5.16E-29      1
Mansfield Park         Mansfield Park         1           1.75E-20      1
Sense and Sensibility  Mansfield Park         1           1.58E-119     1
Table 18: Positive Identification result of 15-gram
As shown in the table above, the percentage of n-gram count and Z score normalizations report that all pairs (100%) are by the same person. However, Max-min normalisation reports that Pride and Prejudice vs Sense and Sensibility, Pride and Prejudice vs Mansfield Park and Pride and Prejudice vs Emma are by the same person, and that the others are by different people.
5.2 Negative Identification Using Natural Language
5.2.1 Negative Identification Result of 3 Gram
a. Successful and failed negative identifications by t-test across different participants, William Shakespeare and Jane Austen (2 persons)
Novel 1                Novel 2                Percentage  Max-min       Z score
Julius Caesar          Romeo and Juliet       1           4.21E-08      1
Julius Caesar          Hamlet                 1           1.5E-101      1
Julius Caesar          Sonnet                 1           7.5E-108      1
Julius Caesar          Sense and Sensibility  1           2.2E-142      1
Julius Caesar          Pride and Prejudice    1           2.1E-108      1
Julius Caesar          Mansfield Park         1           9.12E-62      1
Julius Caesar          Emma                   1           3E-104        1
Romeo and Juliet       Hamlet                 1           1.4E-120      1
Romeo and Juliet       Sonnet                 1           8.5E-147      1
Romeo and Juliet       Sense and Sensibility  1           3.2E-112      1
Romeo and Juliet       Pride and Prejudice    1           1.06E-63      1
Romeo and Juliet       Mansfield Park         1           8.4E-126      1
Romeo and Juliet       Emma                   1           1.95E-71      1
Hamlet                 Sonnet                 1           4.1E-17       1
Hamlet                 Sense and Sensibility  1           0.000987      1
Hamlet                 Pride and Prejudice    1           2.05E-26      1
Hamlet                 Mansfield Park         1           3.19E-09      1
Hamlet                 Emma                   1           3.8E-116      1
Sonnet                 Sense and Sensibility  1           1.6E-71       1
Sonnet                 Pride and Prejudice    1           5.77E-31      1
Sonnet                 Mansfield Park         1           2.98E-94      1
Sonnet                 Emma                   1           8.37E-48      1
Sense and Sensibility  Pride and Prejudice    1           6.07E-87      1
Sense and Sensibility  Mansfield Park         1           9.96E-05      1
Sense and Sensibility  Emma                   1           1.85E-22      1
Pride and Prejudice    Mansfield Park         1           4.71E-25      1
Pride and Prejudice    Emma                   1           1.17E-64      1
Mansfield Park         Emma                   1           1.17E-64      1
Table 19: Negative Identification result of 3-gram
As shown in the table above, the percentage of n-gram count and Z score normalizations report that all pairs (100%) are by the same person. However, Max-min normalisation reports that the two participants are different.
5.2.2 Negative Identification Result of 5 Gram
a. Successful and failed negative identifications by t-test across different participants, William Shakespeare and Jane Austen (2 persons)
Novel 1                Novel 2                Percentage  Max-min       Z score
Julius Caesar          Romeo and Juliet       1           2.37E-87      1
Julius Caesar          Hamlet                 1           3.2E-162      1
Julius Caesar          Sonnet                 1           1.5E-226      1
Julius Caesar          Sense and Sensibility  1           0.00591       1
Julius Caesar          Pride and Prejudice    1           2.34E-22      1
Julius Caesar          Mansfield Park         1           0.000294      1
Julius Caesar          Emma                   1           2.07E-22      1
Romeo and Juliet       Hamlet                 1           0.000195      1
Romeo and Juliet       Sonnet                 1           8.91E-47      1
Romeo and Juliet       Sense and Sensibility  1           2.82E-49      1
Romeo and Juliet       Pride and Prejudice    1           1.15E-18      1
Romeo and Juliet       Mansfield Park         1           2.16E-54      1
Romeo and Juliet       Emma                   1           2.52E-24      1
Hamlet                 Sonnet                 1           1.87E-31      1
Hamlet                 Sense and Sensibility  1           8.93E-88      1
Hamlet                 Pride and Prejudice    1           3.05E-36      1
Hamlet                 Mansfield Park         1           2.42E-88      1
Hamlet                 Emma                   1           3.26E-47      1
Sonnet                 Sense and Sensibility  1           1.2E-161      1
Sonnet                 Pride and Prejudice    1           9.8E-108      1
Sonnet                 Mansfield Park         1           2.7E-164      1
Sonnet                 Emma                   1           1.7E-122      1
Sense and Sensibility  Pride and Prejudice    1           1.49E-09      1
Sense and Sensibility  Mansfield Park         1           0.614744      1
Sense and Sensibility  Emma                   1           3.64E-08      1
Pride and Prejudice    Mansfield Park         1           3.37E-30      1
Pride and Prejudice    Emma                   1           0.045969      1
Mansfield Park         Emma                   1           8.66E-29      1
Table 20: Negative Identification result of 5-gram
As shown in the table above, the percentage of n-gram count and Z score normalizations report that all pairs (100%) are by the same person. However, Max-min normalisation reports that only Sense and Sensibility vs Mansfield Park is by the same person, and that the others are different.
5.2.3 Negative Identification Result of 11 Gram
a. Successful and failed negative identifications by t-test across different participants, William Shakespeare and Jane Austen (2 persons)
Novel 1                Novel 2                Percentage  Max-min       Z score
Julius Caesar          Romeo and Juliet       1           na            1
Julius Caesar          Hamlet                 1           0             1
Julius Caesar          Sonnet                 1           3.30392E-39   1
Julius Caesar          Sense and Sensibility  1           1.18605E-32   1
Julius Caesar          Pride and Prejudice    1           7.98933E-07   1
Julius Caesar          Mansfield Park         1           9.62269E-21   1
Julius Caesar          Emma                   1           3.26764E-06   1
Romeo and Juliet       Hamlet                 1           0             1
Romeo and Juliet       Sonnet                 1           3.30392E-39   1
Romeo and Juliet       Sense and Sensibility  1           1.18605E-32   1
Romeo and Juliet       Pride and Prejudice    1           7.98933E-07   1
Romeo and Juliet       Mansfield Park         1           9.62269E-21   1
Romeo and Juliet       Emma                   1           3.26764E-06   1
Hamlet                 Sonnet                 1           1.0359E-21    1
Hamlet                 Sense and Sensibility  1           9.87798E-17   1
Hamlet                 Pride and Prejudice    1           0.236207615   1
Hamlet                 Mansfield Park         1           4.82439E-05   1
Hamlet                 Emma                   1           0.000530174   1
Sonnet                 Sense and Sensibility  1           0.289320395   1
Sonnet                 Pride and Prejudice    1           6.93376E-22   1
Sonnet                 Mansfield Park         1           3.68687E-10   1
Sonnet                 Emma                   1           2.30855E-28   1
Sense and Sensibility  Pride and Prejudice    1           3.93646E-16   1
Sense and Sensibility  Mansfield Park         1           9.64117E-07   1
Sense and Sensibility  Emma                   1           1.72025E-21   1
Pride and Prejudice    Mansfield Park         1           2.04396E-08   1
Pride and Prejudice    Emma                   1           0.088959355   1
Mansfield Park         Emma                   1           5.5606E-14    1
Table 21: Negative Identification result of 11-gram
As shown in the table above, the percentage of n-gram count and Z score normalizations report that all pairs (100%) are by the same person. However, Max-min normalisation reports five comparisons as the same: Pride and Prejudice vs Emma (same author), Sonnet vs Sense and Sensibility (different authors), Hamlet vs Pride and Prejudice (different authors), Romeo and Juliet vs Emma (different authors) and Julius Caesar vs Hamlet (same author). The others are reported as different.
5.2.4 Negative Identification Result of 15 Gram
a. Successful and failed negative identifications by t-test across different participants, William Shakespeare and Jane Austen (2 persons)
Novel 1                Novel 2                Percentage  Max-min       Z score
Julius Caesar          Romeo and Juliet       1           3.7E-246      1
Julius Caesar          Hamlet                 1           4.02E-09      1
Julius Caesar          Sonnet                 0           6.35E-45      1
Julius Caesar          Sense and Sensibility  1           1.84E-37      1
Julius Caesar          Pride and Prejudice    1           1.36E-37      1
Julius Caesar          Mansfield Park         1           1.88E-19      1
Julius Caesar          Emma                   1           8.79E-31      1
Romeo and Juliet       Hamlet                 1           4.7E-117      1
Romeo and Juliet       Sonnet                 0           2.4E-102      1
Romeo and Juliet       Sense and Sensibility  1           2.1E-109      1
Romeo and Juliet       Pride and Prejudice    1           2.4E-102      1
Romeo and Juliet       Mansfield Park         1           2.1E-109      1
Romeo and Juliet       Emma                   1           9.85E-55      1
Hamlet                 Sonnet                 1           NA            1
Hamlet                 Sense and Sensibility  1           0.040804      1
Hamlet                 Pride and Prejudice    1           0.084871      1
Hamlet                 Mansfield Park         1           0.263444      1
Hamlet                 Emma                   1           0.145122      1
Sonnet                 Sense and Sensibility  NA          1.87E-11      1
Sonnet                 Pride and Prejudice    NA          1.46E-12      1
Sonnet                 Mansfield Park         NA          5.69E-20      1
Sonnet                 Emma                   NA          2.03E-07      1
Sense and Sensibility  Pride and Prejudice    1           0.099914      1
Sense and Sensibility  Mansfield Park         1           0.566338      1
Sense and Sensibility  Emma                   1           0.330546      1
Pride and Prejudice    Mansfield Park         1           0.860759      1
Pride and Prejudice    Emma                   1           0.860759      1
Mansfield Park         Emma                   1           0.523756      1
Table 22: Negative Identification result of 15-gram
Percentage normalization
For 15-grams, percentage normalization reports that Sonnet vs Sense and Sensibility, Pride and Prejudice, Mansfield Park and Emma are different, and that the others are the same.
Max-min
For 15-grams, max-min normalization reports some pairs of works as similar:
Hamlet vs Pride and Prejudice (different authors),
Hamlet vs Mansfield Park (different authors),
Hamlet vs Emma (different authors),
Sense and Sensibility vs Pride and Prejudice (same author),
Sense and Sensibility vs Mansfield Park (same author),
Sense and Sensibility vs Emma (same author),
Pride and Prejudice vs Mansfield Park (same author),
Pride and Prejudice vs Emma (same author),
Mansfield Park vs Emma (same author).
The other pairs are reported as different.
Z score
For 15-grams, Z score normalisation reports some comparisons as similar and five comparisons as different.
5.3 Summary of Natural Language
5.3.1 Positive Identification
n-gram   Normalization Type  Positive Result (Correct Identification)  Negative Result (False Identification)  Rate Percentage
3 Gram   Percentage          18/18                                     0/18                                    100%
3 Gram   Max-Min             2/18                                      16/18                                   11.11%
3 Gram   Z Score             18/18                                     0/18                                    100%
5 Gram   Percentage          18/18                                     0/18                                    100%
5 Gram   Max-Min             2/18                                      16/18                                   11.11%
5 Gram   Z Score             18/18                                     0/18                                    100%
11 Gram  Percentage          18/18                                     0/18                                    100%
11 Gram  Max-Min             6/18                                      12/18                                   33.33%
11 Gram  Z Score             18/18                                     0/18                                    100%
15 Gram  Percentage          18/18                                     0/18                                    100%
15 Gram  Max-Min             9/18                                      9/18                                    50%
15 Gram  Z Score             18/18                                     0/18                                    100%
Table 23: Positive Identification summary
The table above is a positive identification summary for natural language, for all possible pairings of same-author samples, and for the four different n-gram lengths, calculated for each of the three normalisation methods.
Firstly, for 3-grams the percentage and Z score normalizations have a 100% success rate. However, max-min normalization identifies only 11.11% of the same-author pairs as similar, which means that it fails at positive identification. Secondly, for 5-grams, 11-grams and 15-grams the percentage and Z score normalizations are again 100% successful at positive identification. Max-min, on the other hand, gives a different result for each n-gram length: 11.11% for 5-grams (the same as for 3-grams), 33.33% for 11-grams and 50% for 15-grams.
5.3.2 Negative Identification
n-gram   Normalization Type  Positive Result (Correct Identification)  Negative Result (False Identification)  Rate Percentage
3 Gram   Percentage          0/16                                      16/16                                   0%
3 Gram   Max-Min             16/16                                     0/16                                    100%
3 Gram   Z Score             0/16                                      16/16                                   0%
5 Gram   Percentage          0/16                                      16/16                                   0%
5 Gram   Max-Min             16/16                                     0/16                                    100%
5 Gram   Z Score             0/16                                      16/16                                   0%
11 Gram  Percentage          0/16                                      16/16                                   0%
11 Gram  Max-Min             16/16                                     0/16                                    100%
11 Gram  Z Score             0/16                                      16/16                                   0%
15 Gram  Percentage          0/16                                      16/16                                   0%
15 Gram  Max-Min             2/16                                      14/16                                   12.5%
15 Gram  Z Score             0/16                                      16/16                                   0%
Table 24: Negative Identification summary
The table above is a negative identification summary for natural language, for all possible pairings of different-author samples, and for the four different n-gram lengths, calculated for each of the three normalisation methods.
The results are unsatisfactory for every n-gram length: with percentage and Z score normalization, negative identification for natural language fails completely. Max-min normalization correctly gives a negative identification in 100% of cases for 3-grams, 5-grams and 11-grams. However, we cannot trust max-min normalization, since in both formal language and natural language it reports the samples as different in most comparisons, whether or not they come from the same person.
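One way to see why a 100% negative identification rate is not trustworthy on its own is to count how often a method answers "same author" across both summary tables. The sketch below is an illustrative reading of the counts in Tables 23 and 24, not part of the original analysis; the function name is hypothetical.

```python
def same_verdict_share(correct_positive, positive_total, correct_negative, negative_total):
    """Fraction of all comparisons for which the method answered 'same author'."""
    same_on_positive = correct_positive                   # same-author pairs called 'same'
    same_on_negative = negative_total - correct_negative  # different-author pairs called 'same'
    return (same_on_positive + same_on_negative) / (positive_total + negative_total)

# 3-gram rows of Tables 23 and 24.
print(same_verdict_share(18, 18, 0, 16))   # Percentage
print(same_verdict_share(2, 18, 16, 16))   # Max-Min
```

With these counts, percentage normalization answers "same author" for every comparison and max-min for very few, so each method scores perfectly on one table largely because of its bias rather than its discrimination.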
Chapter 6
Conclusion and Future Work
6.1 Conclusion
In this thesis, we investigated users' writing styles with the aim of identifying users both positively and negatively. We investigated formal language and natural language using the n-gram methodology. There were five participants for formal language and two famous writers for natural language. We compared the results of the n-gram analyses for each participant and assessed how successful each comparison was using a t-test for paired two samples for means. The results show that formal language can identify users in terms of both positive and negative identification. For natural language, however, the n-gram analysis is successful for positive identification but not for negative identification. Thus, formal language is shown to be more generally accurate.
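The statistic behind a t-test for paired two samples for means can be sketched as follows. This is a minimal standard-library illustration of the textbook formula (mean of the pairwise differences divided by its standard error); the sample values are hypothetical, and converting the statistic into a p-value would additionally require the t distribution with n - 1 degrees of freedom, which is left out here.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(sample_a, sample_b):
    """Return the paired-samples t statistic and its degrees of freedom."""
    if len(sample_a) != len(sample_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are equal; t is undefined")
    return mean(diffs) / (sd / sqrt(n)), n - 1

# Hypothetical normalized frequencies of the same four n-grams in two samples.
t, df = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(t, df)
```

The test is paired because each position in the two lists must refer to the same n-gram, so the two n-gram spectra have to be aligned before they are compared.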
Research question 1: Does the use of n-gram analysis to profile users’ command usage in their command line histories allow accurate user identification?
a. Positive Identification of Formal Language
Normalization Type  Success Total
Percentage          100%
Max-min             16.66%
z Score             100%
Table 25: Success total
The investigation of positive identification successfully identifies the same user, especially with percentage and z score normalization, which both show 100% success. Max-min normalization, by contrast, shows only 16.66%.
b. Negative Identification of Formal Language
Normalization Type  Success Total
Percentage          66.96
Max-min             77.67
z Score             65.17
Table 26: Success total
The investigation of negative identification successfully distinguishes different users with every type of normalization: percentage gives 66.96%, max-min 77.67% and z score 65.17%. The results are therefore satisfactory: each type of normalization succeeds in identifying different users, even though the success percentages differ.
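The success totals in Table 26 appear to be the mean of the four per-n-gram rate percentages in Table 14. The sketch below checks that reading; it is an assumption about how the totals were derived, since the text does not say so explicitly.

```python
from statistics import mean

# Rate percentages from Table 14, ordered 3-, 5-, 11- and 15-gram.
negative_rates = {
    "Percentage": [82.14, 46.43, 57.14, 82.14],
    "Max-min": [71.43, 60.71, 92.86, 85.71],
    "z Score": [67.86, 50.00, 57.14, 85.71],
}

success_totals = {method: mean(rates) for method, rates in negative_rates.items()}
for method, total in success_totals.items():
    print(f"{method}: {total:.4f}")
```

The means agree with the reported totals to within 0.01 percentage points, which supports this reading.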
Research question 2: Does the use of n-gram analysis to profile users’ writing styles in
natural language situation allow accurate user identification?
a. Positive Identification of Natural Language
Normalization Type  Success Total
Percentage          100%
Max-min             26.38%
z Score             100%
Table 27: Success total
The investigation of positive identification successfully identifies the same author, especially with percentage and z score normalization, which both show 100% success. Max-min normalization, by contrast, shows only 26.38%.
b. Negative Identification of Natural Language
Normalization Type  Success Total
Percentage          0%
Max-min             100.00%
z Score             0%
Table 28: Success total
The investigation of negative identification is unsuccessful with percentage and z score normalization: both types of normalization show 0% success in telling different authors apart. Max-min normalization does succeed at negative identification, but we cannot trust it because this type of normalization consistently gives unreliable answers.
Research question 3: If the profiling of both writing styles and command usage allows accurate user profiling, which is the most accurate?
According to the results of the investigation, formal language gives the more accurate user profiling, because it succeeds at both positive and negative user identification.
6.2 Future Work
The main objective of this research is to identify a user’s writing style in terms of formal language and natural language, especially for user reauthentication. This research used n-gram methods to assess the user in two writing styles. The investigation used software created by a previous researcher for counting n-grams. The software stores the n-gram counts from the participants' history files in an Excel file. We then use Excel to create an n-gram spectrum for each sample, and from the n-gram spectra we compare the participants to see the similarities and differences in their writing styles.
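The counting step described above can be sketched in a few lines. This is not the software used in the investigation; it is a minimal illustration that assumes an n-gram is any run of n consecutive items, where the items may be characters or command tokens. The function name and the sample history are hypothetical.

```python
from collections import Counter

def ngram_counts(items, n):
    """Count every run of n consecutive items (characters or command tokens)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))

# Hypothetical command history, reduced to one token per command name.
history = ["ls", "cd", "ls", "cd", "vim"]
spectrum = ngram_counts(history, 2)
for gram, count in spectrum.most_common():
    print(" ".join(gram), count)
```

The resulting counts play the role of the n-gram spectrum: normalizing them and aligning two spectra on the same n-grams gives the paired samples that are then compared.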
Further experiments are needed to continue the investigation, as follows.
Firstly, for formal language we could divide the histories by period of time, for instance per month or per week, rather than comparing different machines in different workplaces, because a user's state of mind may differ between working in the office and working at home.
Secondly, we should try other n-gram lengths, such as 1, 2, 4, 6, 7, 8, 9, 10, 12 and 13, since every n-gram length appears to show a different result, and another length could give a more accurate result for formal and natural language.
Thirdly, we can assume that one person could have more than one writing style. We should try to collect each of their writing styles and match them against the other users' writing styles.
Finally, another analysis should compare each pair of participants in both directions. For example, after comparing person A's writing style to person B's, we should swap the order so that B is also compared to A. We can then compare the two results and see how they differ.
6.3 Summary
In summary, this thesis has shown that the use of n-gram analysis for identifying users in a reauthentication situation is feasible, and that further research in this area will deliver additional results.
References:
ABOU-ASSALEH, T., CERCONE, N., KESELJ, V. & SWEIDAN, R. Year. n-gram-based detection of new
malicious code. In: Computer Software and Applications Conference, 2004. COMPSAC 2004.
Proceedings of the 28th Annual International, 28-30 Sept. 2004 2004. 41-42 vol.2.
Ahmed, A.A.E, 2005, 'Detecting computer intrusions using behavioral biometrics'.
Alexandre, TJ 1997, 'Biometrics on smart cards: An approach to keyboard behavioral signature',
Future Generation Computer Systems, vol. 13, no. 1, pp. 19-26.
Almassian, N, Azmi, R & Berenji, S 2009, 'AIDSLK: An Anomaly Based Intrusion Detection System in
Linux Kernel', Information Systems, Technology and Management, pp. 232-243.
Ashman, H & Holland, S 'Profiling and identifying users with n-gram analysis on their command line
histories' (in draft).
Balduzzi, M, Platzer, C, Holz, T, Kirda, E, Balzarotti, D & Kruegel, C 2010, 'Abusing Social Networks for
Automated User Profiling', in Jha, S, Sommer, R & Kreibich, C (eds), Recent Advances in Intrusion
Detection, vol. 6307, Springer Berlin / Heidelberg, pp. 422-441.
BHATTACHARYYA, P., GARG, A.&WU, S. F.Social Network Model Based on Keyword Categorization.
Social Network Analysis and Mining, 2009. ASONAM '09. International Conference on Advances in,
20-22 July 20092009. 170-175.
CANALI, C., CASOLARI, S. & LANCELLOTTI, R. A quantitative methodology to identify relevant users in
social networks. 2010.
Cha, B 2005, 'Host Anomaly Detection Performance Analysis Based on System Call of Neuro-Fuzzy
Using Soundex Algorithm and n-gram Technique', Proceedings of the 2005 Systems Communications.
COLES, R.&HODGKINSON, G. P.2008. A Psychometric Study of Information Technology Risks in the
Workplace. Risk Analysis,28,81-93.
DWYER, C., HILTZ, S. R. & PASSERINI, K. Year. Trust and privacy concern within social networking sites:
A comparison of Facebook and MySpace. In, 2007. Citeseer.
FELT, A. & EVANS, D. 2008. Privacy protection for social networking APIs. 2008 Web 2.0 Security and
Privacy (W2SPí08).
GRZECH, A. 2006. Anomaly detection in distributed computer communication systems. Cybernetics
and Systems, 37, 635-652.
GUO, L., TAN, E., CHEN, S., ZHANG, X. & ZHAO, Y. E. Year. Analyzing patterns of user content
generation in online social networks. In, 2009. ACM, 369-378.
HUBBALLI, N., BISWAS, S. & NANDI, S. Year. Sequencegram: n-gram modeling of system calls for
program based anomaly detection. In: Communication Systems and Networks (COMSNETS), 2011
Third International Conference on, 4-8 Jan. 2011 2011. 1-10.
69
Hung, J,Y, Huang Y,C, Hsu, J, Y & Wu, D, K, C 2008, 'Tag-Based user profiling for social media
recommendation'.
Keselj, FPDSV & Wang, S 'Language Independent Authorship Attribution using Character Level
Language Models'.
KEÖELJ, V., PENG, F., CERCONE, N. & THOMAS, C. Year. n-gram-based author profiles for authorship
attribution. In, 2003. Citeseer.
LUO, F., OU, Q.&WEI, G.Research on n-gram-based malicious code feature extraction algorithm.
Computer Application and System Modeling (ICCASM), 2010 International Conference on, 22-24 Oct.
20102010. V6-89-V6-92.
Maia, M, Almeida, J, Virg & Almeida, l 2008, 'Identifying user behavior in online social networks',
Proceedings of the 1st Workshop on Social Network Systems, Glasgow, Scotland.
MATYA, X, X030C, V. & IHA, Z. Year. Security of biometric authentication systems. In: Computer
Information Systems and Industrial Management Applications (CISIM), 2010 International
Conference on, 8-10 Oct. 2010 2010. 19-28.
McKinney, S & Reeves, DS 2009, 'User identification via process profiling: extended abstract',
Proceedings of the 5th Annual Workshop on Cyber Security and Information Intelligence Research:
Cyber Security and Information Intelligence Challenges and Strategies, Oak Ridge, Tennessee.
N.P.Dau, V, Rau, V & J.Templeton, S 2000, 'profiling users in the UNIX OS Environment'.
Pannell, G & Ashman, H 2010, 'User Modelling for Exclusion and Anomaly Detection: A Behavioural
Intrusion Detection System', in De Bra, P, Kobsa, A & Chin, D (eds), User Modeling, Adaptation, and
Personalization, vol. 6075, Springer Berlin / Heidelberg, pp. 207-218.
OKAZAKI, Y., SATO, I. & GOTO, S. A new intrusion detection method based on process profiling.
Applications and the Internet, 2002. (SAINT 2002). Proceedings. 2002 Symposium on, 2002 2002. 8290.
Pepyne, D,L , Hu, J & Gong W 2004, 'User profiling for computer security'.
Reddy, DKS & Pujari, AK 2006, 'n-gram analysis for computer virus detection', Journal in
Computer Virology, vol. 2, no. 3, pp. 231-239.
REFORMAT, M.&GOLMOHAMMADI, S. K. Year. Updating user profile using ontology-based semantic
similarity. In: Fuzzy Systems, 2009. FUZZ-IEEE 2009. IEEE International Conference on, 20-24 Aug.
20092009. 1062-1067.
TAKEDA, M., MATSUMOTO, T., FUKUDA, T. & NANRI, I. Year. Discovering characteristic expressions
from literary works: A new text analysis method beyond n-gram statistics and KWIC. In, 2000.
Springer, 112-126.
THELWALL, M. 2008. Social networks, gender, and friending: An analysis of MySpace member
profiles. Journal of the American Society for Information Science and Technology, 59, 1321-1330.
70
WEI, W., XIAOHONG, G. & XIANGLIANG, Z. Profiling program and user behaviors for anomaly
intrusion detection based on non-negative matrix factorization. Decision and Control, 2004. CDC.
43rd IEEE Conference on, 14-17 Dec. 2004 2004. 99-104 Vol.1.
Zhang, B, Yin, J, Hao, J, Wang, S & Zhang, D 2007, 'New Malicious Code Detection Based on N-Gram
Analysis and Rough Set Theory', in Wang, Y, Cheung, Y-m & Liu, H (eds), Computational Intelligence
and Security, vol. 4456, Springer Berlin / Heidelberg, pp. 626-633.
Appendix A