Stamatatos - unb-dw

Summary of “A Survey of Modern Authorship Attribution Methods”
Stamatatos’s paper surveys modern authorship attribution methods, characterizing each by its
text representation and text-classification approach. It also discusses the computational
requirements of these methods. The premise of authorship identification is that by measuring
certain textual features, texts written by different authors can be distinguished. Various
stylometric features allow one to measure writing style. These features are classified into
five categories: lexical, character, syntactic, semantic, and application-based. The tools and
resources required depend on the stylometric feature being measured. In addition, some
features are better suited to certain kinds of text than to others.
Lexical features view a text as a sequence of tokens. They can be measured for any language,
but they are difficult to apply to languages such as Chinese, where splitting text into word
tokens is not straightforward. These features are measured in methods such as:
1. Vocabulary richness methods, which attempt to quantify the diversity of the vocabulary
in a text.
2. Word n-gram methods, which measure sequences of n contiguous words instead of single
tokens, providing more contextual information.
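The two lexical methods listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`tokenize`, `type_token_ratio`, `word_ngrams`) are hypothetical, the tokenizer is a deliberately naive assumption, and the type-token ratio is used here as one simple example of a vocabulary richness measure, not necessarily the one favored in the survey.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: a naive tokenizer that lowercases the text and keeps
    # runs of ASCII letters and apostrophes. Real systems need more care.
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(tokens):
    # Vocabulary richness as distinct words (types) divided by
    # total words (tokens). Returns 0.0 for an empty text.
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def word_ngrams(tokens, n):
    # Count every sequence of n contiguous tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = tokenize("The cat sat on the mat")
richness = type_token_ratio(tokens)   # 5 distinct words out of 6 tokens
bigrams = word_ngrams(tokens, 2)      # 5 bigrams, e.g. ("the", "cat")
```

Note that the type-token ratio depends on text length, so values are only comparable between texts of similar size.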
Character features view text as a sequence of characters. These features are
language-independent and require no special tools. Their dimensionality, however, is higher
than that of lexical features. They are used in methods that measure features such as
character types, character n-grams of fixed length, and character n-grams of variable length,
as well as in compression methods.
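As a rough illustration of fixed-length character n-grams, the sketch below counts every substring of length n in a text. The function name `char_ngrams` is hypothetical, and the choice to keep whitespace and punctuation as ordinary characters is an assumption.

```python
from collections import Counter

def char_ngrams(text, n):
    # Slide a window of n characters over the text and count each
    # substring. No tokenizer or language-specific tool is needed.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

counts = char_ngrams("banana", 2)
# "an" and "na" each occur twice, "ba" occurs once
```

Because every distinct substring becomes a feature, the number of features grows quickly with n, which is the dimensionality cost mentioned above.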
Syntactic features are more complex to measure and require robust and accurate NLP tools to
perform syntactic analysis of texts. They are language-dependent, because they rely on the
availability of a parser for the text’s language. Syntactic methods may measure
part-of-speech tags, chunks, rewrite-rule frequencies, and syntactic errors.
Semantic features are even more complex than syntactic features and require semantic analysis
that is not yet handled adequately by NLP technology for unrestricted text. Features in this
category include synonyms and semantic relationships such as hyponyms and hypernyms. Few
methods attempt to measure these semantic features.
Overall, the more detailed the text analysis required to extract stylometric features, the less
accurate the produced measures. Methods that require syntactic parsing or semantic
analysis can therefore become noisy and inaccurate.
Measuring the stylometric features is only part of the task for authorship identification.
Identifying the author for a given piece of text is the other part, and there are two approaches:
profile-based and instance-based.
The profile-based approach concatenates all the available training texts per author in a single
file. The author’s style or profile is extracted from the concatenated text. Differences between
texts written by the same author are disregarded. The profile-based approach has a training
phase where it extracts the profiles for the candidate authors. From the training phase it
develops an attribution model, which is usually based on a distance function. This function
computes the difference between the profile of an unseen text (one whose authorship is
unknown) and the profile of each candidate author. The author whose profile has the minimum
distance to the unseen text is the most likely author. The function may also be probabilistic,
in which case the author whose profile has the maximum probability is the most likely author.
Other functions exist, such as the common n-grams (CNG) dissimilarity function, which
measures the dissimilarity between two profiles by computing the relative difference between
the frequencies of their common n-grams. CNG has been applied successfully in various
authorship-identification experiments. It is, however, only useful when the training corpora
of different authors are relatively balanced in length. To overcome this limitation, another
function, the Simplified Profile Intersection (SPI), has been proposed. SPI simply counts the
number of n-grams common to the two profiles and disregards the rest. It provided better
results than CNG for source-code authorship identification. Other variations of the CNG
dissimilarity function also exist.
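The profile-based pipeline described above can be sketched as follows. This is a hedged illustration, not the survey’s reference implementation: the function names (`profile`, `cng_dissimilarity`, `spi_similarity`, `attribute`) are hypothetical, the n-gram length of 3 and profile size of 500 are arbitrary assumptions, and the CNG formula follows the commonly cited form in which the squared relative frequency difference is summed over every n-gram that appears in either profile, with a missing n-gram counted as frequency 0.

```python
from collections import Counter

def profile(text, n=3, size=500):
    # A profile maps the `size` most frequent character n-grams of a
    # text to their normalized frequencies.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(size)}

def cng_dissimilarity(p1, p2):
    # Sum of squared relative differences. Assumption: an n-gram
    # absent from one profile has frequency 0 there.
    d = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        d += (2 * (f1 - f2) / (f1 + f2)) ** 2
    return d

def spi_similarity(p1, p2):
    # SPI counts shared n-grams only. It is a similarity, so the
    # most likely author is the one with the highest value.
    return len(set(p1) & set(p2))

def attribute(unseen_text, author_texts, n=3, size=500):
    # Training: concatenate each author's texts into one profile.
    profiles = {a: profile(" ".join(texts), n, size)
                for a, texts in author_texts.items()}
    unseen = profile(unseen_text, n, size)
    # Attribution: choose the author at minimum CNG dissimilarity.
    return min(profiles, key=lambda a: cng_dissimilarity(unseen, profiles[a]))
```

Swapping in SPI would mean replacing `min` over `cng_dissimilarity` with `max` over `spi_similarity`.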
The instance-based approach takes multiple training text samples per author and develops an
attribution model. It does not disregard differences between texts written by the same author.
In instance-based approaches, a classification algorithm is trained to develop the attribution
model, which then attempts to estimate the true author of an unseen text. Training texts must
be segmented into multiple parts, typically of equal length. When there are multiple training
texts of different lengths per author, they should be normalized so that each author’s texts
are segmented into equal-sized samples.
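The segmentation step might look like the sketch below, which turns each author’s texts into equal-sized training instances. The names `segment` and `build_instances` are hypothetical. Measuring sample size in whitespace-separated words, concatenating an author’s texts before cutting, and dropping a short trailing remainder are all assumptions made for illustration.

```python
def segment(tokens, size):
    # Cut a token list into consecutive chunks of exactly `size`
    # tokens. A trailing remainder shorter than `size` is dropped.
    return [tokens[i:i + size]
            for i in range(0, len(tokens) - size + 1, size)]

def build_instances(author_texts, size):
    # Produce (sample_text, author) pairs, one per equal-sized chunk.
    instances = []
    for author, texts in author_texts.items():
        # Assumption: join the author's texts first, so texts of
        # different lengths still yield same-sized samples.
        tokens = " ".join(texts).split()
        for chunk in segment(tokens, size):
            instances.append((" ".join(chunk), author))
    return instances

samples = build_instances({"A": ["a b c", "d e"], "B": ["x y"]}, 2)
# [("a b", "A"), ("c d", "A"), ("x y", "B")]
```

Each pair can then be converted into a feature vector and passed to whichever classification algorithm is used to build the attribution model.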
There are also hybrid approaches that combine characteristics of both profile-based and
instance-based methods.