Stamatatos - unb-dw

Summary of “A Survey of Modern Authorship Attribution Methods”

Stamatatos’s paper surveys authorship attribution methods, covering their text-representation and text-classification characteristics as well as their computational requirements. The idea behind authorship identification is that, by measuring certain textual features, texts written by different authors can be distinguished.

Several kinds of stylometric features can be used to measure writing style. They are classified as lexical, character, syntactic, semantic, and application-based features. The tools and resources required depend on the feature being measured, and some features are better suited to certain texts than others.

Lexical features view a text as a sequence of tokens. They can be measured for any language, but they are difficult to apply to some languages, such as Chinese, where splitting text into tokens is not trivial. These features are measured in methods such as:

1. Vocabulary richness methods, which attempt to quantify the diversity of the vocabulary in a text.
2. Word n-gram methods, which measure sequences of n contiguous words instead of single tokens, providing more contextual information.

Character features view a text as a sequence of characters. These features are language-independent and require no special tools. Their dimensionality, however, is higher than that of lexical features. They are used in methods that measure character types, character n-grams of fixed length, and character n-grams of variable length, as well as in compression methods.

Syntactic features are more complex to measure and require robust and accurate NLP tools to perform syntactic analysis of texts. They are language-dependent because they rely on the availability of a parser for the text’s language. Methods based on syntactic features may measure parts of speech, chunks, rewrite-rule frequencies, and errors.
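The lexical and character features described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the naive regular-expression tokenizer, and the choice of the type-token ratio as the vocabulary richness measure are all assumptions made for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase runs of letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(tokens):
    # One simple vocabulary richness measure: distinct words / total words.
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def word_ngrams(tokens, n):
    # Counts of sequences of n contiguous words.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n):
    # Counts of sequences of n contiguous characters, taken from the raw text
    # (no tokenizer or other language-specific tool is needed).
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    sample = "the cat sat on the mat"
    tokens = tokenize(sample)
    print(type_token_ratio(tokens))
    print(word_ngrams(tokens, 2).most_common(3))
    print(char_ngrams(sample, 3).most_common(3))
```

Even on this short sample, the character trigram table has more distinct entries than the word bigram table, which hints at the higher dimensionality of character features.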
Semantic features are even more complex than syntactic features and require semantic analysis that current NLP technology does not yet handle adequately for unrestricted text. Features in this category include synonyms and semantic relationships such as hyponyms and hypernyms. Few methods attempt to measure semantic features.

Overall, the more detailed the text analysis required to extract stylometric features, the less accurate the resulting measures. Methods that require syntactic parsing or semantic analysis can therefore become noisy and inaccurate.

Measuring stylometric features is only part of the authorship-identification task. The other part is identifying the author of a given piece of text, for which there are two approaches: profile-based and instance-based.

The profile-based approach concatenates all the available training texts per author into a single file. The author’s style, or profile, is extracted from the concatenated text, so differences between texts written by the same author are disregarded. In a training phase, the approach extracts the profiles of the candidate authors and, from them, develops an attribution model. The attribution model is usually based on a distance function that computes the difference between the profile of an unseen text (one whose authorship is unknown) and the profile of each candidate author. The author whose profile has the minimum distance to the unseen text is the most likely author. The function may also be probabilistic, in which case the author whose profile has the maximum probability is the most likely author.

Other functions exist, such as the common n-grams (CNG) function, which calculates the dissimilarity between two profiles by computing the relative difference between their common n-grams. This approach has been applied successfully in various authorship-identification experiments.
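As a rough illustration of profile-based attribution with a CNG-style dissimilarity, consider the Python sketch below. It rests on several assumptions that the summary does not state: a profile is taken to be the most frequent character n-grams with their normalized frequencies, the dissimilarity follows one commonly cited formulation that sums squared relative differences over the n-grams of both profiles (treating a missing n-gram as frequency 0), and the names `profile`, `cng_dissimilarity`, and `attribute`, along with the parameters `n` and `size`, are hypothetical.

```python
from collections import Counter


def profile(text, n=3, size=1000):
    # Assumed profile: the `size` most frequent character n-grams,
    # each mapped to its normalized frequency in the text.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.most_common(size)}


def cng_dissimilarity(p1, p2):
    # Sum of squared relative frequency differences over the n-grams of both
    # profiles; an n-gram absent from one profile counts as frequency 0.
    score = 0.0
    for gram in set(p1) | set(p2):
        f1, f2 = p1.get(gram, 0.0), p2.get(gram, 0.0)
        score += (2 * (f1 - f2) / (f1 + f2)) ** 2
    return score


def attribute(unseen_text, training_texts, n=3, size=1000):
    # training_texts maps author -> list of texts. All texts of an author are
    # concatenated, so differences between an author's texts are ignored.
    unseen = profile(unseen_text, n, size)
    distances = {
        author: cng_dissimilarity(unseen, profile(" ".join(texts), n, size))
        for author, texts in training_texts.items()
    }
    # The candidate with the minimum dissimilarity is the most likely author.
    return min(distances, key=distances.get)
```

Calling `attribute(unseen_text, {"author_a": [...], "author_b": [...]})` returns the key of the closest profile. Any other distance function could be substituted for `cng_dissimilarity` without changing the rest of the sketch.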
The CNG function is, however, only useful when the training corpora of the different authors are relatively balanced in length. To overcome this problem, another function, the Simplified Profile Intersection (SPI), has been proposed. It simply counts the number of n-grams common to the two profiles and disregards the rest. This measure gave better results than the CNG function for source-code authorship identification. Other variations of the CNG dissimilarity function also exist.

The instance-based approach takes multiple training text samples per author and develops an attribution model from them. It does not disregard differences between texts written by the same author. A classification algorithm is trained to develop the attribution model, which then attempts to estimate the true author of an unseen text. Training texts must be segmented into multiple parts, preferably of equal length. When an author has multiple training texts of different lengths, they should be normalized so that each author’s texts are segmented into equal-sized samples.

There are also hybrid approaches that combine characteristics of both profile-based and instance-based methods.
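The segmentation step of the instance-based approach can be sketched in Python as follows. This is only one possible reading of the summary: measuring sample size in whitespace-separated words, dropping a trailing remainder shorter than the sample size, and the names `segment` and `build_instances` are all assumptions. The resulting labeled samples would then be turned into feature vectors and passed to a classification algorithm, which is not shown here.

```python
def segment(text, words_per_sample):
    # Split a text into consecutive samples of exactly `words_per_sample` words.
    # A trailing remainder shorter than that is dropped (an assumption).
    words = text.split()
    last_start = len(words) - words_per_sample
    return [
        " ".join(words[i:i + words_per_sample])
        for i in range(0, last_start + 1, words_per_sample)
    ]


def build_instances(training_texts, words_per_sample=500):
    # training_texts maps author -> list of texts of possibly different lengths.
    # Returns (sample, author) pairs, each sample being one training instance.
    instances = []
    for author, texts in training_texts.items():
        for text in texts:
            for sample in segment(text, words_per_sample):
                instances.append((sample, author))
    return instances
```

Unlike the profile-based approach, each sample keeps its own label, so variation between texts by the same author remains visible to the classifier.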
