SCRUB (Secure Computing Research for Users' Benefit)

JStylo: An Authorship-Attribution Platform
and its Applications
Introduction
JStylo is a platform for conducting supervised stylometry
experiments, that is, authorship attribution based on linguistic style.
It uses NLP techniques to extract linguistic features from
documents, and supervised machine learning methods to
classify those documents based on the extracted features.
The platform's feature-extraction core is based on the JGAAP
API [1]; the available classifiers include Weka [2]
classifiers and an implementation of the Writeprints [3]
classifier.
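As a rough illustration of the two stages described above, the following Python sketch extracts a small feature vector from each document and attributes a test document to the author with the nearest mean vector. It is not JStylo's implementation (JStylo builds on JGAAP and Weka). The function-word list, the helper names and the nearest-centroid classifier are all assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import defaultdict

# Hypothetical, tiny function-word list; real feature sets are far larger.
FUNCTION_WORDS = ["the", "of", "and", "to", "i", "a", "in", "that"]


def extract_features(text):
    """Return relative function-word frequencies plus average word length."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty documents
    freqs = [words.count(w) / total for w in FUNCTION_WORDS]
    avg_len = sum(len(w) for w in words) / total
    return freqs + [avg_len]


def train(labeled_docs):
    """Average the feature vectors of each author's training documents."""
    grouped = defaultdict(list)
    for author, text in labeled_docs:
        grouped[author].append(extract_features(text))
    return {
        author: [sum(col) / len(vecs) for col in zip(*vecs)]
        for author, vecs in grouped.items()
    }


def classify(centroids, text):
    """Attribute the text to the author whose centroid is closest."""
    vec = extract_features(text)
    return min(centroids, key=lambda a: math.dist(centroids[a], vec))


training = [
    ("A1", "I think that I will go to the market, and I will buy a hat."),
    ("A1", "I hope that I can see the play, and I want to see it soon."),
    ("A2", "The analysis of the results of the experiment confirms the theory."),
    ("A2", "The structure of the document reflects the intent of the committee."),
]
centroids = train(training)
predicted = classify(centroids, "I believe that I will travel, and I will write to you.")
```

Here `predicted` is the author label whose training documents most resemble the test sentence under these toy features. A real experiment would use many more features and a stronger classifier.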
Platform Overview
1. Problem Definition: training documents by Author 1, Author 2, …, Author N, plus the test documents
2. Feature Selection: a feature set f1, f2, f3, …, fM with document preprocessing, feature extraction and feature postprocessing (normalization, factoring)
3. Classifiers Selection: classifiers c1, c2, c3, …, cL
4. Analysis: documents are preprocessed, features are extracted and post-processed, and the selected classifiers are either cross-validated on the training set (CV results) or trained on the training documents and used to classify the test documents (classification results)
[Figure: Anonymouth workflow. Inputs: my docs and the document to anonymize. Learn styles, then check if anonymized; if NO, suggest changes and change the document; if YES, the document is anonymized.]
[1] Juola, P., et al.: JGAAP, a Java-Based, Modular, Program for Textual Analysis, Text Categorization, and Authorship Attribution (2009)
[5] Brennan, M. and Greenstadt, R.: Practical Attacks Against Authorship Recognition Techniques (2009)
Evaluation
A sample evaluation using the Writeprints feature set with the
Weka SMO SVM classifier on the Extended Brennan-Greenstadt Adversarial corpus [5]:
• 45 authors
• > 6,500 words per author, divided into ~500-word
documents
• 10-fold cross-validation:
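  Number of authors   JStylo-Writeprints-SMO accuracy (%)   Random chance (%)
  5                   97.67                                 20
  15                  94.62                                 6.67
  25                  90.25                                 4
  35                  89.72                                 2.85
  45                  88.12                                 2.22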
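To make the evaluation protocol concrete, the sketch below shows one way to form a 10-fold cross-validation split and how a random-chance baseline follows as 100/N percent for N equally likely authors. The helper names, the round-robin fold assignment and the document count of 130 are assumptions for illustration; this is not Weka's own cross-validation routine.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    folds = [list(range(i, n_items, k)) for i in range(k)]  # round-robin folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def random_chance(n_authors):
    """Expected accuracy (%) of guessing uniformly among n_authors."""
    return 100.0 / n_authors


# 130 is an arbitrary document count used only for this illustration.
splits = list(k_fold_splits(n_items=130, k=10))
baseline_45 = random_chance(45)
```

Each document lands in exactly one test fold, so every document is classified once by a model that never saw it during training.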
[Figure: candidate language families F1, F2, F3, each family Fi with candidate languages Li1, Li2, Li3, feeding a classify-language step.]
Personal Traits Identification: Native Language
Using Language-Family Information
• Classify documents by native language
• Set a threshold T on the classification probability
• Reclassify instances classified with probability p < T using
language-family information to improve language classification
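The thresholding steps above can be sketched as follows. The family map, the probability dictionaries and the function names are hypothetical. Reusing the language probabilities inside the chosen family is an assumption standing in for whatever per-family classifier is actually used.

```python
# Hypothetical family map; the poster does not list the languages used.
FAMILIES = {
    "Romance": ["Spanish", "Italian"],
    "Germanic": ["German", "Dutch"],
}


def best(probs):
    """Return the (label, probability) pair with the highest probability."""
    label = max(probs, key=probs.get)
    return label, probs[label]


def classify_native_language(lang_probs, family_probs, threshold):
    """Keep a confident language prediction; otherwise go through the family.

    lang_probs   -- language -> probability from a language classifier
    family_probs -- family -> probability from a family classifier
    threshold    -- T; predictions with p < T are reclassified
    """
    language, p = best(lang_probs)
    if p >= threshold:
        return language
    family, _ = best(family_probs)
    # Re-rank only the candidate languages inside the predicted family.
    within = {lang: lang_probs.get(lang, 0.0) for lang in FAMILIES[family]}
    return best(within)[0]


confident = classify_native_language(
    {"Spanish": 0.7, "Italian": 0.1, "German": 0.1, "Dutch": 0.1},
    {"Romance": 0.8, "Germanic": 0.2},
    threshold=0.5,
)
uncertain = classify_native_language(
    {"German": 0.35, "Spanish": 0.30, "Italian": 0.25, "Dutch": 0.10},
    {"Romance": 0.7, "Germanic": 0.3},
    threshold=0.5,
)
```

In the second call the top language falls below T, so the family prediction restricts the candidates before a language is chosen.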
[2] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The Weka Data Mining Software: An Update (2009)
[3] Abbasi, A., Chen, H.: Writeprints: A Stylometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace (2008)
[4] McDonald, A., Afroz, S., Caliskan, A., Stolerman, A. and Greenstadt, R.: Use Fewer Instances of the Letter "i": Toward Writing Style Anonymization (2012)
Document Anonymization
Using Anonymouth [4]
• JStylo as an authorship-attribution engine to evaluate
anonymization level
[Figure label: “Blend-in” corpus]
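One way to picture an attribution engine acting as the check inside an anonymization loop is the sketch below. It is not Anonymouth's method: `attribute` and `suggest_change` are hypothetical stand-ins, and the toy "style" (heavy use of exclamation marks) exists only to make the example runnable.

```python
def anonymize(document, author, attribute, suggest_change, max_rounds=10):
    """Revise the document until `attribute` stops naming `author`.

    attribute      -- stand-in for the attribution engine: text -> author label
    suggest_change -- stand-in for the editing step: text -> revised text
    Returns (final_document, anonymized_flag).
    """
    for _ in range(max_rounds):
        if attribute(document) != author:
            return document, True  # check passed: document anonymized
        document = suggest_change(document)
    return document, attribute(document) != author


# Toy stand-ins: the "style" is heavy use of exclamation marks.
def toy_attribute(text):
    return "me" if text.count("!") >= 2 else "someone else"


def toy_suggest_change(text):
    return text.replace("!", ".", 1)  # soften the first exclamation mark


final_text, anonymized = anonymize(
    "Great! Really! Wow!", "me", toy_attribute, toy_suggest_change
)
```

The loop stops as soon as the stand-in engine no longer attributes the text to its author, or after `max_rounds` revisions.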
Novelty
• Cumulative feature-set analysis (vs. one feature at a time)
• Added feature extractors and processing tools
• Readability / complexity metrics
• Regular-expression-based features
• Counters (word / letter / regular expression)
• High feature-level customizability
• Factoring and normalization
• Uses Weka classifiers
• Provides an implementation of Writeprints
Source: https://github.com/psal/JStylo-Anonymouth
Motivation
• Important for research in history, literature and forensics
• Impact on privacy and anonymity in online environments:
• Reveal identity: users can use various tools to hide their
location, but their writing style may still be exposed. JStylo
provides a convenient platform for developing methods to
reveal anonymous identities.
• Preserve anonymity: on the other hand, JStylo can be
used to develop and test methods for securing
anonymous communication, such as Anonymouth [4].
• Stylometry research is useful not only for revealing
identities, but also for inferring author characteristics such as age, gender,
native language and personality type.
Applications
[Figure: classify language L; if P > T, keep L; if P < T, classify family Fi, then classify language Lij within that family.]
Stylometry-Based Authentication
• An attacker may hold a legitimate user's credentials
• Learn the legitimate user’s writing style
• Record user activity and use stylometry to verify that the
user is who s/he claims to be
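A minimal sketch of the idea, assuming a character-trigram profile and a cosine-similarity threshold, is shown below. Both the features and the threshold of 0.5 are illustrative assumptions and not the features or classifier used in the actual work.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams (case-sensitive on purpose)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(p, q):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = math.sqrt(sum(c * c for c in p.values())) * math.sqrt(
        sum(c * c for c in q.values())
    )
    return dot / norm if norm else 0.0


def authenticate(profile, typed_text, threshold=0.5):
    """Accept the session only if the typed text resembles the profile."""
    return cosine_similarity(profile, trigram_profile(typed_text)) >= threshold


# Train on the legitimate user's writing, then test text typed in a session.
legit_writing = "i usually write short, plain sentences. i rarely use capitals."
profile = trigram_profile(legit_writing)
accepted = authenticate(profile, "I AM A MALICIOUS USER, BEWARE")
```

The all-capitals message shares no trigram with the lowercase training text, so the session is rejected even though the credentials were valid.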
[Figure: training uses the legitimate user's writing; testing uses text entered under the legitimate credentials, which a malicious user may be typing ("…I AM A MALICIOUS USER, BEWARE…").]
[Chart: accuracy (%) against number of authors for JStylo-Writeprints-SMO and random chance, from the sample evaluation.]