JStylo: An Authorship-Attribution Platform and its Applications

Introduction
JStylo is a platform for conducting supervised stylometry experiments: authorship attribution based on linguistic style. It uses NLP techniques to extract linguistic features from documents, and supervised machine learning methods to classify those documents based on the extracted features. The platform's feature-extraction core is based on the JGAAP API [1], and the available classifiers include Weka [2] classifiers and an implementation of the Writeprints [3] classifier.

Motivation
• Important for research in history, literature and forensics
• Impact on privacy and anonymity in online environments:
  • Reveal identity: users can use various tools to hide their location, but their writing style may still expose them. JStylo provides a convenient platform for developing methods to reveal anonymous identities.
  • Preserve anonymity: conversely, JStylo can be used for developing and testing methods to secure anonymous communication, like Anonymouth [4].
• Stylometry research is useful not only for revealing identities, but also for inferring author characteristics, like age, gender, native language and personality type.

Novelty
• Cumulative feature-set analysis (vs. one feature at a time)
• Added feature extractors and processing tools
  • Readability / complexity metrics
  • Regular-expression-based features
  • Counters (word / letter / regular expression)
• High feature-level customizability
  • Factoring and normalization
• Uses Weka classifiers
• Provides an implementation of Writeprints
Source: https://github.com/psal/JStylo-Anonymouth

Platform Overview
1. Problem Definition: training documents (Author 1, Author 2, …, Author N) and test documents
2. Feature Selection: feature set f1, f2, f3, …, fM, with document preprocessing, feature extraction and feature postprocessing (normalization, factoring)
3. Classifiers Selection: classifiers c1, c2, c3, …, cL
4. Analysis: feature extraction from training and test documents, followed by cross-validation on the training set (CV results) or train/test classification (classification results)
[Figure: JStylo workflow diagram covering the four steps above]

Evaluation
A sample evaluation using the Writeprints feature set with the Weka SMO SVM classifier on the Extended Brennan-Greenstadt Adversarial corpus [5]:
• 45 authors
• > 6,500 words per author, divided into ~500-word documents
• 10-fold cross-validation:

Number of authors   JStylo-Writeprints-SMO accuracy (%)   Random chance (%)
5                   97.67                                 20
15                  94.62                                 6.67
25                  90.25                                 4
35                  89.72                                 2.85
45                  88.12                                 2.22

References
[1] Juola, P., et al.: JGAAP, a Java-Based, Modular, Program for Textual Analysis, Text Categorization, and Authorship Attribution (2009)
[2] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The Weka Data Mining Software: An Update (2009)
[3] Abbasi, A., Chen, H.: Writeprints: A Stylometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace (2008)
[4] McDonald, A., Afroz, S., Caliskan, A., Stolerman, A., Greenstadt, R.: Use Fewer Instances of the Letter "i": Toward Writing Style Anonymization (2012)
[5] Brennan, M., Greenstadt, R.: Practical Attacks Against Authorship Recognition Techniques (2009)

Applications

Document Anonymization Using Anonymouth [4]
• JStylo serves as the authorship-attribution engine that evaluates the anonymization level
[Figure: Anonymouth workflow. Inputs: document to anonymize, "My docs", and a "blend-in" corpus. Loop: learn styles, suggest changes, change document, check if anonymized (NO: repeat; YES: document anonymized).]

Stylometry-Based Authentication
• An attacker may have obtained a user's credentials
• Learn the legitimate user's writing style
• Record user activity and use stylometry to verify that the user is who they claim to be
[Figure: train on the legitimate user's writing; test on text entered by a malicious user holding legitimate credentials ("…I AM A MALICIOUS USER, BEWARE…").]

Personal Traits Identification: Native Language Using Language-Family Information
• Classify documents by native language
• Set a threshold T on the classification probability
• Use language-family reclassification for instances classified with probability p < T to improve language classification
[Figure: a document is first classified into a language L among the candidate languages. If p > T, L is the result. If p < T, the document is classified into a family Fi among the candidate families F1, F2, F3, and then into a language Lij among that family's candidate languages Li1, Li2, Li3.]
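
The language-family reclassification used for native-language identification can be illustrated with a short Python sketch. This is only a sketch under stated assumptions, not JStylo's implementation: the classifier interface (a function returning a label and its probability), the name `classify_native_language`, the toy classifiers and their labels are all hypothetical, and treating p = T as confident is a choice the poster does not specify.

```python
from typing import Callable, Dict, Tuple

# Hypothetical interface: a classifier maps a document to
# (predicted label, probability assigned to that label).
Classifier = Callable[[str], Tuple[str, float]]


def classify_native_language(
    document: str,
    language_clf: Classifier,
    family_clf: Classifier,
    within_family_clfs: Dict[str, Classifier],
    threshold: float,
) -> str:
    """Return a native-language label, falling back to family information."""
    language, p = language_clf(document)
    # Assumption: p == T counts as confident; the text only covers p < T.
    if p >= threshold:
        return language
    # Low confidence: classify the language family Fi first ...
    family, _ = family_clf(document)
    # ... then choose a language Lij among that family's candidates only.
    language, _ = within_family_clfs[family](document)
    return language


# Toy stand-ins for trained classifiers (labels and probabilities are made up).
def toy_language_clf(doc: str) -> Tuple[str, float]:
    return ("L11", 0.9) if "clear" in doc else ("L21", 0.4)


def toy_family_clf(doc: str) -> Tuple[str, float]:
    return ("F1", 0.8)


toy_within_family: Dict[str, Classifier] = {
    "F1": lambda doc: ("L12", 0.7),
    "F2": lambda doc: ("L22", 0.7),
}

confident = classify_native_language(
    "a clear case", toy_language_clf, toy_family_clf, toy_within_family, 0.6
)
uncertain = classify_native_language(
    "a hard case", toy_language_clf, toy_family_clf, toy_within_family, 0.6
)
print(confident, uncertain)  # prints "L11 L12"
```

In the toy run, the first document is classified with probability 0.9, above the threshold of 0.6, so its first-stage label is kept. The second is classified with probability 0.4, so it is routed through the family classifier and then relabeled by the classifier restricted to that family's languages.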