Stylometry - Seidenberg School of Computer Science and

CSIS
Stylometry
Projects, mostly Fall 2009 Project
Seidenberg School of Computer Science and Information Systems
Stylometry System
CSIS
Description of Project
Stylometry is the study of the unique linguistic styles
and writing behaviors of individuals in order to
determine authorship.
• Part I
  – Search to determine an interesting and unique application of stylometry for research
• Part II
  – Feasibility study on existing tools/applications for email authorship (250 words or less)
Existing / Potential Uses of Stylometry
• Music Lyrics
• Plagiarism
• Music Melody
• Social Networking
• Paintings
• Electronic Mail
• Literary Works
• Instant Messaging
• Forensic Linguistics
- Social networking, electronic mail, and instant messaging are still in early stages of study
Use Cases
- Twitter
  - Used to verify existing Twitter accounts and help mitigate impersonations
- Electronic mail
  - Implemented in a corporate setting, helping identify anonymous emails meant to do harm
- Chat
  - Assist in determining authorship of instant messages
  - Similar to Twitter but needs to be dynamic
Use Cases
- Terrorism
  - Help identify an author of terrorist content or identify terrorist content by using contextual analysis
  - Applied to blogs, forums, wikis, email, chat and other forms of digital content
Tools discovered
- JGAAP (Java Graphical Authorship Attribution Program)
- Signature Tool
- C# Tool
- StyleTool
- Blog stylometry tool
- Stylometry tool
Tools discovered
- JGAAP (Java Graphical Authorship Attribution Program)
  - Java-based tool
  - Runs on Windows and Linux
- Identification tool
  - 1-of-n decision: given many known email authors, determine the author of one unknown email
  - One unknown email author compared to 99 known email authors
  - 100 total tests run
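
The 1-of-n test loop described above can be sketched as a leave-one-out evaluation: each sample is held out once as the unknown and attributed to the author of its closest remaining sample. This is only an illustrative sketch and not how JGAAP works internally; the word-length profile, the Euclidean distance, and all function names are assumptions chosen for brevity.

```python
from collections import Counter
import math


def word_length_profile(text):
    """Relative frequency of each word length in the text."""
    lengths = [len(w) for w in text.split()]
    counts = Counter(lengths)
    total = len(lengths) or 1  # avoid division by zero on empty text
    return {length: n / total for length, n in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two sparse frequency profiles."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def leave_one_out_accuracy(samples):
    """samples: list of (author, text) pairs.

    Each sample is treated as the unknown once and attributed to the
    author of its nearest neighbour among the remaining samples.
    """
    profiles = [(author, word_length_profile(text)) for author, text in samples]
    correct = 0
    for i, (true_author, unknown) in enumerate(profiles):
        known = [p for j, p in enumerate(profiles) if j != i]
        guess, _ = min(known, key=lambda item: profile_distance(unknown, item[1]))
        correct += guess == true_author
    return correct / len(profiles)
```

With 100 samples this loop runs 100 tests, one per held-out email, and the returned fraction corresponds to an identification accuracy.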
Tools discovered
- C# Tool
  - Written in C# programming language
  - Developed by prior Pace CS graduate students
- Identification tool
  - 1-of-n decision: given many known email authors, determine the author of one unknown email
  - One unknown email author compared to 99 known email authors
  - 100 total tests run
Tools discovered
- Signature Tool
  - Written in C programming language (not confirmed)
  - Created by Peter Millican of Hertford College, Oxford
- Authentication tool
  - Either match / no match
  - Match testing: 9 known samples and 1 unknown sample (same author)
  - No-match testing: 10 known samples and 1 unknown sample (two different authors)
  - Total of 105 tests were run
Testing methodology
- Each team member submitted 20 (or 30) actual emails from 2 (3) different authors.
  - Total of 100 emails collected from 10 different authors
  - Removed from native program and saved as text files
  - Average email size: 195.7 words
- Different testing for identification and authentication tools
  - For authentication tool
    - False Accept Rate (FAR): rate at which a document is falsely attributed to an author
    - False Reject Rate (FRR): rate at which a document is not correctly attributed to its author
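
As a rough illustration of the two error rates, the sketch below computes FRR from same-author (match) trials and FAR from different-author (no-match) trials. The function names and the trial outcomes are hypothetical and are not the study's data.

```python
def false_reject_rate(match_outcomes):
    """match_outcomes: booleans from same-author (match) trials,
    True when the tool accepted the unknown sample."""
    rejected = sum(1 for accepted in match_outcomes if not accepted)
    return rejected / len(match_outcomes)


def false_accept_rate(no_match_outcomes):
    """no_match_outcomes: booleans from different-author (no-match) trials,
    True when the tool accepted the unknown sample."""
    accepted = sum(1 for outcome in no_match_outcomes if outcome)
    return accepted / len(no_match_outcomes)


# Hypothetical outcomes, for illustration only.
match_trials = [True] * 7 + [False] * 3      # 3 genuine samples rejected
no_match_trials = [True] * 2 + [False] * 8   # 2 impostor samples accepted

print(f"FRR = {false_reject_rate(match_trials):.2%}")
print(f"FAR = {false_accept_rate(no_match_trials):.2%}")
```

Under these definitions, accuracy on the match test is 1 minus FRR, and accuracy on the no-match test is 1 minus FAR.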
Testing Results
JGAAP (Levenshtein Distance algorithm)

                     Canonizers On   Canonizers Off
Words                50%             30%
Word Length          50%             30%
Characters           60%             40%
Syllables per Word   40%             30%
Word Bigrams         70%             60%

C# Tool Match Test
Accuracy: 57%
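
For readers unfamiliar with the Levenshtein distance named in the JGAAP results, the following is a minimal sketch of the standard dynamic-programming algorithm, applied here to word sequences. It is not JGAAP's implementation. The `canonize` helper is a hypothetical stand-in for a canonizer (lower-casing and whitespace normalization), and the choice of which canonizers the study enabled is not stated in the slides.

```python
def canonize(text):
    """Hypothetical canonizer: lower-case and collapse whitespace."""
    return " ".join(text.lower().split())


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def word_distance(text_a, text_b, use_canonizer=True):
    """Levenshtein distance between the word sequences of two texts."""
    if use_canonizer:
        text_a, text_b = canonize(text_a), canonize(text_b)
    return levenshtein(text_a.split(), text_b.split())
```

The same `levenshtein` function works on strings (character events) or on lists of words, which is why normalizing case before comparison can change the measured distance between two emails.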
Signature Tool Match Test

Events        Match Accuracy   FRR
Word Length   53.33%           46.67%
Letters       46.67%           53.33%

Signature Tool No-Match Test

Events        No-Match Accuracy   FAR
Word Length   53.33%              46.67%
Letters       82.22%              17.78%

Categorizing the result based on the country of the author

Tool        India    USA      India    USA
JGAAP       50%      100%     NA       NA
Signature   61.11%   75.00%   81.48%   83.33%
C# Tool     42%      80.00%   NA       NA
Earlier Study’s Features – 20 of 55
1. Number of sentences beginning with upper case
2. Number of sentences beginning with lower case
3. Number of Words
4. Average Word Length
5. Number of Sentences
6. Average Number of Words per Sentence
7. Number of Paragraphs
8. Average Number of words per Paragraph
9. Number of Exclamation Marks
10. Number of Number Signs
11. Number of Dollar Signs
12. Number of Ampersands
13. Number of Percent Signs
14. Number of Apostrophes
15. Number of Left parentheses
16. Number of Right parentheses
17. Number of Asterisks
18. Number of Plus Signs
19. Number of Commas
20. Number of Dashes
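
Several of the features listed above are simple counts that can be sketched in a few lines. The example below is an assumption-laden illustration and not the earlier study's code: the function name, the dictionary keys, and the naive sentence and paragraph splitting rules are all hypothetical choices.

```python
import re


def extract_features(text):
    """Compute a few of the listed features for one email body."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Naive sentence split: ., ! or ? followed by whitespace (an assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p]
    return {
        "sentences_upper_start": sum(s[0].isupper() for s in sentences),
        "sentences_lower_start": sum(s[0].islower() for s in sentences),
        "num_words": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "num_sentences": len(sentences),
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "num_paragraphs": len(paragraphs),
        "num_exclamation_marks": text.count("!"),
        "num_commas": text.count(","),
        "num_dashes": text.count("-"),
    }
```

Each email then becomes a fixed-length numeric vector, which is the form a distance-based identification or authentication test would consume.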
Conclusion
- Overall, the moderate accuracy of the test results suggests that none of the tools evaluated is capable of accurate stylometric email author identification.
- Categorizing email samples by country of origin seems to yield better accuracy results for all three tools tested.
Recommendations
- Further testing and research using email from authors of different countries
- Continue to refine and add to the stylistic feature set created by prior Pace graduate students
  - Include new features becoming more prevalent in digital content, e.g. emoticons, hyperlinks
  - Internet slang: BRB, LOL, TTYL
- Consideration for people who wish to disguise their identity needs to be addressed and researched further
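
The recommended new features (emoticons, hyperlinks, Internet slang) could be counted along the following lines. This is a sketch under stated assumptions: the regular expressions are deliberately small and will miss many real emoticons and link forms, and the slang set contains only the three terms named above.

```python
import re

# Hypothetical, deliberately small patterns; a real feature set would need more.
EMOTICON = re.compile(r"[:;=][-']?[)(DPp]")
HYPERLINK = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
SLANG = {"brb", "lol", "ttyl"}


def digital_style_features(text):
    """Count hyperlinks, emoticons and slang terms in one message."""
    # Remove links first so their characters are not counted as emoticons or words.
    without_links = HYPERLINK.sub(" ", text)
    tokens = re.findall(r"[A-Za-z]+", without_links.lower())
    return {
        "num_hyperlinks": len(HYPERLINK.findall(text)),
        "num_emoticons": len(EMOTICON.findall(without_links)),
        "num_slang": sum(token in SLANG for token in tokens),
    }
```

These counts could be appended to an existing per-email feature vector without changing how the other features are computed.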