here - DIMACS REU

Determining Common Authorship Among Documents
Paul Bonamy
Mentor: Dr. Paul Kantor
Author Identification & Common Authorship







Author Identification: “Who wrote this?”
Mosteller/Wallace, 1964 – The Federalist
12 disputed papers attributed to Madison
Generally utilizes statistical analysis
Common Authorship: “Do these share an author?”
Does not (necessarily) require statistics/training
Useful for detecting forgeries, etc.
BMR/BXR





Implements Bayesian Multinomial Regression
Used to perform 1-of-k classification
BMRtrain accepts feature vectors, outputs assignment model
BMRclassify accepts model & vectors, outputs assignments
Can output author probability vectors
Bayesian Analysis
Bayes' Theorem
$$P(H_0 \mid E) = \frac{P(E \mid H_0)\, P(H_0)}{P(E)}$$


Consider two match boxes
Probability of Box 1, given black marble?

H0 = We have Box 1, E = We see a black marble
$$P(E \mid H_0) = .5, \quad P(H_0) = .5, \quad P(E) = .75$$
$$P(H_0 \mid E) = \frac{.5(.5)}{.75} = \frac{.25}{.75} = \frac{1}{3}$$
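The matchbox arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation on this slide: the function name `posterior` is made up for illustration, and exact fractions are used so the result prints as 1/3 rather than a rounded decimal.

```python
from fractions import Fraction


def posterior(likelihood, prior, evidence):
    """Bayes' Theorem: P(H0 | E) = P(E | H0) * P(H0) / P(E)."""
    return likelihood * prior / evidence


# Values taken from the slide.
p_e_given_h0 = Fraction(1, 2)  # P(E | H0): black marble, given Box 1
p_h0 = Fraction(1, 2)          # P(H0): we have Box 1
p_e = Fraction(3, 4)           # P(E): we see a black marble

p_h0_given_e = posterior(p_e_given_h0, p_h0, p_e)
print(p_h0_given_e)  # 1/3
```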
Bayesian Analysis in BMR

Bayes’ Theorem Extendable to P(C|F1…FN)



C is a class
F1…FN are features
Effectively applies Bayes’ Theorem to itself
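Written out, the extension to several features is the same theorem with the evidence replaced by the feature set. The second line is one possible reading of "applies Bayes' Theorem to itself" (an assumption, not something the slides spell out): the posterior after the first features serves as the prior when the next feature is added.

```latex
P(C \mid F_1, \ldots, F_N) = \frac{P(F_1, \ldots, F_N \mid C)\, P(C)}{P(F_1, \ldots, F_N)}

P(C \mid F_1, \ldots, F_k) = \frac{P(F_k \mid C, F_1, \ldots, F_{k-1})\, P(C \mid F_1, \ldots, F_{k-1})}{P(F_k \mid F_1, \ldots, F_{k-1})}
```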
BMR/BXR Workflow
Data (Doc Corpus) → Test/Train Splitter → Training Set, Testing Set
Training Set, Testing Set → Feature Extractor → Feature Vectors
Feature Vectors → BMRtrain → Model
Model + Feature Vectors → BMRclassify → Author Identification, Author Probabilities
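The splitter step of the workflow could look roughly like the sketch below. The slides do not say how the split was made, so the random shuffle, the `split_corpus` name, the seed, and the 50% training fraction are all assumptions for illustration only.

```python
import random


def split_corpus(doc_ids, train_fraction, seed=0):
    """Shuffle document ids and split them into (training, testing) lists.

    Hypothetical helper: the actual Test/Train Splitter is not described.
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is repeatable
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Toy usage: 130 document ids, half for training (fraction is an assumption).
train, test = split_corpus(range(130), train_fraction=0.5)
print(len(train), len(test))
```

Changing `train_fraction` or `seed` gives a different data split.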
Corpus Construction



Articles from 2006-07 issues of The Compass Newspaper
16 Authors
130 Documents


300 - 500 Words: 69
500+ Words: 61

Varied Topics

On Friday, November 3, LSSU experienced its first closing of the
semester due to inclement weather. The Soo Evening News
reported a “number of minor mishaps,” and “slippery-road induced
mishaps,” including two crashes near the campus of LSSU. All
classes before 10 AM were canceled because of the snow and ice
that had accumulated overnight, but many students arrived for
classes as usual, unaware of the cancellation. …
Feature Extraction


Perl script using Lingua::EN::Tagger
Selects words, part-of-speech (POS), or both (wordPOS)
wordPOS examples: address/VB, address/NN
Used wordPOS in common authorship study
Returns vector of feature frequencies
4:9.0 16:5.0 22:4.0 23:2.0 28:5.0 29:1.0 33:4.0
36:9.0 38:1.0 41:3.0 46:13.0 56:2.0 …
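The `index:frequency` format above can be produced and read back with a short Python sketch. The original extractor is a Perl script; this is not that script. The helper names and the feature ids assigned to `address/VB` and `address/NN` are hypothetical.

```python
from collections import Counter


def count_features(tagged_tokens, feature_index):
    """Count word/POS tokens and return {feature id: frequency}."""
    counts = Counter(tagged_tokens)
    return {feature_index[tok]: float(n)
            for tok, n in counts.items() if tok in feature_index}


def format_sparse(vector):
    """Render {id: frequency} as 'id:freq id:freq ...', sorted by id."""
    return " ".join(f"{i}:{v}" for i, v in sorted(vector.items()))


def parse_sparse(text):
    """Parse 'id:freq id:freq ...' back into {id: frequency}."""
    vector = {}
    for field in text.split():
        index, value = field.split(":")
        vector[int(index)] = float(value)
    return vector


# Hypothetical feature ids; the real id assignment is not given in the slides.
feature_index = {"address/VB": 4, "address/NN": 16}
tokens = ["address/VB", "address/NN", "address/NN"]

line = format_sparse(count_features(tokens, feature_index))
print(line)                                  # 4:1.0 16:2.0
print(parse_sparse("4:9.0 16:5.0 22:4.0"))   # {4: 9.0, 16: 5.0, 22: 4.0}
```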
Author Probability Vectors




Produced by BMR/BXR upon request
Probability that the doc belongs to each author in the training set
Not normalized (sum not necessarily 1)
0.17% 0.68% 9.13% 8.90% 2.42% 0.94%
10.55% 0.32% 0.72% 36.95% 0.31% 0.50%
0.48% 22.08% 1.34% 4.52%
Computed With Features




Start with feature vectors
Select all distinct pairs of vectors
Compute dot product and Euclidean distance
Sort data


Descending by dot product
Ascending by Euclidean distance
Computed With Authors




Start with author probability vectors
Select all distinct pairs of vectors
Compute dot product and Euclidean distance
Sort data


Descending by dot product
Ascending by Euclidean distance
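Both procedures follow the same steps, which can be sketched in Python as below. The sketch assumes dense vectors of equal length (a sparse feature vector would first be expanded to a common length); the function names and the three toy vectors are illustrative, not data from the study.

```python
import math
from itertools import combinations


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def score_pairs(vectors):
    """vectors: {doc id: list of floats}.

    Returns one (doc_a, doc_b, dot product, distance) tuple per distinct pair.
    """
    return [(a, b, dot(vectors[a], vectors[b]), euclidean(vectors[a], vectors[b]))
            for a, b in combinations(sorted(vectors), 2)]


# Toy vectors, made up for illustration.
vectors = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
pairs = score_pairs(vectors)

by_dp = sorted(pairs, key=lambda p: p[2], reverse=True)  # descending by dot product
by_dist = sorted(pairs, key=lambda p: p[3])              # ascending by Euclidean distance
```

With n documents this yields n(n-1)/2 pairs, so the pair list grows quadratically with corpus size.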
What Are We Looking For?

Dot product (DP) measures similarity; Euclidean distance measures distance



Computed distances between vectors
Sorted from closest to furthest
Docs by same author are close together

Docs by different authors far apart
Same Auth?  Doc #  Auth #  Doc #  Auth #  DP     Euclid
1           5      2       6      2       0.756  28.302
0           2      0       27     9       0.702  30.116
0           5      2       32     13      0.711  30.133
1           32     13      33     13      0.771  30.381
0           6      2       32     13      0.729  30.708
ROC Curve


Shows fraction of not-pairs (different-author pairs) versus fraction of pairs (same-author pairs)
Area under curve indicates model accuracy
Higher is better
Curve shown: Euclidean distance of feature vectors
This curve: 64.7% of area under curve
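The area under the ROC curve can be computed directly from scored pairs: it equals the fraction of (same-author, different-author) pair combinations in which the same-author pair is ranked as closer. The sketch below is one standard way to compute it and is not necessarily how the figures in this talk were produced. It uses only the five example rows shown earlier, so its output is not one of the reported results.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a same-author pair, 0 otherwise.
    scores: higher means "more likely same author". Ties count half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# The five example rows from the table of sorted pairs.
same_author = [1, 0, 0, 1, 0]
dp = [0.756, 0.702, 0.711, 0.771, 0.729]
euclid = [28.302, 30.116, 30.133, 30.381, 30.708]

auc_dp = auc(same_author, dp)
# Smaller distance means closer, so negate it to get a "higher is better" score.
auc_euclid = auc(same_author, [-d for d in euclid])
print(auc_dp, auc_euclid)
```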
Can We Improve This?
          Euclid  Dot
Features  64.7%   65.2%
Authors   78.6%   83.3%
Results for Other Data Splits
Chart: Analysis vs. Area Under ROC Curve (x-axis: Analysis Type; y-axis: Area Under ROC Curve, 60.0% to 100.0%)

Data split       Features Euclid  Features DP  Author Euclid  Author DP
33.33% Accurate  73.5%            69.9%        95.1%          95.3%
38.10% Accurate  77.8%            65.7%        69.9%          75.2%
56.40% Accurate  64.7%            65.2%        78.6%          83.3%
80.00% Accurate  65.0%            77.0%        88.3%          92.0%
Analyzing Other Corpora

Obtained second corpus
9377 Documents
24 Authors
Results similar to those on Compass dataset

          Euclid  Dot
Features  55.2%   59.5%
Authors   79.7%   84.5%
Open Questions


Are Area Under Curve variations significant?
How does Author ID model accuracy affect same-author accuracy?
A low Author-ID accuracy model did very well
Can we reduce memory/processing requirements?