here - DIMACS REU

Determining Common Authorship Among Documents
Paul Bonamy
Mentor: Dr. Paul Kantor

Author Identification & Common Authorship
- Author Identification: “Who wrote this?”
  - Mosteller/Wallace, 1964 – The Federalist
  - 12 disputed papers attributed to Madison
  - Generally utilizes statistical analysis
- Common Authorship: “Do these share an author?”
  - Does not (necessarily) require statistics/training
  - Useful for detecting forgeries, etc.

BMR/BXR
- Implements Bayesian Multinomial Regression
- Used to perform 1-of-k classification
- BMRtrain accepts feature vectors, outputs assignment model
- BMRclassify accepts model & vectors, outputs assignments
- Can output author probability vectors

Bayesian Analysis
- Bayes' Theorem
  $$P(H_0 \mid E) = \frac{P(E \mid H_0)\,P(H_0)}{P(E)}$$
- Consider two match boxes
- Probability of Box 1, given black marble?
  - $H_0$ = We have Box 1, $E$ = We see a black marble
  - $P(E \mid H_0) = .5,\quad P(H_0) = .5,\quad P(E) = .75$
  $$P(H_0 \mid E) = \frac{.5(.5)}{.75} = \frac{.25}{.75} = \frac{1}{3}$$

Bayesian Analysis in BMR
- Bayes’ Theorem extendable to P(C|F1…FN)
  - C is a class
  - F1…FN are features
- Effectively applies Bayes’ Theorem to itself

BMR/BXR Workflow
[Workflow diagram. Node labels: Data (Doc Corpus); Test/Train Splitter; Training Set; Testing Set; Feature Extractor; Feature Vectors; BMRtrain; Model; BMRclassify; Author Identification; Author Probabilities]

Corpus Construction
- Articles from 2006-07 issues of The Compass Newspaper
- 16 Authors
- 130 Documents
  - 300 - 500 Words: 69
  - 500+ Words: 61
- Varied Topics

Sample article text:
On Friday, November 3, LSSU experienced its first closing of the semester due to inclement weather. The Soo Evening News reported a “number of minor mishaps,” and “slippery-road induced mishaps,” including two crashes near the campus of LSSU. All classes before 10 AM were canceled because of the snow and ice that had accumulated overnight, but many students arrived for classes as usual, unaware of the cancellation.
…

Feature Extraction
- Perl script using Lingua::EN::Tagger
- Selects words, part-of-speech (POS), or both (wordPOS)
  - address/VB
  - address/NN
- Used wordPOS in common authorship study
- Returns vector of feature frequencies
  4:9.0 16:5.0 22:4.0 23:2.0 28:5.0 29:1.0 33:4.0 36:9.0 38:1.0 41:3.0 46:13.0 56:2.0 …

Author Probability Vectors
- Produced by BMR/BXR upon request
- Probability doc belongs to each author in the training set
- Not normalized (sum not necessarily 1)
  0.17% 0.68% 9.13% 8.90% 2.42% 0.94% 10.55% 0.32% 0.72% 36.95% 0.31% 0.50% 0.48% 22.08% 1.34% 4.52%

Computed With Features
- Start with feature vectors
- Select all distinct pairs of vectors
- Compute dot product and Euclidean distance
- Sort data
  - Descending by dot product
  - Ascending by Euclidean distance

Computed With Authors
- Start with author probability vectors
- Select all distinct pairs of vectors
- Compute dot product and Euclidean distance
- Sort data
  - Descending by dot product
  - Ascending by Euclidean distance

What Are We Looking For?
- DP and Euclidean distance measure distance
  - Computed distances between vectors
  - Sorted from closest to furthest
- Docs by same author are close together
- Docs by different authors far apart

Same Auth?   Doc #   Auth #   Doc #   Auth #   DP      Euclid
1            5       2        6       2        0.756   28.302
0            2       0        27      9        0.702   30.116
0            5       2        32      13       0.711   30.133
1            32      13       33      13       0.771   30.381
0            6       2        32      13       0.729   30.708

ROC Curve
- Shows fractions of not-pairs versus fraction of pairs
- Area under curve indicates model accuracy
  - Higher is better
- Euclidean distance of feature vector
- This curve: 64.7% of area under curve

Can We Improve This?

           Euclid   Dot
Features   64.7%    65.2%
Authors    78.6%    83.3%

Results for Other Data Splits

Analysis vs.
Area Under ROC Curve
[Bar chart. Y-axis: Area Under ROC Curve, 60.0% to 100.0%. X-axis: Analysis Type.]

Data split        Features Euclid   Features DP   Author Euclid   Author DP
33.33% Accurate   73.5%             69.9%         95.1%           95.3%
38.10% Accurate   77.8%             65.7%         69.9%           75.2%
56.40% Accurate   64.7%             65.2%         78.6%           83.3%
80.00% Accurate   65.0%             77.0%         88.3%           92.0%

Analyzing Other Corpora
- Obtained second corpus
  - 9377 Documents
  - 24 Authors
- Results similar to those on Compass dataset

           Euclid   Dot
Features   55.2%    59.5%
Authors    79.7%    84.5%

Open Questions
- Are Area Under Curve variations significant?
- How does Author ID model accuracy affect same-author accuracy?
  - A low Author-ID accuracy model did very well
- Can we reduce memory/processing requirements?
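
Illustrative sketch (not the code used in the study): the pairwise comparison described under "Computed With Features" and the area-under-curve figure from the "ROC Curve" slide can be approximated in a few lines of Python. The function names (parse_sparse, pair_scores, auc) and the four toy documents are hypothetical. The sketch uses the raw dot product, since the slides do not say whether vectors were normalized first, and it computes the area with the standard rank-based formulation: the fraction of (same-author pair, different-author pair) combinations in which the same-author pair is ranked closer, counting ties as half.

```python
from itertools import combinations
from math import sqrt

def parse_sparse(line):
    """Parse 'index:value' tokens (the format shown under Feature Extraction) into a dict."""
    return {int(i): float(v) for i, v in (tok.split(":") for tok in line.split())}

def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(v * b[i] for i, v in a.items() if i in b)

def euclid(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    return sqrt(sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in set(a) | set(b)))

def pair_scores(docs):
    """docs: list of (author, vector). Returns one row per distinct pair of documents."""
    return [
        {"same": a1 == a2, "dp": dot(v1, v2), "euclid": euclid(v1, v2)}
        for (a1, v1), (a2, v2) in combinations(docs, 2)
    ]

def auc(rows, key, higher_is_closer):
    """Rank-based area under the ROC curve for one measure (ties count half)."""
    sign = 1.0 if higher_is_closer else -1.0
    pos = [sign * r[key] for r in rows if r["same"]]
    neg = [sign * r[key] for r in rows if not r["same"]]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy corpus: two authors, two documents each.
docs = [
    ("A", parse_sparse("1:3.0 2:1.0")),
    ("A", parse_sparse("1:2.0 2:1.0")),
    ("B", parse_sparse("3:4.0 4:1.0")),
    ("B", parse_sparse("3:3.0 4:2.0")),
]
rows = pair_scores(docs)
rows_by_dp = sorted(rows, key=lambda r: r["dp"], reverse=True)   # descending by dot product
rows_by_euclid = sorted(rows, key=lambda r: r["euclid"])         # ascending by Euclidean distance
auc_dp = auc(rows, "dp", higher_is_closer=True)
auc_euclid = auc(rows, "euclid", higher_is_closer=False)
```

The same functions would apply unchanged to author probability vectors ("Computed With Authors"), since they only assume each document is a mapping from index to value.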
