Gramsci’s authorship attribution and anonymous newspaper articles

Gramsci’s authorship attribution
of anonymous newspaper articles
Maurizio Lana
Histoire et informatique
Textométrie des sources historiques
6.6.2014
who we are
• Maurizio Lana
• Mirko Degli Esposti
• Emanuele Caglioti
• Dario Benedetto
• 1 scholar and 3 mathematical physicists
it’s always data
• the analysis of digitized physical-world
phenomena works equally well on
• CT imaging,
• songs,
• ECG,
• texts,
• …
reason for the study
• national edition of Gramsci’s works, by Ministero
dei Beni Culturali
• new work on the newspaper articles
• many anonymous newspaper articles in the
journals and newspapers Gramsci wrote for:
Il Grido del Popolo, Avanti!, La Città Futura
• request from the Fondazione Gramsci to restart
the study of the anonymous articles, to find
new evidence of Gramsci’s writings
• this was in 2005
a little background
• it starts in 1847: V.J. Bunjakovskij, On the possibility to apply
determining measures of confidence to the results of some
observing sciences, particularly statistics
• 1897-98, W. Lutosławski, “On Stylometry”; “Principes de
stylométrie”
• 1959, D. R. Cox and L. Brandwood, On a discriminatory problem
connected with the works of Plato
• 1962, A. Ellegård, Who was Junius?
• 1964, F. Mosteller and D. Wallace Inference and Disputed
Authorship: The Federalist
• 1978, A. Kenny, The Aristotelian ethics: a study of the relationship
between the Eudemian and Nicomachean ethics of Aristotle
• 1980, J.P. Benzécri Pratique de l’analyse des données
• 1987, J. F. Burrows, Word Patterns and Story Shapes: The Statistical
Analysis of Narrative Style, “LLC”, 2, 1987, pp. 61-70
in common…
• … they all work at the level of words
the turning point
• G. Ledger, Re-counting Plato: A Computer Analysis of
Plato’s Style, Oxford, Clarendon Press, 1989
• what is counted:
words containing a specified letter;
words ending in a specified letter;
words with a specified letter as penultimate
• that is, semantically and linguistically meaningless parts
of words
• “I have departed from the traditional approach of
stylometry by ignoring entirely meanings and
grammatical functions, measuring instead the
frequencies of words according to their orthographic
content”
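The three orthographic counts listed above can be sketched as follows. This is only an illustration: the function name `orthographic_profile`, the whitespace tokenization, and the normalization by word count are assumptions, not Ledger's actual procedure.

```python
import string
from collections import Counter

def orthographic_profile(text):
    """Share of words that contain, end in, or have as penultimate each letter.

    Assumption: words are whitespace-separated tokens with ASCII punctuation
    stripped from both ends; shares are relative to the number of words.
    """
    words = [t.strip(string.punctuation).lower() for t in text.split()]
    words = [w for w in words if w]
    contains, ends, penultimate = Counter(), Counter(), Counter()
    for w in words:
        for ch in set(w):              # count each letter once per word
            if ch.isalpha():
                contains[ch] += 1
        if w[-1].isalpha():
            ends[w[-1]] += 1
        if len(w) >= 2 and w[-2].isalpha():
            penultimate[w[-2]] += 1
    total = len(words)
    return {
        "contains": {ch: c / total for ch, c in contains.items()},
        "ends": {ch: c / total for ch, c in ends.items()},
        "penultimate": {ch: c / total for ch, c in penultimate.items()},
    }
```

Two texts can then be compared letter by letter on these shares, with no reference to what the words mean.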
today, for me (for us)
• the key is:
a latent mathematical structure of the text
• from: L. Doležel, A note on quantification in
text theory, in: “Text Processing”, S. Allén ed.,
Stockholm, 1982, pp. 539-552
• an expression of the idea: D. Khmelev, F.
Tweedie, Using Markov chains for
identification of writers, “LLC”, 16, 4, 2001,
pp. 299-307
today, for me (for us)
• another expression: D. Benedetto, E. Caglioti,
V. Loreto et al., Language Trees and Zipping,
“Phys. Rev. Lett.” 88, n. 4, 048702-1, 048702-4
(2002)
• take one text and compress it with Zip;
• then take another text and compress it with
the compression dictionary of the first one;
• measure the difference in size: this is a
measure of the relative entropy
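A minimal sketch of the zipping idea, assuming Python's `zlib` as the compressor. Here the second text is appended to the first and the growth in compressed size is measured, which approximates "compressing with the dictionary of the first one". The function names and the per-byte normalization are assumptions, and zlib's 32 KB window limits how much of the first text can actually be reused.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))

def zip_distance(reference: str, sample: str) -> float:
    """Extra compressed bytes per byte of `sample` once `reference` is known.

    Lower values mean the sample is better predicted by the reference text.
    """
    ref = reference.encode("utf-8")
    smp = sample.encode("utf-8")
    if not smp:
        raise ValueError("sample must not be empty")
    extra = compressed_size(ref + smp) - compressed_size(ref)
    return extra / len(smp)
```

An anonymous text would be scored against each candidate author's known texts; the smallest value points to the closest author.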
then came the AAAC
• in 2004 the American mathematician Patrick Juola
proposed the Ad-hoc Authorship Attribution
Competition to find experimentally the best
method for correctly attributing anonymous works:
http://www.mathcs.duq.edu/~juola/authorship_contest.html
• second best scorer Vlado Keselj, with a method
based on measurements of n-grams frequencies
the state of the QAA world in 2005
• in 2002 Jack Grieve, in his thesis
“Quantitative Authorship Attribution: A
History And An Evaluation Of Techniques”,
counted at least 39 known and used methods,
with 93 variants, for Quantitative AA
• the aim of the AAAC: prune the useless methods
• nevertheless: this continues to be not science
but craftsmanship
in 2005 we started
• we had to prove to the Fondazione Gramsci
that the Quantitative AA produced good
results
• we chose to use two QAA methods:
– relative entropy (already described)
– n-gram distances (which earned Keselj 2nd place
in the AAAC)
the protocol
• phase 1: 50 surely Gramscian texts; 50 surely
non-Gramscian texts;
– do whatever you like to be able to recognize the
Gramscian as Gramscian and the non-Gramscian
as non-Gramscian
• phase 2 (blind test): 40 unidentified texts,
some Gramscian and some not: classify them
correctly
text preparation
• deletion of:
– citations of any length
– proper nouns
– numbers
• no lemmatization: e.g. the choice of a given
tense and person of a verb carries some
information, and we cannot evaluate it well
enough to justify discarding it
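The deletions above could be sketched like this. It is an illustration only: it assumes citations are marked by quotation marks and that proper nouns come from a hand-made list (`proper_nouns` is a hypothetical input). Nothing here is lemmatized, so verb forms stay as written.

```python
import re
import string

# Assumption: a citation is any passage enclosed in "…", “…” or «…».
QUOTED = re.compile(r'"[^"]*"|“[^”]*”|«[^»]*»')
NUMBER = re.compile(r"\d+")

def prepare(text, proper_nouns):
    """Drop quoted passages, numbers and listed proper nouns; keep word forms."""
    text = QUOTED.sub(" ", text)
    text = NUMBER.sub(" ", text)
    kept = [tok for tok in text.split()
            if tok.strip(string.punctuation) not in proper_nouns]
    return " ".join(kept)
```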
n-grams
• sequences of n units of a kind you must choose (we chose
characters)
• sliding n-grams: in “final” a 3-gram reads fin, ina, nal
• to find the right n you do tests
• n-grams capture fragments of meaning, syntax,
collocations/cooccurrences, etc.
• you have a dictionary of Gramscian n-grams
• you check the n-grams of your anonymous texts; you
count the matches and the non-matches and take their
algebraic sum: if positive the text is Gramscian, if
negative it is not
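The sliding n-grams and the signed count can be sketched as below. The function names are illustrative, and the plain "matches minus non-matches" score is a simplified reading of the slide, not the full distance used in the project.

```python
def char_ngrams(text, n):
    """Sliding character n-grams: 'final' with n=3 gives fin, ina, nal."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_dictionary(texts, n):
    """Set of all n-grams seen in the known (e.g. Gramscian) texts."""
    dictionary = set()
    for t in texts:
        dictionary.update(char_ngrams(t, n))
    return dictionary

def signed_score(text, dictionary, n):
    """Matches minus non-matches; positive suggests the dictionary's author."""
    grams = char_ngrams(text, n)
    matches = sum(1 for g in grams if g in dictionary)
    return matches - (len(grams) - matches)
```

The value of n is left as a parameter because, as noted above, the right n is found by testing.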
strategy
• maximize the correct attributions
• while at the same time avoiding false attributions
• = some missed attributions are ok if you don’t
produce false attributions
• your client must be able to trust you
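One way to express this trade-off in code, as an assumption rather than the procedure actually used: on the texts of known authorship, set the cut-off at the highest score reached by any non-Gramscian text, so that no known non-Gramscian text is attributed, and accept whatever Gramscian texts are missed as a result.

```python
def evaluate(author_scores, other_scores):
    """Pick a cut-off with zero false attributions on the known texts.

    Returns (cutoff, found, missed), where `found` and `missed` count the
    known author texts scoring above / not above the cut-off.
    Hypothetical helper: the scores can come from any attribution measure.
    """
    cutoff = max(other_scores)         # no non-author text exceeds this
    found = sum(1 for s in author_scores if s > cutoff)
    missed = len(author_scores) - found
    return cutoff, found, missed
```

The cut-off is chosen on the training texts only; the blind test then shows whether it holds on unseen ones.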
strategy 2
• we don’t know if, how, and how much the
“parole” of an author changes across matters,
audience, genre, time, …
• so we decided to work on well-defined
periods, leaving it to the Gramsci experts to
set their boundaries
• 1st period: 1914-1921
a little maths
• having two methods at work, we could build a
Cartesian plane, where the results of the
measures were plotted after normalizing
them to the range -1 to +1
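A sketch of the plane, assuming a simple min-max rescaling (the slide does not say which normalization was used): each text becomes one point whose coordinates are its two rescaled scores.

```python
def rescale(values):
    """Linearly map values to the range -1 to +1 (min-max, an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # all values equal: place them at 0
        return [0.0 for _ in values]
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]

def to_plane(entropy_scores, ngram_scores):
    """One (x, y) point per text: x from relative entropy, y from n-grams."""
    return list(zip(rescale(entropy_scores), rescale(ngram_scores)))
```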
phase 1 - setup
phase 2 – blind test
the day after
• we started to do the attributions (paid by the
Fondazione Gramsci for it) without knowing anything
about the texts, and giving periodic reports to the
historians who were editors of the various volumes of
the national edition of Gramsci’s works
• we got the texts, normalized them, measured them,
and produced a report that we sent to the Fondazione Gramsci
• the historians’ evaluation of the QAA: no proposed
attribution was unacceptable, even if not every
proposed attribution was accepted
• [example of report]
now we have stopped
• due to the cuts to research funds, the national
edition is now on hold
some practical principles on AA
• no tool can ‘read’ a text and tell you: this text was
written by Francesco Stella
• you can only classify the texts you chose to work
on, crunched by the tool you use
• all of the texts will be connected: you must
interpret the results
• you must mix anonymous or disputed works with
“control works”: same period, same genre, same
language, same author, similar authors, …
be careful
• when you have proper nouns in your works,
it’s easy to classify them:
• R. Clement and D. Sharp, Ngram and Bayesian
Classification of Documents for Topic and
Authorship, “LLC”, 2003, 18(4):423-447
• but you don’t really classify the texts, you
classify the collections of proper nouns they
contain
why the Gramsci case was/is difficult
and strange
• articles are very short: between 300 and
1000/1200 words
• all of these articles share: matters, ideology,
context
• there is no countercheck, and you work for a
scientific and productive initiative (it’s not
‘simply’ an experiment)
• the tables showing the matches are sparse
tables, nevertheless these data work well
now what
• Patrick Juola, the mathematician who
proposed the AAAC, has released JGAAP, a
package offering various tools for QAA:
• http://evllabs.com/jgaap/w/index.php/
• the R package stylo is impressive and I
wish we had had it when we started our work on
Gramsci’s texts
some references to start from
• C. Basile, D. Benedetto, E. Caglioti, M. Degli Esposti, An
example of mathematical authorship attribution,
“Journal Of Mathematical Physics”, 2008, 49, pp. 1 – 20
• C. Basile, D. Benedetto, E. Caglioti, M. Degli Esposti,
L'attribuzione dei testi gramsciani: metodi e modelli
matematici, “La Matematica nella Società e nella
Cultura”, 2010, 3, pp. 235 – 269
• M. Lana, Come scriveva Gramsci? Metodi matematici per
riconoscere scritti gramsciani anonimi, “Informatica
Umanistica”, 2010, 3, 31-56
some references (2)
• M. Lana, Individuare scritti gramsciani anonimi in un
“corpus” giornalistico. Il ruolo dei metodi quantitativi,
“Studi storici: rivista trimestrale dell'Istituto Gramsci”,
52 (4), 859-880
• P. Juola, Authorship Attribution, “Foundations and
Trends in Information Retrieval”, Vol. 1, No. 3 (2006)
233–334
http://www.conll.org/~walter/educational/material/fnt-aa.pdf
• J. Grieve, Quantitative Authorship Attribution: An
Evaluation of Techniques, LLC 22: 251-270
http://dl.dropboxusercontent.com/u/99161057/Grieve_authorshipattribution.pdf
thanks!