A Description of Ethan’s Slides

Slide 9: Whose words are in your bag of words?
Topic modeling means you’re dealing with words that have to be online and machine-readable. Who
the actual editor is behind the editions that you are dealing with online is not always clear
(e.g., publication information is often not provided on Project Gutenberg editions).
Slide 10: Harkness
We’re working with novels, and it seems like it probably wouldn’t matter whether a few words
were misspelled in a novel; because there are so many more words, a few mix-ups can feel like
a matter of percentages. However, the bibliographical community agrees (this article by Bruce
Harkness is one representative) that textual scholarship on the publication of novels is just
as important as it is for other genres.
Slide 11/12: Here is the issue at hand
In this diagram we’ve outlined the issue at hand: something happens between the author
writing a work and that work ending up in our hands as a book or in an online edition à la
Gutenberg. The problem is that there are almost always mix-ups, errors, complications, or other
problematic occurrences between the writing and the text we end up holding. This is a
problem even when we’re looking at just one edition; when we’re doing this with dozens or
hundreds of novels in a topic model like this, the problem is greatly compounded.
Slide 13: Overwhelmed
Don’t worry! And don’t feel like your whole model will be ruined by this. Even though this can
be a huge problem, we’re not sure how important it is to topic modeling given some of the
things topic modeling takes for granted.
Slide 14: Kinds of editions
In this image, G.T. Tanselle shows the kinds of editions that scholars can try to make, each
with different principles behind it. Some seek to recreate a previous historical document exactly,
all errors intact; others try to create an ideal eclectic text based on what they think of as the
author’s intention, introducing some changes that may not be present in any existing edition;
still others don’t care about history at all, and their editors/publishers introduce changes
based on something like aesthetic preference.
All of these are here for us to think about the principles behind editions and whose words you’re
getting in your bag of words: words from authors, publishers, modern editors, and so on. This
introduces a number of problems: how do the principles behind different editions stack up?
What if we are mixing in contemporary editorial words with historical authorial ones?
Frequently, the principles behind different editorial methods do not mix with one another.
Slide 15/16: Case study of the Gabler edition of Ulysses
The Gabler edition of Ulysses has a particularly thorny history, but this is a text we have in our
corpus. We have used the edition from Project Gutenberg, but as you can see, what we found
in our edition was very different from many other editions at one pivotal moment: it adds
multiple sentences. So whose words are these? Are they Joyce’s, or Gabler’s? Does it matter that
they were from a different manuscript? Do we care for the purposes of a model?
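One way to spot this kind of difference across editions is to compare them sentence by sentence. The following is only a minimal sketch using Python’s standard difflib module; the function names, the naive sentence splitter, and the two sample passages are all hypothetical placeholders (they are not quotations from any edition of Ulysses), and this is not how the difference on the slide was actually found.

```python
import difflib
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def added_sentences(base_text, other_text):
    """Return sentences in other_text that have no match in base_text."""
    base = split_sentences(base_text)
    other = split_sentences(other_text)
    matcher = difflib.SequenceMatcher(a=base, b=other, autojunk=False)
    added = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' both mark stretches of other_text
        # that do not line up with anything in base_text.
        if tag in ("insert", "replace"):
            added.extend(other[j1:j2])
    return added


# Placeholder passages standing in for the same moment in two editions.
edition_a = "He looked up. She said nothing. The door closed."
edition_b = ("He looked up. An extra sentence appears here. "
             "She said nothing. The door closed.")

print(added_sentences(edition_a, edition_b))
```

A comparison like this only tells you that two editions differ, not whose words the extra sentences are; that still takes bibliographical work.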
Slide 17: How much does all this matter?
We’re already “massaging” texts by adding stop words (another way of saying taking words out).
So whose words are in the bag of words, or not in our bag of words? The insights of topic
modeling come at a cost, and part of that cost seems to be giving up a careful consideration of
the texts being used as historical objects representative of the time in which they were
written/edited (something models are often used to think about).
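To make the “taking words out” point concrete, here is a small sketch of building a bag of words while keeping track of what the stop list removes. The tiny stop list, the function name, and the sample sentence are assumptions made for illustration; real topic modeling tools ship much longer stop lists and their own tokenizers.

```python
from collections import Counter
import re

# Tiny illustrative stop list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "a", "of", "and", "in", "to"}


def bag_of_words(text, stop_words=STOP_WORDS):
    """Return (kept, removed) word counts for a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = Counter(t for t in tokens if t not in stop_words)
    removed = Counter(t for t in tokens if t in stop_words)
    return kept, removed


kept, removed = bag_of_words(
    "The editor of the edition changed a word in the text."
)
print("kept:", kept)
print("removed:", removed)
```

Looking at the removed counts alongside the kept ones is a reminder that the bag of words is already an edited object before any model sees it.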
Keep in mind that this was just one example of one change in one edition of one novel in our
corpus… something to consider as you go about making your corpus and interpreting your
model!