Day 3 Materials - Digital Humanities at Oxford

Close, Distant, and Scalable Reading
Glenn Roe & Martin Wynne
Digital.Humanities@Oxford Summer School
July 10 2013
Close Reading
Close reading: "operates on the premise that
literature, as artifice, will be more fully understood
and appreciated to the extent that the nature and
interrelations of its parts are perceived, and that
that understanding will take the form of insight
into the theme of the work in question. This kind
of work must be done before you can begin to
appropriate any theoretical or specific literary
approach”.
“[A] finely detailed, very specific examination of a short poem or short selected passage from
a longer work, in order to find the focus or design of the work [...] the meaning of the
microcosm, containing or signaling the meaning of the macrocosm (the longer work of which
it is a part). To this end "close" reading calls attention to all dynamic tensions, polarities, or
problems in the imagery, style, literal content, diction, etc”.
http://theliterarylink.com/closereading.html
Close Reading as the paradigm for
text-based humanities scholarship
But what do you do with
a million books?
There are only about 30,000 days in a human life -- at a book a day, it would
take 30 lifetimes to read a million books and our research libraries contain
more than ten times that number. Only machines can read through the
400,000 books already publicly available for free download from the Open
Content Alliance.
- Gregory Crane, “What do you do with a million books?”
D-Lib Magazine, March 2006
And 5 million books?
We constructed a corpus of digitized texts containing about 4% of all books ever printed.
Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey
the vast terrain of “culturomics” focusing on linguistic and cultural phenomena that were
reflected in the English language between 1800 and 2000. We show how this approach can
provide insights about fields as diverse as lexicography, the evolution of grammar, collective
memory, the adoption of technology, the pursuit of fame, censorship, and historical
epidemiology. “Culturomics” extends the boundaries of rigorous quantitative inquiry to a
wide array of new phenomena spanning the social sciences and the humanities.
www.sciencexpress.org / 16 December 2010
Culturomics…
Distant Reading
Distant reading: where distance, let me repeat it,
is a condition of knowledge: it allows you to focus
on units that are much smaller or much larger
than the text: devices, themes, tropes—or genres
and systems. And if, between the very small and
the very large, the text itself disappears, well, it
is one of those cases when one can justifiably
say, less is more. If we want to understand the
system in its entirety, we must accept losing
something. We always pay a price for theoretical
knowledge: reality is infinitely rich; concepts are
abstract, are poor. But it’s precisely this ‘poverty’
that makes it possible to handle them, and
therefore to know. This is why less is actually
more.
Franco Moretti, “Conjectures on World Literature”
Distant Reading, 2013.
Distant Reading
A canon of 200 novels, for instance, sounds
very large for 19th-century Britain (and is
much larger than the current one), but is still
less than 1% of the novels that were actually
published […] and close reading won’t help
here, a novel a day every day of the year
would take a century or so … And it’s not
even a matter of time, but of method: a field
this large cannot be understood by stitching
together separate bits of knowledge about
individual cases, because it isn’t a sum of
individual cases: it’s a collective system, that
should be grasped as such, as a whole.
Franco Moretti, Graphs, Maps, Trees: Abstract
Models for Literary History, 2005
Digital Humanities and
Distant Reading
The Humanities discovers data (DH 1.0 > DH 2.0)
Quickly leads to a “data deluge” (ars longa, vita brevis)
Big Data approaches to Humanities collections (e-Research)
From accelerated research to new knowledge discovery
digital > digitisation
Big Data and the Humanities
How Big is Big?
• The Complete Works of Voltaire (Voltaire Foundation):
1,077 individual works, 6.7 million words
• The Digital Encyclopédie of Diderot and d’Alembert (University of Chicago):
28 volumes in folio; 74,000 articles; 21.7 million words
• Electronic Enlightenment (University of Oxford):
60,000 letters, 23 million words
• ECCO-TCP (Oxford Text Archive):
2,300 volumes, 75 million words
• ARTFL-Frantext (University of Chicago):
3,500 volumes, 215 million words
• Early English Books Online EEBO (Northwestern University):
23,000 volumes, ~1 billion words
Matt Jockers,
University of
Nebraska-Lincoln
Macroanalysis:
Digital Methods and
Literary History
(UIUC Press, 2013)
Matt Jockers, Macroanalysis (2013).
Simon Raper, “Graphing the history of philosophy”
Distant Reading has a Long History:
Annales School, Book History, etc.
Counting, not reading:
• After death inventories
• Library holdings/circulation records
• Archives of publishers
• Vocabulary of titles (Furet)
• Censorship records
Martin, Furet, Darnton, Chartier, etc…
Robert Darnton, The
Forbidden Best-Sellers of
Pre-Revolutionary France
(New York, 1995), 189.
From “distant” (not)
reading to close
reading and back
again...
Digital Humanities
as a locus for
“scalable” reading
practices
DATA: digitally
assisted text
analysis
Martin Mueller,
Northwestern
Digital Humanities as locus
for “Scalable Reading”
By “not reading” we examine:
concordances,
frequency tables,
feature lists,
classifications,
collocation tables,
statistical models, networks, etc…
We can track:
Literary topoi (E.R. Curtius), concepts (R. Koselleck,
Begriffsgeschichte), épistémès (M. Foucault) and other semantic
patterns: over time, between categories, across genres.
So that distant reading and data-driven analysis can provide larger
contexts for close reading(s) and traditional scholarship.
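Two of the "not reading" views listed above, a frequency table and a keyword-in-context concordance, can be sketched in a few lines of Python. This is only a minimal illustration: the sample sentence, the crude tokenizer and the function names are assumptions made for the example, not part of any tool mentioned in these slides.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_table(tokens, top=5):
    """Return the `top` most frequent tokens with their counts."""
    return Counter(tokens).most_common(top)

def kwic(tokens, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

# Invented sample text, used only to exercise the functions.
sample = ("The state of the realm was poor. "
          "The king kept great state at court, and the state of his finances was worse.")
tokens = tokenize(sample)

print(frequency_table(tokens))
for line in kwic(tokens, "state"):
    print(line)
```

The concordance lines put every hit back into its immediate context, which is the point at which the "distant" view hands over to close reading.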
Digital Humanities as locus
for “Scalable Reading”
Three primary areas of Digitally Assisted Text Analysis:
1. Computational/Corpus Linguistics
2. Information Retrieval
3. Text Mining and Data Visualization
Corpus Linguistics and
Scalable Reading
Corpus
Concordance
Collocation
Sinclair, John, Corpus, Concordance, Collocation, Oxford University Press, 1991
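As a rough sketch of the third term, collocation, the Python below counts the words that occur within a fixed window of a node word. The window size, the sample text and the function name are assumptions for illustration. Raw co-occurrence counts are shown for simplicity; corpus linguists usually go on to apply an association measure such as mutual information or log-likelihood.

```python
import re
from collections import Counter

def collocates(text, node, window=4):
    """Count tokens found within `window` positions either side of each `node` hit."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo = max(0, i - window)
            # Neighbours to the left and right, excluding the node itself.
            counts.update(tokens[lo:i] + tokens[i + 1:i + 1 + window])
    return counts

# Invented sample text, used only to exercise the function.
sample = "strong tea and strong coffee; powerful engine, strong tea again"
table = collocates(sample, "strong", window=1)
for word, n in table.most_common():
    print(word, n)
```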
Some testable assertions
State

“...no political writer before the middle of the sixteenth century used the word 'state' in
anything like its modern political sense [referring to the machinery of government and social
control]” (Skinner, Quentin, The Foundations of Modern Political Thought, Cambridge
University Press, 1978).
Tudor

“The idea of a "Tudor era" in history is a misleading invention, claims an Oxford University
historian. Cliff Davies says his research shows the term "Tudor" was barely ever used during
the time of Tudor monarchs.” (http://www.bbc.co.uk/news/education-18240901 May 2012)
Holocaust

“I will argue that “The Holocaust” is an ideological representation of the Nazi
holocaust...Until recently, however, the Nazi holocaust barely figured in American life.
Between the end of World War II and the late 60s, only a handful of books and films touched
on the subject”. (Norman Finkelstein, The Holocaust Industry. Verso, 2000.)
A new opportunity
“It is not easy to justify assertions about the alleged frequency or
infrequency of some particular belief or attitude in the past. How many
examples does one need to cite in order to prove the point? Lacking
any satisfactory method of quantifying these matters, all I can do is to
record my impressions after long immersion in the period”.
Keith Thomas, The Ends of Life, Oxford University Press, 2010.
Intellectual History
“We cannot hope to understand the behaviour of people long
dead, unless we can reconstruct the mental assumptions
which led them to act as they did.”
- Keith Thomas, The Ends of Life, Oxford University Press, 2010.
Evidence:

• Writing
• Speech
• Thoughts
• Actions
• Artefacts (art, architecture, cooking, etc.)
• Other?
An objection (or two)
Isn't this just Googling stuff?
or
Isn't it just looking up words in online text
collections?
The perils of interpretation…
How do we interpret the results? We need to ask the questions:

What's in my corpus?
What's missing from the population of texts which the corpus is sampled
from?
What claims can I make about results from this dataset?
What is the right tool for the job?
Will I successfully retrieve all occurrences of the word forms which I am
looking for?
How can I make my search term more sophisticated?
What claims can I make about the significance of the frequencies?

How can I improve the process, and refine the results?

What do I need to investigate further?
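Two of these questions, retrieving all the word forms and making claims about frequencies, can be made concrete with a short Python sketch. Everything here is an assumption for illustration: the three-document corpus is invented, and treating "estate" as a variant of "state" is a choice the researcher would have to justify. Counts are normalised per million words so that periods of different sizes can be compared.

```python
import re

# Hypothetical mini-corpus: (year, text) pairs standing in for dated documents.
corpus = [
    (1540, "the estate of the realme and the state of the kynge"),
    (1590, "the state hath power and the state maketh lawes"),
    (1640, "reason of state and the interest of the state"),
]

# One pattern covering several forms: state, states, estate, estates.
VARIANTS = re.compile(r"\be?states?\b", re.IGNORECASE)

def count_hits(text, pattern):
    """Return (matches of the pattern, total whitespace-separated words)."""
    return len(pattern.findall(text)), len(text.split())

def per_million_by_period(docs, pattern, span=50):
    """Group documents into `span`-year periods and report hits per million words."""
    totals = {}
    for year, text in docs:
        start = year - year % span
        hits, words = count_hits(text, pattern)
        h, w = totals.get(start, (0, 0))
        totals[start] = (h + hits, w + words)
    return {start: round(h / w * 1_000_000)
            for start, (h, w) in sorted(totals.items()) if w}

for start, rate in per_million_by_period(corpus, VARIANTS).items():
    print(f"{start}-{start + span - 1 if (span := 50) else start}: {rate} per million words")
```

A rate computed from a handful of words, as here, says nothing reliable; the same code on a real corpus still leaves open what the corpus contains and what it is missing.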






Distant > Scalable Reading
DH Research and Development:
Full text search/retrieval
Tool development
Text mining approaches
PhiloLogic search engine
Information Retrieval:
PhiloLogic search engine
Open source full-text search and analysis system based
on traditional models of humanistic textual scholarship.
Used worldwide by a number of teams independently of
its French roots:
• Perseus under PhiloLogic - Greek and Latin Library
• The École des Chartes in Paris - medieval charters, etc.
• Brown Women Writers Project - heavy TEI encoding (Early Modern Women's Studies and the Scholarly
Technology Group of Brown University)
• Maison de Balzac in Paris (scholarly on-line edition of
Balzac's Comédie humaine)
• Abraham Lincoln Digitization Project at Northern Illinois
University
• Indica et Buddhica - Sanskrit texts compiled by an
independent scholar in New Zealand
• Alexander Street Press, a commercial on-line publisher.
Many collections of large data sets, including a large
collection of Black drama (about 1,200 plays)
PhiloLogic3's general features include:
Word and phrase searching:
• Proximity searches in sentences and paragraphs.
• Similarity searches - fuzzy matching (wildcards*)
Corpus definition using rich metadata at the document and subdocument level (Author, Title, Dates, Genre, etc.)
A variety of advanced reporting features:
• Concordances
• KWIC (Keyword in Context)
• Frequency distributions per period/work/author, etc.
• Collocations and collocation tables
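The idea behind a proximity search with wildcards can be sketched in plain Python. This is not PhiloLogic's implementation or query syntax: the shell-style patterns from the standard `fnmatch` module, the sentence splitter and the function names are all assumptions chosen to keep the example short.

```python
import re
from fnmatch import fnmatchcase

def sentences(text):
    """Split on sentence-final punctuation (a rough heuristic)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(sentence):
    """Lowercased word tokens of one sentence."""
    return re.findall(r"\w+", sentence.lower())

def proximity_search(text, pattern_a, pattern_b, max_gap=5):
    """Return sentences in which a word matching pattern_a occurs
    within max_gap word positions of a word matching pattern_b."""
    hits = []
    for sent in sentences(text):
        toks = words(sent)
        pos_a = [i for i, t in enumerate(toks) if fnmatchcase(t, pattern_a)]
        pos_b = [i for i, t in enumerate(toks) if fnmatchcase(t, pattern_b)]
        if any(i != j and abs(i - j) <= max_gap for i in pos_a for j in pos_b):
            hits.append(sent)
    return hits

# Invented sample text, used only to exercise the function.
text = ("Liberty is the mother of virtue. "
        "The liberties of the press are sacred. "
        "Virtue alone is happiness.")
print(proximity_search(text, "libert*", "virtu*", max_gap=5))
```

Only the first sentence is returned: it is the only one in which both patterns match within the allowed distance.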
"From words to works":
Extensions to PhiloLogic
PhiloMine: machine learning & text mining package
Open Source: http://code.google.com/p/philomine/
PhiloLine/PAIR: sequence alignment algorithms for text
comparison
Open Source: http://code.google.com/p/text-pair/
Different Similarities,
Different Searches
• Computing similarity is what enables search/retrieval
• Different kinds of similarity, different levels of text objects,
different kinds of search
• PhiloLogic finds and analyses word occurrences
• PhiloMine compares texts using word vectors to find topical
or stylistic similarity
• PAIR sequence alignment compares texts using ordered
sequences of words to identify text reuse
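Word-vector comparison of the kind attributed to PhiloMine above can be illustrated with cosine similarity over raw word counts. This is a generic sketch of the technique, not PhiloMine's code; the sample documents and function names are assumptions, and real systems usually weight or filter the counts first.

```python
import math
import re
from collections import Counter

def word_vector(text):
    """Represent a text as a bag of words: token -> count, word order discarded."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity of two count vectors: 1.0 = same proportions, 0.0 = no shared words."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented sample documents, used only to exercise the functions.
doc1 = word_vector("reason and nature")
doc2 = word_vector("nature and reason")
doc3 = word_vector("ships at sea")

print(cosine(doc1, doc2))  # same words in a different order
print(cosine(doc1, doc3))  # no words in common
```

Because word order is thrown away, the first two documents are indistinguishable to this measure. That is the contrast with sequence alignment, which depends on order.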
Styles of Search
Standard search
• Find all occurrences of a word (PhiloLogic)
• Find webpages about a word or concept (Google)
Comparison queries (PhiloMine)
• Find differences between sets of documents
• Find mislabeled documents
• Find similar documents
"Unsupervised" search
• Segment a document into topical chunks
• Cluster documents into cohesive groups (PhiloMine)
• Find repeated text in a corpus (PAIR)
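Finding repeated text can be approximated by looking for word n-grams ("shingles") that two texts share. This is a much simpler stand-in for sequence alignment, shown only to make the idea concrete; it is not PAIR's algorithm, and the sample sentences and function names are assumptions.

```python
import re

def shingles(text, n=4):
    """Return the set of overlapping word n-grams in a text."""
    toks = re.findall(r"\w+", text.lower())
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def shared_passages(text_a, text_b, n=4):
    """Return the n-grams that occur in both texts, as sorted strings."""
    common = shingles(text_a, n) & shingles(text_b, n)
    return sorted(" ".join(gram) for gram in common)

# Invented sample texts, used only to exercise the functions.
source = "Man is born free, and everywhere he is in chains."
borrower = "As Rousseau wrote, man is born free, yet few believe it."

print(shared_passages(source, borrower, n=3))
print(shared_passages(source, borrower, n=4))
```

Unlike the word-vector comparison, this depends on word order, so it picks up quotation and borrowing but not topical similarity.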
Distant reading at close range
or close reading at a distance
Distant reading at close range:
Voyant Tools
Martin’s slides: some examples from Voyant, bring it
all together?
Close/distant/scalable reading
corpus linguistics
info retrieval (full-text search/analysis)
text mining and data viz.
DATA – Digitally Assisted Text Analysis
Tools for scalable reading