Report on the Digital Humanities 2011 conference (DH2011)

General Information
Digital Humanities 2011, the annual international conference of the Alliance of Digital
Humanities Organisations (ADHO), was hosted by Stanford University Library from 19th to
23rd June. The conference featured opening and closing plenary talks given by David Rumsey
and JB Michel & Erez Lieberman-Aiden respectively, the Zampolli Prize Lecture given by
Chad Gaffield, three days of academic programme with four strands each, a large poster
session, as well as a complementary social programme. The highlights of the latter included a
conference banquet at the Computer History Museum in Mountain View, and excursions to
Silicon Valley, the Sonoma Wine Country, and Literary San Francisco. The conference website
is at <https://dh2011.stanford.edu/>, from which the conference abstracts can be downloaded.
The conference’s Twitter hashtag is #dh11. The following is a summary of the sessions
attended.
Opening keynote
“Reading Historical Maps Digitally: How Spatial Technologies Can Enable Close, Distant
and Dynamic Interpretations” - David Rumsey, Cartography Associates
David Rumsey noted that the conference programme featured a number of spatial indicators in
many of the presentations, which demonstrates a cross-disciplinary interest in spatial
dimensions and associated topics, such as exploration and visualization. Maps are extremely
complex spatial constructs that combine text and visualization, and which challenge us as
particularly dense information systems. A multitude of visualization approaches and tools are
used to “read” and unlock historical maps that are otherwise remote from us. Maps contain
historical, cultural, and political information, but are also objects of art with their own visual
idiosyncrasies and complexities. Historical maps have been used intensively in teaching and
scholarship for a long time; the large number of requests for reproductions that Rumsey’s
website <http://www.davidrumsey.com/> receives testifies to the interest from a multitude of
subjects.
Reading both the image and the text using digital tools is key to understanding historical maps.
These can be close, distant, and dynamic readings. For the purpose of close reading, scholars
have used a variety of tools as well as facsimiles and annotations as visualizations of aspects of
maps for centuries. Modern technology can help us to visualize and layer these notes more
effectively. Distant reading is facilitated by zooming out of individual maps and focussing on
the multitude of representations of the same geospatial zone, e.g. by looking at hundreds of
thumbnails at once as facilitated by the Luna Image library. Luna provides access to a large
maps database, offers advanced image compression, quick overviews, collection building,
atlases, full views of maps, and improved image manipulation. The coverage of individual
maps can be visualized on top of larger maps to give a quick overview for a large number of
maps. At the same time, additional serendipity is offered via the website’s ticker feature,
which shows a random selection of maps with only a minimal metadata record. Dynamic
reading, finally, happens in geographic information systems (GIS), which facilitate
three-dimensional visualizations. GIS
tools for historical maps that enable geo-referencing are key to reading historical maps
dynamically. Thus older maps can be displayed alongside newer maps over long periods of
time, altitudes can be highlighted in 3D, maps can be joined together to reflect political,
cultural, or historical areas. Precise overlaying of maps is another powerful tool for researchers
that helps with analysis and interpretation. In the near future, the automated pixel-based
extraction of information from maps, the OCR’ing of maps, will be a powerful tool to help
identify places, produce indexes, normalize information on disparate maps and align
information with historical gazetteers. Dynamic reading thus combines close and distant
reading, overlays help the thinking process, and 3D helps with a visually centred interpretation
of historical maps. Virtual worlds also offer additional possibilities, e.g. SecondLife or
OpenSim, which offer potentially infinite three-dimensional space that can be used to lay out
large maps, or to fly through the large areas covered by maps. They also allow for the creation
of a “tower” of thumbnails that offers a quick overview of large amounts of maps. However,
some maps are not visualizations of spaces, instead they offer symbolic gateways to space: to
be able to visualize these symbolic information systems is a fascinating new avenue and new
ways of understanding may thus evolve. The challenges ahead include effective pattern
recognition in maps, the move of GIS to Web-based services, and the opportunities and
challenges of crowd-sourced GIS.
Session one
“Layer upon layer: computational ‘archaeology’ in 15th century Middle Dutch
historiography” – Rombert Stapel, Fryske Akademy (KNAW) / Leiden University,
Netherlands
Texts in the Middle Ages are characterized by a number of interfering features, such as
transmission through copyists. Few original works have actually been handed down. It is
difficult for scholars to “read through” these distancing layers, but some progress can be made
with computational techniques. This paper focuses on the Teutonic Order’s chronicles,
Croniken van der Duytscher Oirden, a late 15th century chronicle on the history of the Teutonic
order. The chronicles cover the time from biblical origins to the Thirteen Years’ War against the
Polish King and Prussian orders. A new manuscript has recently been identified in Vienna, and
it is considered an extremely rare autograph of Hendrik Gerardz Van Vianen, who is the author
of a relatively large amount of texts and a well documented figure. An interesting question is
whether he was the author or a compiler of different sources, although in the Middle Ages this
distinction didn’t exist or was not strictly observed. Both roles frequently exist in one text. We
know that the prologue makes a claim to have been written by the Bishop of Paderborn who
was present at the foundation of the order, although this is unlikely. Similarly, the end is likely
to have been added by a different scribe. Original composition is therefore a difficult concept,
as is the “intended audience”. Traditional methods have helped to identify Hendrik as one of
the authors, but for some parts this is more difficult to determine. The use of computational
techniques, in the form of free, easy-to-use and quick-to-learn tools, offers a new avenue into
the text. Burrows’ Delta method of stylistic difference, a leading method of authorship
attribution based on word frequency analysis, has been used for this purpose. The base text was
encoded in TEI/XML, which will also be used for creating an edition of the text. For the testing
of the Delta method, the narrative was sampled according to the expectations and knowledge
of the text and then tested against those hermeneutic assumptions. The Delta method was
found to work well with a set of samples that could be verified. Following this successful run,
the method was applied to different primary samples. It was
verified that parts of the prologue are indeed not in Hendrik’s hand, and also that the ending is
not in the original writer’s hand. As a result, it can be summarized that both the prologue and
the ending must have pre-existed the narrative of Hendrik in the Croniken. Both texts must
have existed in the near temporal and spatial vicinity of the original composition and have been
absorbed into the Croniken.
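To make the method concrete, the following Python sketch computes a Burrows’ Delta score between a disputed sample and two candidate profiles. It is a minimal illustration, not Stapel’s implementation; the word list and frequencies are invented for the example.

    # Minimal sketch of Burrows' Delta, assuming pre-computed relative
    # frequencies of the most frequent words; the data below is invented.
    from statistics import mean, stdev

    corpus_profiles = {          # relative frequencies per 1,000 words
        "Hendrik":      {"ende": 31.2, "van": 24.8, "die": 22.1, "in": 15.3},
        "other_scribe": {"ende": 27.9, "van": 26.5, "die": 19.4, "in": 17.0},
    }
    disputed_sample = {"ende": 28.3, "van": 26.0, "die": 19.9, "in": 16.5}

    words = sorted(disputed_sample)
    # Mean and standard deviation of each word's frequency across the corpus.
    means  = {w: mean(p[w] for p in corpus_profiles.values()) for w in words}
    stdevs = {w: stdev(p[w] for p in corpus_profiles.values()) for w in words}

    def delta(sample, profile):
        """Mean absolute difference of z-scores over the word list."""
        return mean(
            abs((sample[w] - means[w]) / stdevs[w]
                - (profile[w] - means[w]) / stdevs[w])
            for w in words
        )

    for author, profile in corpus_profiles.items():
        print(author, round(delta(disputed_sample, profile), 3))
    # The candidate with the smallest Delta is stylistically closest.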
“The UCLA Encyclopedia of Egyptology: Lessons Learned” - Willeke Wendrich, UCLA
The UCLA Encyclopedia of Egyptology is a 5-year, NEH-funded project that is producing a
resource, primarily intended for undergraduate students, that helps with finding and using
quality scholarly resources in the field of Egyptology. The resource exists in two versions: an
open one, published in the UCLA eScholarship online system
<http://escholarship.org/uc/nelc_uee>, which contains all the in-depth information about the
subject, and a full version, which in addition includes marked-up texts, maps, and many
advanced features <http://www.uee.ucla.edu/>. The full version offers a universal system of
accessing information about Egyptology, the discipline, and its constituting scholarship. Each
article has a map associated with it that shows the places mentioned and any images available.
The map can also be used as a browsing tool to get into the subject. It always links back to the
scholarly resources available for a particular place or region. The project really comprises two
projects: an editorial project and an online publishing project. Google Docs is used as a
common editing tool for articles, Skype is used for meetings, and a review component is also
available for the whole acceptance process for new resources: everything is peer-reviewed. For
the open version, the workflow involves the submission of an article, its review, revision, copy-editing, and finally online publication. For the full version of the website, each article is also
encoded in TEI/XML and is made available in additional ways, e.g. via the map. The
Encyclopedia's bibliography and place name database are the biggest constituent parts of the
project. The full version of the project also makes regions, sites, and features available for
browsing. Additionally, there are three timelines, which record the creation, modification, and
occupation of places or regions. There have been several lessons learned: sustainability is a
huge issue for a large project like the Encyclopedia both in terms of preservation and of access.
Financial sustainability is also a major issue. The open version is free, however, the full
version will be subscription-based to cover the ongoing costs of development, the editorial
process, markup and metadata, and the use of Google Maps, as well as marketing and
subscriptions. A number of possible models are currently being discussed, such as trying to
establish a permanent endowment, a subscription-based model, or advertisements as a revenue
source.
“Possible Worlds: Authorial markup and digital scholarship” - Julia Flanders, WWP, Brown
University
There has always been a conceptual tension between two traditions of markup: editorial
markup vs. the markup of meaning (authorial markup). These traditions have different textual
commitments and enact different relationships between the text, the encoder, and the reader.
Mimetic (editorial) markup is about getting the text as an artefact to the reader. The authorial
markup model is more concerned with transporting interpretation as a scholarly tool. The
performative quality of markup is highlighted as a way of conveying meaning. The rhetoric of
authorial markup, its transactions between the encoder and his reader, revolves around the
communication of the encoder’s ideas and beliefs about a text as a means of interpretative
work, for which the text is primarily the carrier and reference. In the TEI/XML context, @ana
and @interp are important attributes for enacting this performance. TEI/XML is here used as
an authoring system. The authorial markup tradition often cites a journal article encoded in
TEI/XML as an example, where markup brings textual features into being. Authorial markup
captures the primary text and the interpretative reading. The text with authorial markup reveals
the performance of the critical reading. The result is a composite text that is no longer mimetic,
but performative and dynamic. Nuanced semantic differences are thus revealed: markup
becomes a world-constructing means of expression. This is a departure from the claim of
mimetic markup to truth. Meaning is no longer in the text alone, it evolves with the
performance of the authorial markup, in the reading of the text itself, not in its materiality.
Markup expresses the critical idea, an interpretative space for the encoder to play out his
performance. This type of markup enables speculative approaches to the encoding of text, e.g.
encoding a narrative passage as a poem. Taking away expectations and presumptions may
allow us to see different feature patterns in texts. This is sometimes called counter-factual or
“possible worlds” markup. The editorial and critical debate surrounding a text is another form
of markup, e.g. the genetic encoding developments in the TEI community, which allow for a
discussion in the markup of the text. Genre in this form of markup is sometimes reconsidered.
Texts might be formalized in completely different ways based on our readings of the text.
There are three important discussions: firstly, the suggestion of a reading brings a world into
being; secondly, it creates a conceptual space for the performance, e.g. certain readings might
be elevated to schemas away from the actual text, which becomes an instantiation; thirdly,
ODDs might be seen as a way of discussing the relationships between different schemas and
the ways they can work on the texts that have been included as instantiations of these schemas.
The problems/questions remaining are: can “possible-worlds” markup express the freedom it
seeks without sacrificing the truth-value of these “possible worlds”; how can meaning in the
markup be conveyed to the reader; how can readings be made more accessible to the reader as
possible interpretations of the text?
Session two
“Expressive power of markup languages and graph structures” - Yves Marcoux, Université de
Montréal, Canada
This paper addresses the possibilities of overcoming certain shortcomings of XML, such as
non-hierarchical (overlapping) structures, to increase its expressive powers for use in the
digital humanities. The structure of marked-up documents can be represented in the form of
graphs (DOMs). An XML document conforms to the subclass of graphs known as trees. There
is thus a perfect correspondence, in the sense that any such tree can be expressed as an XML document. It is clear,
however, that we need more complex structures than trees. Information is not always
hierarchical: the verse and sentence structures in a poem, for example, do not necessarily nest
properly and to host both overlapping structures in the same document is not currently possible
in pure XML. The same is true for speech and line structures, which often require recording
discontinuity. The general problem is the application of multiple structures to the same content.
While these are problems that cannot be easily expressed in XML, they also make it possible to
think of different ways of encoding the same instantiation. There have been different proposed
solutions: stay in XML, but manage issues in XML applications, e.g. TEI, or extend XML
itself as in TexMECS, an XML-like markup language allowing overlapping elements and other
constructs. The presenter has investigated the options offered solely by the overlap mechanism
of TexMECS, i.e. overlap-only TexMECS or OO-TexMECS, which accommodates documents
which allow for multiple parenthood and do not require a total ordering on leaf nodes. This
only slightly extended version of TexMECS makes it possible to express child-ordered directed
graphs (CODGs) that have only a single root, but is unable to express multiple-root ones.
Thus it has been shown that just adding overlap to XML does not allow for the more complex
scenarios and structures which prompted the investigation into a more expressive form of
XML. Future work will include optimal verification of serializations of documents into
OO-TexMECS, optimal serialization of CODGs, graphs with partially ordered children, and
other constructs of TexMECS not investigated here.
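The following Python sketch illustrates, with invented element names, the kind of structure at stake: a child-ordered directed graph in which a node may have several parents (overlap), together with a check for a single root. It is an illustration of the general idea only, not of Marcoux’s OO-TexMECS formalism.

    # Sketch: a child-ordered directed graph (CODG) allowing multiple parenthood,
    # i.e. the overlap that a pure XML tree cannot express. Names are invented.
    children = {
        "text":      ["verse1", "verse2", "sentence1"],
        "verse1":    ["w1", "w2"],
        "verse2":    ["w3", "w4"],
        "sentence1": ["w2", "w3"],   # overlaps both verses: w2 and w3 have two parents
        "w1": [], "w2": [], "w3": [], "w4": [],
    }

    def roots(graph):
        """Nodes that never occur as anyone's child."""
        all_children = {c for kids in graph.values() for c in kids}
        return [n for n in graph if n not in all_children]

    def parents_of(graph, node):
        return [p for p, kids in graph.items() if node in kids]

    print(roots(children))            # ['text'] -> a single-root CODG
    print(parents_of(children, "w2")) # ['verse1', 'sentence1'] -> multiple parenthood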
“Mining language resources from institutional repositories” - Gary F. Simons, SIL
International and Graduate Institute of Applied Linguistics
Language resources (texts, recordings, research, dictionaries, grammars, tools, standards) are
fundamentally important for any linguist, but they are not always easily discoverable or
accessible. This paper presents work done in the context of the Open Language Archives
Community (OLAC) <http://www.language-archives.org>, which was founded in 2000.
The “OLAC: Accessing the World’s Language Resources” project has produced metadata
usage guidelines, an infrastructure of text mining methods, and a set of best practices for the
language corpora creating community. It has created the OLAC Language Resource Catalogue,
which makes the world’s language resources discoverable. The problem is that there is no
conventional search mechanism that can find all the language resources out there, particularly
those hidden in the deep web or resources whose languages are not uniquely identified by
names. One systematic solution is to focus on accessing institutional repositories, in which
universities now preserve their language resources. The method employed consists of finding a
description of a language resource, extracting the languages represented therein, and using
harvesting to retrieve the metadata about these resources. MALLET (MAchine Learning for
LanguagE Toolkit) was used to train a classifier to identify the language resources, the sample
set was taken from the LoC catalogue, then a Python function was used to extract the names of
the languages using the ISO 639-3 codes from the LoC subject headings, and normalize these
lists using controlled vocabularies and stop lists. The OAI harvester was given 459 resulting
URLs, and the harvest yielded 5 million DC metadata records, which were then piped through
the classifier and extractor, which left roughly 70,000 potential language resources. But the
question remained: which of these should be entered into the OLAC catalogue? An additional
language code identification filter was run on the results, which left about 23,000 records that
were considered genuine language resources. This automated process resulted in a 79%
accuracy rate of the algorithm for language extraction, and 72% precision for language
identification. However, a number of known problems remain: non-English metadata, names
used as adjectives of ethnicity or place, language names as place names, missing words from
stop lists, incorrect weighting heuristic, incomplete metadata records where languages are not
explicitly identified. As a result of the work, 22,165 language resources were identified with
acceptable precision, but more work to tweak algorithms remains to be done.
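As a minimal illustration of the language-extraction step described above (and not the project’s actual code), the sketch below matches Library of Congress style subject headings against a tiny, invented excerpt of the ISO 639-3 code table and applies a stop list.

    # Sketch of the language-extraction step: match subject headings against
    # ISO 639-3 reference names. The tiny code table and the sample metadata
    # record below are invented for illustration.
    ISO_639_3 = {            # reference name -> ISO 639-3 code (tiny excerpt)
        "swahili": "swa",
        "quechua": "que",
        "navajo":  "nav",
    }
    STOP_WORDS = {"language", "languages", "texts"}

    def extract_languages(subject_headings):
        """Return the set of ISO 639-3 codes mentioned in DC subject headings."""
        found = set()
        for heading in subject_headings:
            for token in heading.lower().replace("--", " ").split():
                token = token.strip(".,;()")
                if token in STOP_WORDS:
                    continue
                if token in ISO_639_3:
                    found.add(ISO_639_3[token])
        return found

    record = {"subject": ["Swahili language -- Grammar", "Texts, Quechua"]}
    print(extract_languages(record["subject"]))   # {'swa', 'que'}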
“Integration of Distributed Text Resources by Using Schema Matching Techniques” - Thomas
Eckart, Natural Language Processing Group, Institute of Computer Science, University of
Leipzig
This investigation builds on the text mining project eAQUA (2008-11) of ancient Greek and
Latin resources. While the importance of the use of standards has long been recognized in the
community, there is still a lot of variety of data types, editorial decisions and encoding
solutions, and almost one third of the original project’s resources was devoted to this initial
part. Heterogeneity is a huge problem and often a result of distributed teams working on data
with different research foci, different skill sets, and different tools. In addition, there is
heterogeneity of data models (XML, databases), technical and infrastructure heterogeneity, as
well as semantic heterogeneity. The common approach to solving these issues is of course the
use of standards, e.g. TEI/XML, DocBook, ePUB, TCF, but frequently the need to create
extensions creates dialects of the chosen standard. The solution this paper proposes is schema
matching. It consists of schema mapping (correspondences of elements in two schemata) and
the automated detection, or schema matching, of these correspondences. This approach is often
used in scenarios of merging large DBMS. The methods used include profiles (pairwise
comparison of elements), features (name similarity, path similarity), instance-based features,
and distribution-based features. The corpora used for the work were different versions of the
Duke Databank of Documentary Papyri (DDoDP) in TEI/XML, its EpiDoc-encoded equivalent
(Epiduke) and an extraction from the latter stored in a flat relational schema. To find
corresponding elements, the techniques of fingerprinting of elements, pairwise linking, and
scoring according to similarity were used. All results were normalized to the interval [0,1],
where '0' corresponds to no similarity and '1' to identity. The various approaches have shown
that especially semantic approaches are promising for identifying similar elements. For cases
with very little semantic overlap, structural analysis can also be taken into account, but this has
proven valuable only where complex structures exist. Future work will include the
automatic determination of weights, text normalization strategies, analysis of micro structures
(key words, NER, chunking). New use cases are also required.
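A minimal sketch of the pairwise matching idea follows, assuming invented element paths and arbitrarily chosen weights; it combines name and path similarity into a single score normalized to [0,1], in the spirit of (but much simpler than) the features used in the paper.

    # Sketch of pairwise schema matching: score name and path similarity of
    # elements from two schemata, normalized to [0, 1]. Element paths are invented.
    from difflib import SequenceMatcher
    from itertools import product

    schema_a = ["/TEI/text/body/div", "/TEI/teiHeader/fileDesc/titleStmt/title"]
    schema_b = ["/document/body/division", "/document/metadata/title"]

    def similarity(a, b):
        """Combine name similarity and full-path similarity, each in [0, 1]."""
        name_sim = SequenceMatcher(None, a.split("/")[-1], b.split("/")[-1]).ratio()
        path_sim = SequenceMatcher(None, a, b).ratio()
        return 0.7 * name_sim + 0.3 * path_sim   # weights chosen arbitrarily here

    # Score every pair; 0 means no similarity, 1 means identity.
    for a, b in product(schema_a, schema_b):
        print(f"{similarity(a, b):.2f}  {a}  <->  {b}")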
Session three
“gMan: Creating general-purpose virtual environments for (digital) archival research” - Mark Hedges, Centre for e-Research, King’s College London
Project gMan <http://gman.cerch.kcl.ac.uk/> is a JISC-funded project, part of their VRE
programme, which builds on initial work done by DARIAH. The context for the project is that
the increasingly data-driven humanities have led to the development of a variety of VREs for a
number of purposes, specific disciplines, or activities. gMan investigates the possibility of a
more general-purpose system that supports day-to-day activities (simple generic actions,
complex processes). The aim of gMan is to build a framework for generic VREs, to support
research built up from generic actions. Built on the gCube development platform
<http://www.gcube-system.org/>, content and content model primitives are expressed in RDF.
Thus it enables both the creation of a VRE management framework and a Virtual Organization
(VO) model. Features of the system include the ability to import data and to work on it
collaboratively. It offers tools as well as data. As an evaluation exercise of its key
functionalities, the system has been tasked with assembling virtual collections, searching
across collections, adding annotations, adding links between objects, searching annotations,
and sharing materials with colleagues. Some of the challenges were that the data import
procedures were created by the gCube team, and adaptations between the data models were
necessary, which is not really scalable. The infrastructure is based on EU project funding and is
therefore not sustainable. A humanities VO needs its own sustainable infrastructure. The
current infrastructure is really a mix of services that are provided by a number of different
entities and sustained in different ways; this is something that needs to be addressed in the
future. The system has its first application in the EHRI European Holocaust Research
Infrastructure project, which has a large number of fragmented and dispersed archival sources,
and draws on DARIAH curation services for their collection. gCube offers a working research
environment, not a publication framework, but it offers an insight into the research process,
allows for provenance and justification, and can enhance publication.
“Opening the Gates: A New Model for Edition Production in a Time of Collaboration” - Meagan B. Timney, Electronic Textual Cultures Laboratory, University of Victoria
Digital literary studies have long developed a system of working with texts digitally and in a
digital editions context. Electronic digital editions need to be revisited in light of the new
collaborative nature of the Web. Editions used to rely on the expertise of a single person or
small group of people. In the digital world this limitation appears artificial and unnecessary.
Instead we need a dynamic edition model for representing text, a combination of text and tools
that offers a dynamic interface to the edition. Hypertextual editions offer the easy connection
of disparate resources, but presuppose a large library and vast full-text base. These two models
need to be united in a scholarly edition. The scholarly primitives developed by John Unsworth
and others have been taken as the basis for the construction of the functionalities of the
dynamic digital edition. This new type of edition challenges the authority of the editor of
traditional archives and editions, and the social dynamics create a social edition that is
collaborative and co-creative. The work of the editor is then to curate the contributions of the
individual contributors. Many of the Web 2.0 labels can be transferred to social editions, from
incompleteness to open source. The paratext instead of the text thus becomes the focal point.
The community of users is the underlying authority of the social edition. Interpretative changes
based on user input are at the heart of the social edition. It prioritises fluidity over stasis.
Community-driven collaborative editing is at the centre of this new departure into a social
digital humanities.
“When to ask for help: Evaluating projects for crowdsourcing” - Peter Organisciak, University
of Illinois
Crowdsourcing represents collaborative work broken down into little tasks. The central
question is when is crowdsourcing appropriate? How do you entice a crowd to care? How can
crowdsourcing be utilised within the set framework of a research project? The presenter has
investigated a sample of 300 sites that carry the term crowdsourcing in delicious tags and
employ the method for their projects. Common methods employed include encoding
aggregation (perception-based tasks utilizing human capacity for abstraction and reasoning),
knowledge aggregation (utilizing what people know, whether facts or experiences), and skills
aggregation. The primary motivators identified have been interest in the topic, ease of entry
and participation, altruism and meaningful contribution, sincerity, appeal to knowledge, and
money. Secondary motivators are any indicators of progress and one's own contribution as well
as positive system acknowledgement and feedback. Future work needs to take into account that
the barriers to crowdsourcing are falling, and that it is becoming easier, therefore more specific
investigations into academic projects are required. However, crowdsourcing is also under some
criticism as being “unscholarly”, ethically questionable, misusing funding resources, and
distastefully publicity-seeking.
Session four
“Evaluating Digital Scholarship: A Case Study in the Field of Literature” - Susan Schreibman,
Digital Humanities Observatory; Laura Mandell
The evaluation of digital scholarship has been a central, yet vexing issue for a long time, and
the topic of many articles and discussions. The key is not simply evaluating but valuing digital
scholarship. Its legitimacy and contribution to the field must be acknowledged. Unfortunately,
digital scholarship has coincided with strains on humanities funding and struggling academic
presses. The many digital outputs of the digital humanities are not fitted to print publication
and are often going unnoticed and unrewarded. Digital scholarship is also collaborative by
nature, and there is little facility to recognize these collaborations and to reward them. Digital
outputs have also often been relegated to a secondary form after the print publication, a mere
adjunct. Many projects in the digital humanities that provide a service are also less frequently
acknowledged. Archives are often dismissed as what librarians do, programming as what
technologists do, user education as what teachers do. Research is not seen as embedded in
these endeavours. So it’s about how we define research in our area: the many various
expressions of digital scholarship need to be taken up by evaluation bodies and recognized as
scholarly activities and outputs. The humanities as a discipline have to broaden traditional
concepts of scholarship to recognise these new developments. The presenters have created a
wiki to collect these points. There was also a workshop around the new ways the digital
humanities work. The start was the evaluation of digital editions, one of the longest traditions
in the digital humanities, but again editions, print or digital, are not considered as scholarship.
Technologies in the digital humanities are perceived as jargon. What is vital is a departure from
print-based peer-review systems. However, we cling onto them for lack of a better solution.
Online publication is now so easy that everything can be online. In traditional thinking, the
quality is only visible in the publishing body behind the publication, e.g. an academic press.
There is no question: digital scholarship must be evaluated, scrutinized, and pass the
requirements of scholarship, but traditional evaluation systems fail to see beyond the prestige
of the presses. Some first steps have been made: an NEH-funded NINES workshop has
produced guidelines on how to evaluate digital scholarship, with another workshop to follow.
Authorship in the context of collaborations is still one of the central problems and will be
addressed specifically in these reports. The hope is to develop new modes of rewarding the
hard work that goes into creating digital resources whose impact goes far beyond their own
field.
“Modes of Composition in Three Authors” - David L. Hoover, New York University
This paper investigates the question of how mode of composition affects literary style. It
investigates three writers who changed their mode of composition, from handwriting to
dictation (and back), either temporarily or permanently, namely Thomas Hardy, Joseph
Conrad, and Walter Scott. These are cases in which the details of composition are well known
and in which the changes take place within a single text. Hardy's novel A Laodicean is a good
example as the change in mode occurred after the first three instalments of the novel and
switched back in the final sections. The change of mode from handwriting to dictation was
caused by an illness which required him to lie on his back and from which he only slowly
recovered. Word frequency analysis is used as the basis of the stylometric analysis. It became clear
very quickly that there is no fundamental stylistic difference between the modes of
composition, and a variety of different stylistic methods (mean word length and mean sentence
length) verify the observation. While the employed techniques are able to find subtle changes,
this is not based on the mode of composition, as had been assumed for Hardy. Style is dictated by the
narrative structure, the progression of the story, not mode of composition. Conrad dictated
parts of three of his works, The End of the Tether, The Shadow-Line, and The Rescue. As with
Hardy, though the reason for the switch of mode was a different one in each case, there is little
evidence from the stylometric analysis of any change attributable to the change of mode of
composition. Again narrative structure is a much more powerful influence. Finally, the same
can be observed in Walter Scott's novels The Bride of Lammermoor and Ivanhoe. More
investigations into other forms of mode changes, such as typewriting and word-processing, will
be necessary before any generalizations are possible.
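The two simple measures mentioned above can be illustrated with a short Python sketch; the passage is invented and Hoover’s actual text preparation is not reproduced.

    # Sketch of the two simple style measures mentioned above, computed over an
    # invented passage.
    import re

    passage = ("Hardy wrote the first instalments by hand. Illness then forced "
               "him to dictate the middle of the novel to an amanuensis.")

    sentences = [s for s in re.split(r"[.!?]+", passage) if s.strip()]
    words = re.findall(r"[A-Za-z']+", passage)

    mean_word_length = sum(len(w) for w in words) / len(words)
    mean_sentence_length = len(words) / len(sentences)   # words per sentence

    print(f"mean word length:     {mean_word_length:.2f} characters")
    print(f"mean sentence length: {mean_sentence_length:.2f} words")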
“Names in Novels: an Experiment in Computational Stylistics” - Karina van Dalen-Oskam,
Huygens Institute for the History of the Netherlands - KNAW, The Hague
This paper presents work made possible by the increasing availability of digital texts and tools
in the field of literary onomastics, the study of the usage and functions of names in literary
texts. The aim is to compare the usage and functions of proper names in literary texts between
oeuvres, over time, and places. The use of a quantitative approach is helpful to discover what is
really happening in novels and leads to new questions. Named entity recognition and
classification (NERC) is used as a proven method for this study. The corpus comprises 44
English and 20 Dutch novels. The focus is on novels published in the last 20 years. There are
several levels of encoding in name usage: tokens, lemmas (normalized forms), mentions (one
or more tokens), and entities (with one or more different lemmas); in name categories: personal
names, geographical names; and in reference types: plot-internal names, and plot-external
names. Not much difference could be found between the two language corpora with regard to
names. Genre plays a bigger part, e.g. children's books have a lot more named entities than
non-children's books. There was also a difference between originals and translations, which led
to a difference in both tokens and mentions. An example is Bakker’s Boven is het stil and its
translation The Twin: there are not many names in the novel, and the translator has not added
any names. A comparison of lemmas reveals that this novel has, unusually, more geographical
names than personal names, and closer investigation shows that travel plays a special part in
the novel. Explicitly formulated travel routes explain the territorial taboos of the main
character. The geographic names thus reveal an important plot point. One important outcome
of the project is the importance of genre. Children's books form a distinct group among the
novels investigated. The challenges for such work are that there are still not many novels
available in digital form, tokenization is chaotic when used across languages, NERC tools do
not really speed up the work or enhance the quality, and the drawback of focussing on only one
level of encoding is inconsistent results when applied to only one novel vs. a corpus. What is
needed is a comprehensive textual work environment, a combination of concordancing and
tagging options, and dynamic self-learning NERCs, as well as options for statistical analysis.
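The different counting levels (tokens, mentions, lemmas) can be illustrated with a small Python sketch; the annotation format and the names in it are invented and do not reflect the project’s data model.

    # Sketch of the counting levels described above (tokens, mentions, lemmas),
    # using an invented annotation format: (mention, lemma, category).
    from collections import Counter

    annotations = [
        ("Anna de Vries", "Anna", "personal"),
        ("Anna",          "Anna", "personal"),
        ("Groningen",     "Groningen", "geographical"),
        ("Den Haag",      "Den Haag", "geographical"),
        ("Groningen",     "Groningen", "geographical"),
    ]

    tokens_per_category = Counter()
    mentions_per_category = Counter()
    lemmas_per_category = {}

    for mention, lemma, category in annotations:
        mentions_per_category[category] += 1
        tokens_per_category[category] += len(mention.split())   # a mention = 1+ tokens
        lemmas_per_category.setdefault(category, set()).add(lemma)

    print("tokens:  ", dict(tokens_per_category))
    print("mentions:", dict(mentions_per_category))
    print("lemmas:  ", {c: len(s) for c, s in lemmas_per_category.items()})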
Session five
“The Interface of the Collection” panel session - Geoffrey Rockwell, University of Alberta,
and members of the INKE project
Geoffrey Rockwell introduced the panel in which members of the Interface Design team of the
Implementing New Knowledge Environments (INKE) project discussed the manner in which
interfaces are influenced by the structure of the materials included, by the history of traditions
of representing collections, and by the intended use and needs of the users. It will address
questions such as how have interfaces changed with the move to digital; how do we interact
with interfaces; how do interfaces determine our engagement with the artefact? As the corpus
is the body of the scholarly web, information is channelled through interfaces. Interfaces
mediate content and filter information.
“The Citation from Print to the Web” - Daniel Sondheim, University of Alberta
For scholars, citations are an important interface feature and the multitude of citation design
patterns from earliest times is testimony to their importance. These patterns are: absence,
juxtaposition, the canonical citation, the footnote, and the citation of other media. The design
of the citation tells how we conceive of the topology of knowledge and the relationships
between its constituting sources. The canonical citation is integrated into the text, it is a link to
an external work that is not dependent on a particular source edition. Canonical citations on the
Web exist e.g. in the Canonical Text Services (CTS) website. Juxtapositions are inline
citations that are more clearly distinct from the text than a canonical citation; they highlight
the differences between the body of the text and the annotations, yet also maintain
a closeness between them as they appear usually in the same place. Marginalia are a different
kind of juxtaposition, even more closely connected to the main body of the text. Footnotes are
a type of “elsewhere” note, where a symbol is used to connect the body to the annotation. On the
Web often icons are used to highlight these connections, like the camera icon on Google Earth.
These three types of citations highlight the relationships between different types of knowledge.
“The Paper Drill” - Stan Ruecker, University of Alberta
The presenter demonstrated a new experimental interface, The Paper Drill, for navigating
collections of articles in the humanities through citations. Scholars often use citations in
research in the form of “chaining”, i.e. the use of citations as ways of expanding one’s
bibliography for a new, yet unfamiliar, field of research. The hope is to come up with an
automated process of creating these chains of references (“citation trails”) to establish a good
set of relevant articles for a particular topic. The idea is to find a seed article that will contain a
number of top articles that get cited over and over again as relevant and are cited themselves as
relevant one level down. We can then identify the authors of the top articles, which gives an
additional avenue into subject areas. The interface presents the most frequently cited articles in
the form of heatmaps, arranged according to date range and journal category.
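A minimal sketch of the chaining idea follows: starting from a seed article, the works most frequently cited one level down are ranked. The citation graph is invented and the sketch does not reflect The Paper Drill’s implementation.

    # Sketch of citation "chaining" from a seed article: rank the works most
    # frequently cited by the seed's references. The citation graph is invented.
    from collections import Counter

    cites = {                       # article -> list of articles it cites
        "seed": ["a1", "a2", "a3"],
        "a1":   ["b1", "b2"],
        "a2":   ["b1", "b3"],
        "a3":   ["b1", "b2", "b4"],
    }

    def citation_trail(seed, graph):
        """Count how often each article is cited one level down from the seed."""
        counts = Counter()
        for referenced in graph.get(seed, []):
            counts.update(graph.get(referenced, []))
        return counts.most_common()

    print(citation_trail("seed", cites))
    # [('b1', 3), ('b2', 2), ('b3', 1), ('b4', 1)] -> candidates for the reading list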
“Diachronic View on Digital Collections Interfaces” - Mihaela Ilovan, University of Alberta
Interfaces to digital corpora make the collections manageable and usable. Their evolution over
time offers an interesting insight into the developments of both technology and design. The
criteria for this investigation have been age of the resources, complexity, version
management, media implementations, and academic involvement. The study looked at Project
Gutenberg, Perseus Digital Library, and the Victorian Web. The inherent limitations of the
study are a scarcity of reliable screen captures, and sparse bibliography on the history of
interface developments. Interface design is influenced by a number of factors, such as
technology, users, and discourses. Web technology has most notably changed in the screen
resolution available, bandwidth limitations, as well as formats and file sizes, which all
ultimately influence design choices. Users influence interfaces through their developing
expectations through experiences from interacting with many collections, thus supporting
standardizations of terminologies and layout choices. The ontological discourse is also a
notable influence, e.g. in the project vs. subsequent digital library stages, which coincided with
a professionalization of bibliographic metadata. This historic approach to design studies offers
valuable insights into the evolution of design choices, but is influenced by factors not always
intrinsic to the project, such as having to tell success stories.
“The Corpus from Print to Web” - Geoffrey Rockwell, University of Alberta
This paper investigates the design and evolution of corpus interfaces by comparing features of
two domains, namely print and web. The study focuses on three types of corpora: linguistic,
literary, and artefactual corpora. Linguistic corpora are very much defined by the user
community of linguists. They offer many options for refined searches and complex
functionalities. Linguistic corpora in print do exist. They offer a streamlined view of the data,
with the usual table of contents, indices, glossaries, abbreviations as possible avenues into the
collection. On the Web, a complex browsing/search box usually replaces the table of contents
and indices. The search very much defines the way into the collection. In artefactual corpora
the limitations of print often force authors to split up their materials and to make choices about
the physical organisation of the materials. On the Web the organisation is less based on
physical organisation, but is search based and offers a variety of detail from simple thumbnails
to rich metadata-centred overviews. Printed literary corpora equally feature tables of contents,
bibliography, glossary, errata, as typical organisational features. On the Web search is the
dominant feature, along with browsing by well-established features, such as authors, works,
timelines, and places. To conclude, in the transition from print to web the following can be
observed: the introduction of automated search, a shift from narrative to database, different
views when decoupled from physical arrangements, new modes of interaction (dynamic views,
tours), and a loss of authorial control/interpretation (still visible in essays, tours etc.).
Session six
“Topic Modeling Historical Sources: Analyzing the Diary of Martha Ballard” - Cameron
Blevins, Stanford University
Historian Laurel Ulrich's A Midwife's Tale is the starting point for this study, which instead of a
traditional close reading of the diary of the 18th-century midwife Martha Ballard, uses topic
modelling to mine a digitized transcription of the diary. The diary covers 27 years, from 1785
to 1812, and contains nearly 10,000 entries, a total of 425,000 words. This case study uses the
MALLET toolkit for finding topics. While diaries are exceptionally rich historical resources,
they are also extremely challenging, often fragmented, like court records, newspapers, letters,
accounting ledgers etc. The quality of data is a huge problem. Particularly in diaries, the
inconsistent spellings (e.g. the word “daughter” is spelled fourteen different ways),
abbreviations and contractions in shorthand, all add to making the text difficult to read for
computers. Topic modelling, a method of computational linguistics that attempts to group
words together based on their appearance in the text, can transcend this messiness. MALLET
identified thirty topics, which were then manually labelled. By applying the modelled topics to
each diary entry separately, it was possible to chart the behaviour of certain topics over time.
Daily scores for certain topics reveal interesting patterns for the yearly distribution of certain
activities, such as gardening, knitting, midwifery, but also reveal interesting patterns over the
whole 27 years covered by the diary. Quantification and visualization of these patterns is the
real benefit of this method, identifying patterns sometimes not immediately visible in close
readings. Even messy sources are becoming manageable with this approach.
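As an illustration of the topic-modelling step (the study itself used MALLET), the following sketch builds a small LDA model with the gensim library over a handful of invented diary-style entries and scores one entry against the resulting topics.

    # Sketch of the topic-modelling step. The study used MALLET; this analogous
    # illustration uses the gensim library, with a few invented diary entries.
    from gensim import corpora, models

    entries = [
        "clear day worked in the garden sowed beans and pease",
        "was called to mrs howards safe delivered of a daughter",
        "spun wool and knit at home rainy day",
        "delivered mrs pollards son at 2 oclock this morning",
    ]
    texts = [entry.split() for entry in entries]

    dictionary = corpora.Dictionary(texts)
    bows = [dictionary.doc2bow(text) for text in texts]

    # Far fewer topics than the thirty reported above, given the toy corpus size.
    lda = models.LdaModel(bows, num_topics=2, id2word=dictionary,
                          passes=10, random_state=1)

    for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
        print(topic_id, [w for w, _ in words])

    # Scoring each entry against the topics gives the per-entry topic weights
    # that can then be charted over time, as described above.
    print(lda.get_document_topics(bows[1]))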
“An Ontological View of Canonical Citations” - Matteo Romanello, King's College London
This paper examines the use of canonical citations to Classical (Greek and Latin) texts by
scholars. Canonical citations provide an abstract reference scheme for texts similar to how
coordinates express references to places. The paper wants to make use of this long-established
practice, and through extracting citations from scholarly papers create a domain ontology for
Classics. Canonical citations in Classics very much reflect an ontological view of texts,
particularly how classicists perceive ancient texts as objects. This work has resulted in the
Humanities Citation Ontology (HuCit), an ontology of the semantics of citations in humanistic
disciplines. Canonical citations are particularly useful in this respect as they are used for the
purposes of precision of identification, persistence of reference, and interoperability across the
domain. Citations of Aristotle have been chosen as an application, mainly because of the
reliability of the Bekker numbers, named after an important early editor of Aristotle’s works,
which are used by all Classicists. FRBRoo is used as the underlying model for the ontology. It is the result of a
harmonization of FRBR with the CIDOC-CRM. Canonical citations are thus considered as
resolvable pointers in any expression/manifestation of a work. This allows the definition of
types of references and the examination of alternative representations of the same reference,
and is meant to support interoperability of tools that are currently being developed to extract,
retrieve, and resolve canonical references. This work builds on related work such as CiTO,
OpenCyc, and SWAN Ecosystem, and the work of CTS (Harvard) and CWKB (Cornell).
“Victorian Women Writers Project Revived: A Case Study in Sustainability” - Michelle
Dalmau and Angela Courtney, Indiana University Libraries
The Victorian Women Writers project started in 1995 as a full-text corpus, which was no
longer being added to by 2003, was then revisited in 2007, and finally relaunched in 2010. On
the level of content, the process involved the migration of old content to TEI P5, and on a
structural and institutional level in firmly embedding the project into the University's Digital
Humanities curriculum. The work very much depended on local and faculty expertise, on the
exploration and evaluation of projects, and experimentation with tools. One way of reviving
the VWW project was to use it in teaching and research, thus to build up a pool of expertise
and encoders, and to draw on expertise in the English department. This integration into the
curriculum has resulted in a number of student projects in text encoding, producing secondary
contextualizing materials, and exploring Digital Humanities tools and resources, or in creating
an online exhibit. The act of marking up texts has also inspired closer and new readings of
texts. Students understood the values behind and the principles and philosophy of structural
and semantic encoding. This work also led to the development of a prosopography, to the
addition of an annotations facility and the creation of critical introductions. Scholarly encoding
cultivates technical skills and refines critical thinking and interpretation in students. Evaluation
of the class by students was positive and collaboration was appreciated. Continuing student
involvement is one of the many positive outcomes.
“The Cultural Impact of New Media on American Literary Writing: Refining a Conceptual
Framework” - Stephen Paling, School of Library and Information Studies, University
of Wisconsin-Madison
The goal of this investigation is to extend Social Informatics to the study of literature/arts by
conducting a broad-based survey of the American literary community, writers, publishers,
journalists. Four key values were identified as the conceptual framework: positive regard for
symbolic capital, negative regard for immediate financial gain, positive regard for autonomy,
and positive regard for fresh, innovative work or work only possible electronically (avant-gardeism). The study focuses on the emergence of new forms of literary expression offered by
information technology and on whether these forms are able to establish themselves alongside
traditional forms of expression in American literary writing. The main research questions are:
is the American literary community showing positive regard for this new literary innovation,
and do they show support for the use of technology in creating these innovations. A survey was
sent out to 900 members of the Association of Writers and Writing Programs, the MLA, as
well as representatives of publishers. There were 400 exclusively national respondents. The
results show that the vast majority support innovation and original works. However, when
technology comes into play these figures plummet dramatically. Generally print-based output
is regarded as having much higher quality and impact than online publications. To conclude, it
can be summarized that there is generally support for a positive regard of these innovative new
forms, but intensive use of technology is not really either well received as a single mode of
publication or indeed much produced. Only about 10% of the American literary community
demonstrates any intensifying use of technology. There is little evidence of a move of
electronic literature out into the mainstream either from a producers' or a publishers'
perspective.
Zampolli Lecture
“Re-Imagining Scholarship in the Digital Age” - Chad Gaffield, Professor of History at the
University of Ottawa, President of the Social Science and Humanities Research Council of
Canada
Re-imagining scholarship is prompted by a new era of scholarship, which draws upon our past
but embraces the technologies of today, particularly digital humanities technologies.
Scholarship in the digital age is changing and shaped by the work done in the Digital
Humanities today. Deep conceptual changes are underway in humanities scholarship and these
require a redefinition of our field and the methodologies we embrace. Zampolli pioneered the
collaborative aspect of digital scholarship, and created the community that we build upon
today. The interconnectedness of research is really the key realization in this dynamic field of
exploration. Michael B. Katz was influential in the exploration of the use of computers in
the 70s and 80s. Later, groups devoted to textual interpretation emerged and
put their mark on the discipline. Our focus needs to be on the new ways of thinking that are
made possible through the use of modern technology. Canada has been innovative in the
creation of a research infrastructure for the country that was qualitatively and quantitatively
reflective of the state of the nation and the needs of the community. Computer-based analyses
of long-term social change in Canada have a long and fruitful tradition. Beginning in the 70s and
80s, based on decennial census data starting in 1871, political debates, and public discussions,
historical and social research have been influenced by this rich data source. Census data was
digitized, OCR’d, marked up and has been partially made available online (1911 and some
samples from earlier censuses, some data from 1971 onwards). A number of digital projects have
drawn on this rich data source, among them database projects such as the Canadian Social
History Project, the Vancouver Island Project, the Lower Manhattan Project, and the Canadian
Families Project. To conclude, there are deep conceptual changes afoot that reflect new forms
of complexity, diversity, and creativity in a technology-driven age. Digital Humanities are
ideally placed to tackle some of the more complex problems, to embrace diversity and make it
workable, which empowers a larger group of people to be creative in new and exciting ways.
Re-imagining scholarship is re-imagining teaching by fostering learning and re-imagining
research by broadening horizons and embracing collaboration. The challenges remaining are
bridging the solitudes of arts/humanities, ensuring digital sustainability, managing open
innovation and intellectual property, measuring what matters, and giving acknowledgement to
the contributions of the digital humanities.
Session seven
“Moving Beyond Anecdotal History” - Fred Gibbs, George Mason University
Walter Houghton's seminal work The Victorian Frame of Mind, 1830-1870 has influenced
generations of scholars of the nineteenth century and remains the primary introduction to
Victorian thought for students today. Houghton's identification of Victorian traits such as
optimism, hero worship, and earnestness, based on the use of particular words, has been
influential if not wholly uncritically received. Houghton bases his findings on examining
hundreds of primary sources of the period, but despite criticisms of anecdotal truths based on
elite intellectual history, Victorianists have been unable to thoroughly assess the validity of the
assertions or to offer an alternative view. This paper hopes to examine Victorian history via
lots of books, thousands of books, instead of just “literature”, by using the resources made
available through Google Books. Methodologies are important when answering basic questions
and challenging assumptions based on a too limited number of sources. There is however a
clear tension between rhetoric and practice: the rhetoric emphasises lots of data, tools for
everything, visualize everything, explore, and offer new interpretations; in practice, we deal
with messy data, an underdeveloped understanding of text vs. data, difficult complex tools, and
opaque (even if pretty) visualizations. Further criticisms can be added: bias, sampling
problems, unclear significance, and lacunae with unknown consequences. This paper attempts
a solution by using and exposing simple techniques (scripting and queries), facilitating not-so-distant reading, and active, contextualized engagement with texts. The approach is to study the
use of keywords commonly associated with the period and apply it to the titles of books
published between 1790 and 1910. Are there peaks observable in the literature of the period,
e.g. “revolution”, “heroic”, “faith”, “commerce”, “science” as terms frequently associated with
the 19th century? The Google Books Ngram viewer <http://ngrams.googlelabs.com/> was used
initially, but found to be too simplistic, so Amazon's S3, Elastic MapReduce and Hive services
<http://aws.amazon.com/elasticmapreduce/> were used to do some more detailed explorations,
as they offer more flexibility, are cheaper than developing it yourself, and have produced tried
and tested results. The result has been that there is more than one answer to setting up queries
and visualizing any results, which are all helpful and offer more thorough insights into the new
questions we are interested in, e.g. "science of" as a term reveals the fundamental rise of
science in the 19th century as a ubiquitous term as opposed to the term “science” as we
understand it today. As the study only focused on titles, it has been impossible to make any
generalizations, but analysis of the full text will be the next step. Often results have been found
to be confusing and incomplete, but it is important to highlight the inconsistencies along with
the new and exciting questions we are able to ask as a result. There is also a danger of
concentrating on deliverables, an emphasis on novelty and production. Glossing over
inconsistencies or rationalizing them away poses the huge danger of falling back to the
assumptions we are trying to question. To conclude, big data does not require complex tools,
and makes the transitions from text to data and back transparent.
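A far simpler stand-in for the Elastic MapReduce/Hive queries described above is sketched below in Python: it counts keyword occurrences in book titles per decade over an invented list of (year, title) records.

    # A much simpler stand-in for the Hive/MapReduce queries described above:
    # count keyword occurrences in book titles per decade. Records are invented.
    from collections import defaultdict

    records = [
        (1798, "An Essay on Revolution and Faith"),
        (1833, "The Heroic Age of Commerce"),
        (1851, "The Science of Political Economy"),
        (1859, "Faith and Science in the Modern World"),
        (1871, "A Treatise on the Science of Morals"),
    ]
    keywords = ["revolution", "heroic", "faith", "commerce", "science"]

    counts = defaultdict(lambda: defaultdict(int))   # decade -> keyword -> count
    for year, title in records:
        decade = (year // 10) * 10
        lowered = title.lower()
        for kw in keywords:
            if kw in lowered:
                counts[decade][kw] += 1

    for decade in sorted(counts):
        print(decade, dict(counts[decade]))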
“Towards a Narrative GIS” - John McIntosh, University of Oklahoma
Narratives are frequently examined through a time-centric approach, as a series of events.
Space is less frequently considered in narrative construction and analysis. This NSF-funded
project defines a geospatial narrative as a sequence of events in which space and time are of
equal importance. The objectives are to automate collection of space-time events from
digitized documents, to develop a data model that can support event based queries, that can be
linked to narratives, and that can support queries based on content. The result is a visualization
of the geospatial narrative. The conceptual model is event-based (action, actors, objects,
location, time), these events are combined to form narratives. The data sources are two
distinct corpora of histories of the Civil War era, Dyer's Compendium of the War of the
Rebellion and the Richmond Daily Dispatch. The process involves the automated extraction of
information into a database: the source materials are digitized and tokenized, and identified
events are extracted along with locations, subjects/objects, and time references. Automation is based on
evaluating parts of speech: “action verbs” and “event nouns” are treated as atomic events,
nouns are always important, including proper nouns, time- and space nouns. The project uses
Python, the Natural Language Toolkit (NLTK), its tokenizer, and a part-of-speech tagger in its
workflow. For the purpose of toponym resolution, it seeks to match place names to thesauri to
identify correct real-world references. The main challenges were ambiguity, metonymy, and a
lack of comprehensive historical gazetteers. Named entity recognition and classification is used
to group consecutive proper nouns, excluding named entities in non-spatial phrases and stop
lists. Gazetteer matching is employed to resolve multiple matches through spatial
minimization. For the purpose of evaluation, a sample from each dataset is checked manually
and the result of the automated processes is compared with the hand markup. The project's
outputs include the development of a relational database application that allows users to query
the database themselves, to query for new event types, and to construct queries for event
chains. The project has produced code to extract narrative building blocks, a database for
storage, and a query interface for that purpose that will support the ability to query large
datasets for atomic building blocks and to combine them into narratives.
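A minimal sketch of the NLTK pipeline named above (tokenization, part-of-speech tagging, named-entity chunking) follows; the sentence is invented and the required NLTK data packages are assumed to be installed.

    # Minimal sketch of the NLTK pipeline described above: tokenize, tag parts of
    # speech, and chunk named entities. The sentence is invented; the NLTK data
    # packages ('punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker',
    # 'words') must be downloaded beforehand via nltk.download().
    import nltk

    sentence = ("On July 21 the regiment marched from Washington "
                "toward Manassas Junction.")

    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)                 # part-of-speech tags
    tree = nltk.ne_chunk(tagged)                  # named-entity chunks

    # Verbs and proper nouns are candidate building blocks for atomic events.
    verbs = [w for w, t in tagged if t.startswith("VB")]
    places = [" ".join(w for w, _ in st.leaves())
              for st in tree.subtrees() if st.label() == "GPE"]

    print("verbs: ", verbs)      # e.g. ['marched']
    print("places:", places)     # candidate toponyms for gazetteer matching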
“Civil War Washington: An Experiment in Freedom, Integration, and Constraint” - Liz
Lorang, University of Nebraska-Lincoln
Civil War Washington is a thematic research collection that explores the historical transformation of Washington, D.C. during the Civil War, when the city emerged as a haven for freed and runaway slaves and was to lead the nation from slavery to freedom and equality for all. The project is related to the Walt Whitman Archive and was originally conceived as an exploration of Whitman and Lincoln in Washington. The dramatic events surrounding the city in the four years of the Civil War are documented in a vast body of documents. Assembling and digitizing these is only one aspect; the other is analysis, visualization, and tool-building to make the city come alive. The amount of material is huge and the encoding challenging and time-consuming, yet it is a prerequisite for the work undertaken. All data is
stored in a relational database to enable connecting up all the different types of documents,
people, events, places, institutions, etc. A public interface is currently under development.
Technical infrastructure and data models are constantly being revised; they are part of the research process and should be made transparent just like any other assumption. Visualization in the form of a GIS application is another part of the project. The project uses ArcGIS, which has been found to be the easiest to adapt and allows for more analytical investigations of the data, but as a proprietary product it has its own challenges and limitations. Compromises will always be necessary, however, due to the differing limitations of desktop and Web applications. Humanists should be more involved in the development of open GIS tools to make them more appropriate to the field and to the many uses of digital scholarship. Civil War Washington does not want to be a neatly packaged product; instead it aims to stimulate research and to be experimental and transparent in its assumptions and limitations. The interrelatedness of all the data is really the key complexity, challenge, and opportunity.
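The project's data model was not shown in detail; as a rough illustration of the kind of relational linking described (documents connected to people, places, and events), a minimal schema might look like the following, with all table and column names invented for the example.

    # Illustrative sketch only: a minimal relational schema of the kind the
    # project describes, linking documents to people, places and events.
    # Table and column names are invented; they do not reflect the project's
    # actual database design.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE document (id INTEGER PRIMARY KEY, title TEXT, date TEXT);
    CREATE TABLE person   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE place    (id INTEGER PRIMARY KEY, name TEXT, lat REAL, lon REAL);
    CREATE TABLE event    (id INTEGER PRIMARY KEY, label TEXT, date TEXT,
                           place_id INTEGER REFERENCES place(id));
    -- Many-to-many links between documents and the entities they mention.
    CREATE TABLE doc_person (doc_id INTEGER REFERENCES document(id),
                             person_id INTEGER REFERENCES person(id));
    CREATE TABLE doc_event  (doc_id INTEGER REFERENCES document(id),
                             event_id INTEGER REFERENCES event(id));
    """)

    # Example query: all documents mentioning events at a given place.
    rows = conn.execute("""
    SELECT d.title, e.label, p.name
    FROM document d
    JOIN doc_event de ON de.doc_id = d.id
    JOIN event e      ON e.id = de.event_id
    JOIN place p      ON p.id = e.place_id
    WHERE p.name = ?
    """, ("Washington, D.C.",)).fetchall()
    print(rows)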
Session eight
“Googling Ancient Places” - Elton Barker, Open University, et al.
The Google Ancient Places (GAP) project aims to computationally identify places referenced in scholarly books and to deploy the results via simple Web services. The project,
funded by the Google Digital Humanities Award programme, is based on Google Books. It
facilitates access to ancient places in Google Books, and then adds information about the
places to the texts. GAP builds on the AHRC-funded HESTIA project, but instead of relying
on heavily marked-up text provided by the Perseus Digital Library, employs computational
methods as well as additional open infrastructure in the form of new semantic gazetteers like
GeoNames and the Pleiades Project. GAP also draws on the resources of Open Context, an
open-access archaeological data publication system. These tools enable the geo-tagging of the
text and referencing the places in the Google Books corpus via easily resolvable public
identifiers. The project is currently in a proof-of-concept stage. Data and technology are freely
available and are documented in a blog <http://googleancientplaces.wordpress.com>. The
Pleiades gazetteer of ancient places is improved upon by adding modern forms of places found
in the Google Books literature. Visualizations have been produced on Google maps, among
them a visualization of books projected on maps of the Ancient world. GAP makes intensive
use of the already existing digital humanities infrastructure and must be seen in the context of
collaboration with a number of projects who are all working on similar projects and problems.
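As a rough sketch of the gazetteer-based geo-tagging GAP describes, the following matches place names in a text against a toy gazetteer and attaches resolvable identifiers; the entries and identifiers are placeholders, not real Pleiades records, and the matching is far simpler than the project's actual computational methods.

    # Minimal sketch of gazetteer-based geo-tagging in the spirit of GAP:
    # scan a text for known ancient place names and attach resolvable public
    # identifiers. The gazetteer entries and identifiers are placeholders.
    import re

    TOY_GAZETTEER = {
        "Athens": "https://pleiades.stoa.org/places/000001",  # placeholder id
        "Sparta": "https://pleiades.stoa.org/places/000002",  # placeholder id
    }

    def geo_tag(text):
        """Return (name, start, end, uri) tuples for every gazetteer match."""
        annotations = []
        for name, uri in TOY_GAZETTEER.items():
            for match in re.finditer(r"\b%s\b" % re.escape(name), text):
                annotations.append((name, match.start(), match.end(), uri))
        return sorted(annotations, key=lambda a: a[1])

    if __name__ == "__main__":
        passage = "The envoys travelled from Athens to Sparta before the war."
        for annotation in geo_tag(passage):
            print(annotation)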
“Image Markup Tool 2.0” - Martin Holmes, Humanities Computing Media Centre, University
of Victoria
The presenter demonstrated the re-developed Image Markup Tool. Version 1 of the software has been widely used, even in a few major projects. Its simple-to-use interface and full TEI-awareness have made it a popular tool in the digital humanities community, but version 1 handles only one image per file, allows only rectangular zones and a one-to-one relationship between zones and divs, essentially constructs the TEI document for you, and is Windows-based. The improvements in version 2 address these issues. Flexible linking has been added to allow for many-divs-to-one-zone and one-div-to-many-zones relationships. Version 2 is also written in C++ and built on the Qt framework, and is therefore cross-platform and open-source (hosted on SourceForge). The presenter gave a demo of the interface and functionality of the new version. The new linking mechanism will be based on the TEI linkGrp mechanism and will store the links in the back matter of the document. The desktop version is seen as preferable to server-based solutions because of network latency, the power of a native C++ application, interface components familiar from the platform, and independence from a server infrastructure.
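By way of illustration of the TEI linkGrp mechanism mentioned above, the following sketch builds a skeletal TEI document in which link elements in the back matter connect zones to divs; the element structure follows standard TEI practice, but the attribute values and the generation code are assumptions for this example, not the tool's actual output.

    # Sketch of the kind of TEI linkGrp the new linking mechanism is described
    # as using: <link> elements in the back matter pointing at zone and div ids.
    # Illustrative only; not the Image Markup Tool's actual output format.
    import xml.etree.ElementTree as ET

    TEI_NS = "http://www.tei-c.org/ns/1.0"
    XML_NS = "http://www.w3.org/XML/1998/namespace"
    ET.register_namespace("", TEI_NS)

    def tei(tag):
        return "{%s}%s" % (TEI_NS, tag)

    def xml_id(value):
        return {"{%s}id" % XML_NS: value}

    root = ET.Element(tei("TEI"))  # teiHeader omitted for brevity
    facsimile = ET.SubElement(root, tei("facsimile"))
    surface = ET.SubElement(facsimile, tei("surface"))
    ET.SubElement(surface, tei("zone"),
                  {**xml_id("zone1"), "ulx": "0", "uly": "0", "lrx": "120", "lry": "80"})
    ET.SubElement(surface, tei("zone"),
                  {**xml_id("zone2"), "ulx": "0", "uly": "90", "lrx": "120", "lry": "160"})

    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))
    ET.SubElement(body, tei("div"), xml_id("div1"))
    back = ET.SubElement(text, tei("back"))

    # One div linked to two zones: each pairing is an individual <link> whose
    # target attribute points at both ends.
    link_grp = ET.SubElement(back, tei("linkGrp"), {"type": "annotations"})
    ET.SubElement(link_grp, tei("link"), {"target": "#zone1 #div1"})
    ET.SubElement(link_grp, tei("link"), {"target": "#zone2 #div1"})

    print(ET.tostring(root, encoding="unicode"))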
“Lurking in Museums: In Support of Passive Participation” - Susana Smith Bautista,
University of Southern California
Lurking is a term popularized with the advent of online communities as a mode of passive
participation, e.g. listening, watching, reading, attending, etc. The museums community has a
vital interest in communities around their museums that reflect visitor participation. It is
important to recognize that all types of participation are valuable for the constitution of a
certain domain, community or particular project. The key is understanding, motivating and
analysing participation. In the world of museums, lurking is important because it is a principal
mode of interaction. Museum websites have long embraced Web 2.0 community-based support systems, and it has increasingly been recognized that lurking is only initially a passive form of participation; it can change into activity at any time when sufficiently prompted or motivated. Lurkers are also not necessarily loners; there are whole communities that consist
of a majority of lurkers and still constitute an important aspect of the whole. If we define
knowledge as a process, then actions as part of this process can be internal or external, both
equally valuable. Lurkers provide an audience; they contribute to the social ecosystem through
their mere presence. Lurking is clearly not appreciated when participation is required but not
given. Social and peer pressures are mechanisms that sometimes make lurking difficult.
However, lurking is not opposed to participation; indeed, museums must encourage it in order to be able to nourish, motivate, and transform it into interactivity when required.
“RELIGO: A Relationship System” - Nuria Rodriguez, University of Málaga (Spain) and
Dianella Lombardini, Scuola Normale Superiore (Italy)
RELIGO is a relationship system designed to express the nature and characteristics of relations between digital objects hosted in digital libraries and text archives; it is a tool for interpretative study. These relationships can lead to new insights and the creation of knowledge, and thus deserve to be treated as research objects in themselves. RELIGO is designed to construct interpretations based on these significant relationships: it relates texts, concepts, words, and visual artefacts. RELIGO relates these entities on two logical levels: the expression, i.e. the digital objects, and the semantics, i.e. the digital concepts that allow the interpretation to be built. A digital concept can therefore itself become a digital object when it is the subject of interpretation. RELIGO makes these connections searchable and browsable: they are represented in the form of navigable graphs, and interpretations can be reconstructed simply by following the paths between the digital objects. Documents can be imported as PDF, images as JPG, concepts as Topic Maps, and relationships as ontologies; digital objects are represented in XML/SVG. Ongoing work involves metadata management, export and viewing functionalities, a Web-based version, content sharing and social tagging, as well as the management of more object types such as audio and video.
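The navigable-graph idea can be illustrated with a small sketch; the node names, relation labels, and the choice of the networkx library are assumptions made for this example and are not part of RELIGO itself.

    # Illustrative sketch only: modelling RELIGO-style relationships between
    # digital objects (texts, images) and digital concepts as a navigable graph.
    import networkx as nx

    graph = nx.DiGraph()

    # Expression level: digital objects; semantic level: digital concepts.
    graph.add_edge("treatise.pdf", "concept:allegory", relation="discusses")
    graph.add_edge("painting.jpg", "concept:allegory", relation="depicts")
    graph.add_edge("concept:allegory", "concept:virtue", relation="interpreted_as")

    # An interpretation can be reconstructed by following a path between
    # objects and concepts.
    path = nx.shortest_path(graph, "treatise.pdf", "concept:virtue")
    print(" -> ".join(path))

    # Browsing: list everything directly related to a given digital object.
    for _, target, data in graph.edges("painting.jpg", data=True):
        print("painting.jpg", data["relation"], target)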
“Omeka in the Classroom: The Challenges of Teaching Material Culture in a Digital World”
– Allison Marsh, University of South Carolina
It is surprisingly difficult to convince faculty and students of the value of digital research. The
digital world, though part of their lives as “digital natives”, is often not considered part of their
professional lives. In the museum studies community and teaching programmes, material
culture in the digital world remains a challenge. The presentation reported on the experiences
had with the Omeka open-source software <http://omeka.org/> as a teaching tool for graduate
students. The software is used because of its low barriers to use and because it allows for adding content and interpretations. Students on the course are required to produce exhibits using the system, add images and Dublin Core (DC) metadata, organize materials, and construct a narrative around them.
Online exhibits are an excellent way of training students in the curation of material objects in the digital
world and are thus a useful tool in the curriculum. While the results of students’ efforts are
often lacking in both content and presentation, Omeka has proved to be an excellent
pedagogical tool, as using the software has made students aware of the many challenges
involved in representing material culture online. One outcome of the experience of teaching the
course will be some improvements to the Omeka software.
Closing keynote
“Culturomics: Quantitative Analysis of Culture Using Millions of Digitized Books” – JB
Michel & Erez Lieberman-Aiden, Cultural Observatory, Harvard University
In a library we can either read a few books, carefully, or read all the books, not very carefully.
This presentation aims to offer a quantitative exploration of cultural trends, focussing on
linguistic and cultural changes reflected in the English language between 1800 and 2000. How
does culture change? How does language change? An example is irregular verbs that have
regularized over time. The approach was to investigate grammar books as a starting point to
trace the regularization of these verbs. It was discovered that many of the 177 irregular verbs present in Early Modern English became regularized; in particular, rarely used verbs were regularized more quickly than frequently used ones. How can we measure cultural trends more generally? First, we need the world's largest corpus of digitized books, so a sample of 5 million volumes (with good-quality metadata) was used out of the 15-million-volume Google Books corpus. The aim is to automate the analysis of cultural phenomena over time.
N-gram analysis, based on the Google Books Ngram Viewer, is used as the method for the
investigation. Many built-in controls were necessary to verify the quality of the dataset, which
was then taken as the data source. A lot of the time the so-called Scientific Method involves
trial and error. There are lots of problems with certain n-grams, particularly in pre-1800 books: bad dates, OCR errors, noise, corpus bias, and random rubbish are just a few of the issues. But you can also find examples that are worthy of further attention and quantification and that sometimes reveal interesting results. Censorship is an extremely interesting case: during the Nazi regime, for example, the names of certain artists are blocked out almost completely, while in other countries they develop in the literature as one would expect. The same can be observed during the McCarthy era. The Google Books Ngram Viewer was created to make the data and their analysis available to the public, transparent, and open. There are even spin-off projects such as an Ngram viewer for musical notes. We are thus well on the way to a fully-fledged culturomics: we can create huge datasets, we can digitize every text written before 1900, we should create high-quality images of works of art, we need to track cultural changes on the Web, and we must make
everything available to everybody. We will also need to work together to make all of this work,
in large teams distributed all over the world. We need to embrace the expertise of the sciences
and use their infrastructure and resources for the humanities. We need to embrace expertise
wherever we can find it, from how to read texts to how to interpret our observations. We need
to learn to interpret data (from scientists), and to interpret texts (from humanists). We also need
to teach humanities students to code and to be quantitatively rigorous in their approach.
Culturomics thus extends the boundaries of rigorous quantitative inquiry to a wide array of
new phenomena spanning the social sciences and the humanities.
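The underlying n-gram trend computation can be illustrated with a small sketch: the relative frequency of a phrase in a given year is its count divided by the total number of n-grams of that size for the year. The counts below are invented placeholders, not real Google Books data.

    # Minimal sketch of an n-gram trend of the kind shown in the Ngram Viewer:
    # relative frequency per year = occurrences of the phrase that year divided
    # by the total number of n-grams of that size published that year.
    # All counts here are invented placeholders.

    NGRAM_COUNTS = {  # year -> occurrences of a hypothetical 1-gram
        1900: 120, 1925: 340, 1950: 90, 1975: 410, 2000: 980,
    }
    TOTAL_COUNTS = {  # year -> total 1-grams in the corpus for that year
        1900: 2_000_000, 1925: 3_500_000, 1950: 3_000_000,
        1975: 5_000_000, 2000: 9_000_000,
    }

    def relative_frequency(year):
        return NGRAM_COUNTS[year] / TOTAL_COUNTS[year]

    for year in sorted(NGRAM_COUNTS):
        freq = relative_frequency(year)
        # Crude text plot of the trend, one '#' per 5e-6 of relative frequency.
        print(f"{year}: {freq:.2e} {'#' * int(freq / 5e-6)}")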
It was also announced that next year’s Digital Humanities conference will be hosted by the
University of Hamburg, and the 2013 conference will be hosted by the University of Nebraska-Lincoln.
12/07/2011
Alexander Huber, University of Oxford
<http://users.ox.ac.uk/~bodl0153/>