From entity description to semantic analysis: The case of Theodor
Fontane’s notebooks
Martin de la Iglesia and Mathias Göbel, SUB Göttingen
Over the last decades, TEI has become a major instrument for philologists in the digital
age, particularly since the recent incorporation of a set of mechanisms to facilitate the
encoding of genetic editions. Editions use XML syntax while aiming to preserve the
quantity and quality of old books and manuscripts, and to publish many more of them
online, mostly under free licences. Scholars all over the world are now able to use huge
data sets for further research. There are many digital editions available, but only a few
frameworks to analyse them. Our presentation focusses on the use of web technologies
(XML and related technologies as well as JavaScript) to enrich the forthcoming edition of
Theodor Fontane’s notebooks with a data-driven visualisation of named entities, and to
build applications around such visualisations that can be reused for any other edition in
the world of TEI.
State of the art
The TEI Guidelines provide various mechanisms for tagging references to entities in texts,
as well as solutions for encoding metadata supplied by editors about such entities. Such
methods are frequently employed in digital editions. For example, on the website of the
edition of William Godwin’s diaries¹ we are able to highlight the names within the text in
different colors. Often these parts are rendered in HTML as <acronym> and are equipped
with a <div> box containing further information that pops up when the user clicks on or hovers
over them. This is a simple and easy-to-use way to deliver further information and some
search options, but it does not per se facilitate a detailed analysis.
With the help of the <speaker> tag in TEI-encoded drama, a quantitative analysis of
spoken words becomes possible. One example is provided by the Women Writers Project,
which visualizes speakers in drama by gender.² It is also possible to get a quantitative
overview of the co-appearance of two or more characters, as has been done for Victor Hugo’s
Les Misérables with the help of the D3.js JavaScript library.³
Persons and places seem to be the most common types of tagged entities. These are
usually normalized, i.e. spelling variations are merged and matched to an authoritative
name, and some additional data not found in the encoded source text is provided – most
commonly biographical dates for persons and geographic coordinates for places.
Additional data might include excerpts from encyclopedias, or map visualisations of the
location of places. In most editions, the usage of entity encoding can be
characterised as descriptive rather than analytical: information is provided about entities,
but the way in which they are referenced in source texts and how the entities relate to
each other are recorded and used for navigational purposes only. This paper, employing the
example of a TEI edition project of 19th-century notebooks, discusses further potential
uses of such TEI-encoded semantic annotations.
Theodor Fontane’s notebooks
From 1859 until the late 1880s, the German poet Theodor Fontane (1819–1898) filled
almost 10,000 pages in 67 notebooks, which have not yet been published in their entirety.
They include diary entries, travel notes, theater criticism and drafts for novels and poems,
resulting in a wide spectrum of text types and images.⁴ The complete edition of the
notebooks both in print and online is being prepared at the Theodor Fontane-Arbeitsstelle,
Department of German Philology at Göttingen University, in collaboration with the
Göttingen State and University Library.⁵ In his notebooks, Fontane made extensive use of
underlining, cancellations, corrections and additions, and consequently the crucial aspect
of the philological edition project is to precisely transcribe, encode, annotate and visualize
the appearance of Fontane’s handwriting, in order to help the reader to decipher and
understand it. Another important task within this project, however, is to identify and encode
references to entities in the notebooks.
These include:
• persons, organizations – linked to authority files such as GND⁶ or VIAF⁷, online
  historical encyclopedias
• places – all of the above, plus linked to geographical databases such as
  GeoNames or the Getty Thesaurus of Geographic Names
• dates – normalized to machine-readable standards, so that dates can be sorted
  and durations calculated
• artworks, buildings – linked to their creators, locations, and provided with their
  dates of creation
• literary works, musical works – linked to their authors and, where applicable,
  online versions
• events (e.g. battles) – linked to places and provided with dates
• characters in works of fiction – linked to the respective works.
Because of the density of occurrences and the variety of entity types, Fontane’s notebooks
lend themselves to advanced methods of semantic analysis.
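The point about normalized dates can be illustrated with a minimal Python sketch. It assumes ISO 8601 values of the kind TEI allows in the when attribute of <date>. The sample values and the helper name parse_when are hypothetical and not taken from the edition, and padding partial dates to the first day of the year or month is a simplifying assumption.

```python
from datetime import date


def parse_when(value: str) -> date:
    """Parse an ISO 8601 date string (YYYY, YYYY-MM or YYYY-MM-DD).

    Assumption: partial dates are padded to the first day of the
    year or month, which is a simplification made for sorting.
    """
    parts = value.split("-")
    padded = parts + ["01"] * (3 - len(parts))
    year, month, day = (int(p) for p in padded)
    return date(year, month, day)


# Hypothetical normalized values, e.g. read from <date when="..."> elements.
whens = ["1864-09-12", "1859", "1864-05"]

# Sort the original strings chronologically.
chronological = sorted(whens, key=parse_when)

# Duration in days between the earliest and the latest reference.
span_days = (parse_when(chronological[-1]) - parse_when(chronological[0])).days

print(chronological)
print(span_days)
```

A real edition would also have to decide how to treat date ranges and uncertain dates, which this sketch ignores.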
Semantic analysis
These entity occurrences are encoded in a fairly common way, using <rs> elements which
link to lists of elements in which the entities are described and linked to external authority
records, and <date> elements in the case of chronological references. At a later project
stage, we will explore the possibilities of deriving other formats from this data that facilitate
the extraction and processing of its semantic content, such as Geography Markup
Language (GML) or Keyhole Markup Language (KML) for spatial data, or CIDOC-CRM for
events. This paper explores how our entity data, which is available in similar form in
many other TEI-encoded editions, can be put to use in ways that go beyond the traditional
uses described above and enter the realm of semantic analysis.
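As a rough illustration of this encoding pattern, the following Python sketch resolves <rs> references against a list of described entities and counts how often each entity is mentioned. The TEI fragment, the identifiers, the idno placeholders and the function name entity_frequencies are all hypothetical; the project's actual markup and processing (XSLT and XQuery) may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI fragment; ids, names and idno values are placeholders.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>Visit to <rs type="person" ref="#p1">A.</rs> and
         <rs type="person" ref="#p2">B.</rs>; later
         <rs type="person" ref="#p1">A.</rs> again.</p>
    </body>
    <back>
      <listPerson>
        <person xml:id="p1">
          <persName>Person A</persName>
          <idno type="GND">placeholder-1</idno>
        </person>
        <person xml:id="p2">
          <persName>Person B</persName>
          <idno type="GND">placeholder-2</idno>
        </person>
      </listPerson>
    </back>
  </text>
</TEI>
"""


def entity_frequencies(tei_xml: str) -> dict:
    """Count <rs> references per entity and compute their relative share."""
    root = ET.fromstring(tei_xml)

    # Map each xml:id in the entity list to its authoritative name.
    names = {}
    for person in root.iter(f"{{{TEI_NS}}}person"):
        name = person.find(f"{{{TEI_NS}}}persName")
        names[person.get(XML_ID)] = name.text if name is not None else None

    # Count local references of the form ref="#id".
    counts = Counter()
    for rs in root.iter(f"{{{TEI_NS}}}rs"):
        ref = rs.get("ref", "")
        if ref.startswith("#"):
            counts[ref[1:]] += 1

    total = sum(counts.values())
    return {
        eid: {"name": names.get(eid), "count": n, "share": n / total}
        for eid, n in counts.items()
    }


freqs = entity_frequencies(SAMPLE_TEI)
print(freqs)
```

Grouping the same counts by page or notebook instead of by whole document would give the relative frequencies per text unit.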
Examples of such analyses include:
• counting entities and calculating their relative frequency. We expect both a high
  number and a high concentration of entities on pages containing short notations or
  lecture notes. Thus, we hope to be able to distinguish these parts from literary
  manuscripts;
• enriching personal data with birth and death dates from authority files and
  calculating differences in order to identify historical strata;
• identifying co-occurrences of persons and other entities and constructing
  networks in order to calculate graph theoretical measures;
• connecting places to routes, visualizing them on maps and calculating their
  distances using coordinates from external databases. Place entity references
  can occur in several different roles⁸: in this context, we must distinguish places
  visited by Fontane where he took notes, and distant places only mentioned by
  Fontane. It will be of interest to analyse the differences and similarities between
  these two geographic networks, particularly when a chronological dimension
  (i.e. the date of Fontane’s visit, or the date of a historic event referred to by
  Fontane which took place at a mentioned site) is added;
• comparing Fontane’s statements about entities, such as dates, locations, and
  names, with what we know about them today.
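The calculation of distances along a route of places can be sketched as follows. The text does not say which distance measure will be used; the haversine formula on a spherical Earth is an assumption made here for illustration, as are the function names and the approximate sample coordinates, which are not taken from the edition or from any external database.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def route_length_km(stops):
    """Sum of the distances between consecutive stops of a route."""
    return sum(haversine_km(p, q) for p, q in zip(stops, stops[1:]))


# Illustrative, approximate coordinates (latitude, longitude).
route = [
    (52.52, 13.40),  # roughly Berlin
    (52.92, 12.80),  # roughly Neuruppin
    (51.53, 9.93),   # roughly Göttingen
]

print(round(route_length_km(route), 1))
```

Keeping places visited and places merely mentioned in separate lists would allow the two geographic networks to be measured and compared independently.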
These data aggregations will be provided to the user as interactive graphics using D3.js or,
in the case of locations connected to a specified time or period, using the DARIAH
GeoBrowser e4d⁹. To this end, we are developing XSLT transformation scenarios, built with
XQuery within our eXist-db instance (the project portal), that deliver the required JSON (D3.js) or KML
(e4d¹⁰) formats and transfer these data sets using appropriate interfaces.
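To give an idea of the JSON a D3.js network graphic might consume, the following Python sketch turns co-occurrences of entities on the same page into a nodes-and-links structure of the kind used in the force-directed graph example cited above. This is only an illustration under stated assumptions: the project itself describes XSLT and XQuery for this step, and the page identifiers, entity identifiers and the function name cooccurrence_graph are hypothetical.

```python
import json
from collections import Counter
from itertools import combinations


def cooccurrence_graph(pages):
    """Build a nodes/links structure from entities sharing a page.

    pages: mapping of page id -> iterable of entity ids on that page.
    Link values count the pages on which two entities co-occur.
    """
    pair_counts = Counter()
    entities = set()
    for ids in pages.values():
        unique = sorted(set(ids))  # count each entity once per page
        entities.update(unique)
        pair_counts.update(combinations(unique, 2))

    nodes = sorted(entities)
    index = {entity: i for i, entity in enumerate(nodes)}
    return {
        "nodes": [{"name": entity} for entity in nodes],
        "links": [
            {"source": index[a], "target": index[b], "value": n}
            for (a, b), n in sorted(pair_counts.items())
        ],
    }


# Hypothetical input: entity ids found on two notebook pages.
pages = {
    "page-1": ["p1", "p2", "p3", "p1"],
    "page-2": ["p1", "p2"],
}

graph = cooccurrence_graph(pages)
print(json.dumps(graph, indent=2))
```

The same link counts could also feed graph theoretical measures such as node degree before any visualisation takes place.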
Bibliography
[1] James Cummings, “The William Godwin’s Diaries Project: Customising and
transforming TEI P5 XML for project work”, in: Jahrbuch für Computerphilologie 10
(2008), http://computerphilologie.de/jg08/cummings.pdf (April 29, 2009), last visited on
March 27, 2013
[2] Women Writers Project, “Women Writers Online”,
http://www.wwp.brown.edu/wwo/lab/speakers.html, last visited on March 27, 2013
[3] Mike Bostock, “Force Directed Graph”, http://bl.ocks.org/mbostock/4062045, last visited
on March 27, 2013; based on data provided by Donald Knuth, “The Stanford
GraphBase: A Platform for Combinatorial Computing”, Reading 1993
[4] Gabriele Radecke, “Theodor Fontanes Notizbücher. Überlegungen zu einer
überlieferungsadäquaten Edition”, in: Martin Schubert (Ed.), Materialität in der
Editionswissenschaft, Berlin 2010 (= Beihefte zu editio; Bd. 32), pp. 95–106. – The
Berlin State Library is the owner of the notebooks and an associated partner of the
project.
[5] Project website: http://www.uni-goettingen.de/de/303691.html and
http://www.textgrid.de/community/fontane/
[6] Gemeinsame Normdatei / Integrated Authority File of the German National Library,
http://www.dnb.de/EN/Standardisierung/Normdaten/GND/gnd_node.html, last visited
on March 27, 2013
[7] Virtual International Authority File, http://viaf.org/, last visited on March 27, 2013
[8] Humphrey Southall, “Defining and identifying the roles of geographic references within
text: Examples from the Great Britain Historical GIS project”, in: Proceedings of the
HLT-NAACL 2003 workshop on Analysis of geographic references - Volume 1, pp. 69–78, doi:10.3115/1119394.1119405
[9] europeana4D: exploring data in space and time, http://dev2.dariah.eu/e4d/, an example
using the content from one single page can be found at http://goo.gl/TSNDf, last visited
on March 27, 2013
[10] EuropeanaConnect: "KML Specifications", http://tinyurl.com/e4d-kml, last visited on
June 27, 2013