From entity description to semantic analysis: The case of Theodor Fontane’s notebooks Martin de la Iglesia and Mathias Göbel, SUB Göttingen Within the last decades, TEI has become a major instrument for philologists in the digital age, particularly since the recent incorporation of a set of mechanisms to facilitate the encoding of genetic editions. Editions use the XML syntax while aiming to preserve the quantity and quality of old books and manuscripts, and to publish many more of them online mostly under free licences. Scholars all over the world are now able to use huge data sets for further research. There are many digital editions available, but only a few frameworks to analyse them. Our presentation focusses on the use of web technologies (XML and related technologies as well as JavaScript) to enrich the forthcoming edition of Theodor Fontane’s notebooks with a data driven visualisation of named entities and to build applications using such visualisations which are reusable for any other edition within the world of TEI. State of the art The TEI Guidelines provide various mechanisms for tagging references to entities in texts, as well as solutions for encoding metadata supplied by editors about such entities. Such methods are frequently employed in digital editions. For example, on the website of the edition of John Godwin’s diaries¹ we are able to highlight the names within the text in different colors. Often these parts are rendered in HTML as <acronym> and are equipped with a <div> box containing further information that pops up as the user clicks on or hovers over them. This is a simple and easy to use way to deliver further information and some search options, but it does not per se facilitate a detailed analysis. With help of the <speaker> tag within TEI encoded drama, a quantitative analysis of spoken words becomes possible. One example is provided by the Women Writers Project, that visualize speakers in drama by gender.² It is also possible to get a quantitative overview of the coappearance of two or more characters, which is done for Victor Hugo’s Les Misérables with the help of the D3.js JavaScript library.³ Persons and places seem to be the most common types of tagged entities. These are usually normalized, i.e. spelling variations are merged and matched to an authoritative name, and some additional data not found in the encoded source text is provided – most commonly biographical dates for persons and geographic coordinates for places. Additional data might include excerpts from encyclopedias, or map visualisations of the location of places. In the case of most editions, the usage of entity encoding can be characterised as descriptive, rather than analytical: information is provided about entities, but the way in which they are referenced in source texts and how the entities relate to each other is recorded and used for navigational purposes only. This paper, employing the example of a TEI edition project of 19th century notebooks, discusses further potential uses of such TEI encoded semantic annotations. Theodor Fontane’s notebooks From 1859 until the late 1880s, the German poet Theodor Fontane (1819–1898) filled almost 10,000 pages in 67 notebooks, which have not yet been published in their entirety. They include diary entries, travel notes, theater criticism and drafts for novels and poems, resulting in a wide spectrum of text types and images.⁴ The complete edition of the notebooks both in print and online is being prepaired at the Theodor Fontane-Arbeitsstelle, Department of German Philology at Göttingen University, in collaboration with the Göttingen State and University Library.⁵ In his notebooks, Fontane made extensive use of underlining, cancellations, corrections and additions, and consequently the crucial aspect of the philological edition project is to precisely transcribe, encode, annotate and visualize the appearance of Fontane’s handwriting, in order to help the reader to decipher and understand it. Another important task within this project, however, is to identify and encode references to entities in the notebooks. These include: persons, organizations – linked to authority files such as GND⁶ or VIAF⁷, online historical encyclopedias places – all of the above, plus linked to geographical databases such as GeoNames or the Getty Thesaurus of Geographic Names dates – normalized to machine-readable standards, so that dates can be sorted and durations calculated artworks, buildings – linked to their creators, locations, and provided with their dates of creation literary works, musical works – linked to their authors and, where applicable, online versions events (e.g. battles) – linked to places and provided with dates characters in works of fiction – linked to the respective works. Because of the density of occurrences and the variety of entity types, Fontane’s notebooks lend themselves to advanced methods of semantic analysis. Semantic analysis These entity occurrences are encoded in a fairly common way, using <rs> elements which link to lists of elements in which the entities are described and linked to external authority records, and <date> elements in the case of chronological references. At a later project stage, we will explore the possibilities to derive other formats from this data which facilitate the extraction and processing of their semantic content, such as Geography Markup Language (GML)/Keyhole Markup Language (KML) for spatial data, or CIDOC-CRM for events. This paper will explore how our entity data, which is available in similar form in many other TEI encoded editions, can be put to use in ways that go beyond the traditional uses described above, and which enter the realm of semantic analysis. Examples include: counting entities and calculating their relative frequency. We expect a high number and a concentration for pages where we can find short notations or lecture notes. Thus, we hope to be able to distinguish these parts from literary manuscripts; enriching personal data with birth and death dates from authority files and calculating differences in order to identify historical strata; identifying co-occurrences of persons and other entities and constructing networks in order to calculate graph theoretical measures; connecting places to routes, visualizing them on maps and calculating their distances using coordinates from external databases. Place entity references can occur in several different roles⁸: in this context, we must distinguish places visited by Fontane where he took notes, and distant places only mentioned by Fontane. It will be of interest to analyse the differences and similarities between these two geographic networks, particularly when a chronological dimension (i.e. the date of Fontane’s visit, or the date of a historic event referred to by Fontane which took place at a mentioned site) is added; comparing Fontane’s statements about entities, such as dates, locations, and names, with what we know about them today. These data aggregations will be provided to the user as interactive graphics using D3.js or in the case of locations connected to a specified time or period, using the DARIAH GeoBrowser e4d⁹. Therefore we develop XSLT transformation scenarios, build with XQuerys within our exist-db (project portal), that delivers the needed JSON (D3.js) or KML (e4d10) formats and transfer these data sets using appropriate interfaces. Bibliography [1] James Cummings, “The William Godwin’s Diaries Project: Customising and transforming TEI P5 XML for project work”, in: Jahrbuch für Computerphilologie 10 (2008), http://computerphilologie.de/jg08/cummings.pdf (April 29, 2009), last visited on March 27, 2013 [2] Women Writers Project, “Women Writers Online”, http://www.wwp.brown.edu/wwo/lab/speakers.html, last visited on March 27, 2013 [3] Mike Bostock, “Force Directed Graph”, http://bl.ocks.org/mbostock/4062045, last visited on March 27, 2013; based on data provided by Donald Knuth, “The Stanford GraphBase: A Platform for Combinatorial Computing”, Reading 1993 [4] Gabriele Radecke, “Theodor Fontanes Notizbücher. Überlegungen zu einer überlieferungsadäquaten Edition”, in: Martin Schubert (Ed.), Materialität in der Editionswissenschaft, Berlin 2010 (= Beihefte zu editio; Bd. 32), pp. 95–106. – The Berlin State Library is the owner of the notebooks and an associated partner of the project. [5] Project website http://www.unigoettingen.de/de/303691.html and http://www.textgrid.de/community/fontane/ [6] Gemeinsame Normdatei / Integrated Authority File of the German National Library, http://www.dnb.de/EN/Standardisierung/Normdaten/GND/gnd_node.html, last visited on March 27, 2013 [7] Virtual International Authority File, http://viaf.org/, last visited on March 27, 2013 [8] Humphrey Southall, “Defining and identifying the roles of geographic references within text: Examples from the Great Britain Historical GIS project”, in: Proceedings of the HLT-NAACL 2003 workshop on Analysis of geographic references - Volume 1, pp. 6978, doi:10.3115/1119394.1119405 [9] europeana4D: exploring data in space and time, http://dev2.dariah.eu/e4d/, an example using the content from one single page can be found at http://goo.gl/TSNDf, last visited on March 27, 2013 [10] EuropeanaConnect: "KML Specifications", http://tinyurl.com/e4d-kml, last visited on June 27, 2013