European Science Foundation Exploratory Workshop, Birmingham (UK), 30 March – 1 April 2009

Applying Semantic Web technologies to medieval manuscript research

Position Paper

Toby Burrows, ARC Network for Early European Research

Medieval manuscript research is a complex, fragmented, multilingual field of knowledge, which is difficult to navigate, analyse and exploit. Though printed sources are still of great importance and value, there is a large and rapidly growing body of material on the Web. Much of this Web material consists of information about manuscripts, though a considerable amount of digitization and transcription has also been carried out.

This European Science Foundation Exploratory Workshop focuses on the possibilities for applying new Semantic Web technologies to enhance medieval manuscript research. These technologies are intended to represent a complex body of knowledge in standardized ways, and to enable sophisticated discovery and reasoning tools to be applied to data and documents across the Web. These technologies have the potential to enhance medieval manuscript research greatly, by enabling much more effective access to, and use of, relevant materials and knowledge.

Imagine a Web service through which you could readily find all manuscripts of relevance to the research question you are investigating, and be pointed to previous work about them and to digital representations of them…

Medieval manuscripts: research questions

Medieval manuscripts are used in addressing a wide range of research questions. Most obviously, these include research into the characteristics of manuscripts themselves as physical objects. These characteristics include: the place of origin, the date or period of origin, the materials used, the decoration and illumination, the handwriting, the scribe, the binding, arrangement of the physical volume, and the language.
Research into the subsequent history of a manuscript looks at its owners and at changes to its appearance over time, as well as at its modern location, and its place in modern collections. Relationships between manuscripts are a common topic, including research which reunites dispersed leaves of what was originally a single manuscript. Identifying connections between specific medieval manuscripts and other materials which survive from the medieval period (especially art works, buildings, and other material objects) is another closely related area of research.

Defining these physical characteristics also forms the starting-point of many research projects, e.g. defining specific styles of handwriting, establishing different categories of decoration, and identifying different ways of arranging or binding physical volumes in the medieval period. Fundamental to all these kinds of research projects is the availability of standardized descriptions of manuscripts as physical objects. In the digital arena, the most promising approach to standardizing descriptions is that of the Text Encoding Initiative (TEI) (1).

The other major area of research involves the use of manuscripts as evidence for all aspects of life in the medieval period. This requires knowledge and understanding of the content of a manuscript – the text, the images, the music, etc. This kind of research is heavily dependent on the descriptors used to identify the content, including authors’ names, titles of works, incipits (opening words), subject and concept terms, and so on.

Both these major areas of research also draw on a knowledge of the secondary literature relating to specific manuscripts: catalogues and descriptions (medieval and modern), secondary works, bibliographies, etc. These are likely to reflect changes over time – as concepts shift, and descriptions and attributions are revised.
All aspects of the body of knowledge in this field are also multilingual; descriptions and descriptors may be in a variety of (mainly European) languages.

Web services for medieval manuscript research

There are many existing Web services relevant to medieval manuscript research. At present, they have to be consulted separately and individually, though search engines like Google cover some of them. These services employ a range of different descriptive standards and vocabularies, and use a variety of different technologies to make their information available on the Web.

Numerous collecting institutions provide information about the manuscripts they hold, either as part of more general databases or as specific manuscript databases (2-3). There is a range of national databases (4-5) as well as a small number of international databases (6-8). Some of these services provide digital images of manuscripts as well as descriptive information about them. Europeana, for instance, focuses specifically on digitized materials, but its scope is much broader than manuscripts (9). There are many Web sites which list, transcribe, or provide digital images of manuscripts of a specific text or relating to a specific medieval author (10-11).

Ancillary Web services include sites devoted to manuscript terminology and vocabularies (e.g. 12), incipits (13), subjects (14), authors (15-16), and people more generally (17-20). Other services provide indexes to journal articles, scholarly books and other secondary literature about specific manuscripts (e.g. 21-22).

Semantic Web technologies

Semantic Web technologies are methods for adding semantic structures to Web data and documents, with the broad aim of making them more interoperable and automatically discoverable. The main building-blocks for Semantic Web services are as follows.

Object identifiers: machine-processable alpha-numeric addresses (URIs) used to identify an object uniquely.
It is possible to assign identifiers to abstract “objects” like concepts, subject terms, personal names and place names, as well as to physical objects like manuscripts.

Ontologies and ontological languages: ways of structuring the relationships between elements of a body of knowledge, expressed in a formal language like OWL or SKOS. The result is a machine-readable conceptual map of a domain of knowledge.

RDF databases: collections of statements about objects, their properties, and their relationships (“triples”), expressed in the RDF syntax. These statements can be used to show how and where an object fits in the ontological structure of the body of knowledge.

Agent systems and Web services: software environments which can be built to explore, analyse and exploit the knowledge embedded in ontologies and RDF databases.

Future directions

What sorts of projects would be required?

Transforming existing knowledge into Semantic Web forms: Assigning and maintaining URIs for objects; Transforming existing vocabularies into SKOS-type ontologies; Mapping between vocabularies (SKOS, OWL).

Making these building-blocks available on the Web: Delivering ontologies for use over the Web; Building RDF databases (“triple stores”) out of existing manuscript descriptions; Embedding ontologies into these databases.

Building Web services to exploit these data stores: Building query and browse services (e.g. using SPARQL); Enabling decentralized updating of these services.

Who should be involved in such projects? How should such projects be organized and managed?

Research groups and individual researchers (to ensure relevance; test, correct and update; supply data and expertise); Libraries and other collecting institutions (to supply data and expertise); Technology experts (to design and build services); Commercial firms (to supply data and expertise).

What sources of funding should be pursued?
European Science Foundation; Other European Union funding sources; National schemes; Foundations and other non-government sources.

Terminology

OWL: Web Ontology Language
RDF: Resource Description Framework
SKOS: Simple Knowledge Organization System
SPARQL: a protocol and query language for RDF
Triple: a subject–predicate–object expression in RDF
URI: Uniform Resource Identifier

References

1. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/MS.html
2. British Library Manuscripts Catalogue: http://www.bl.uk/catalogues/manuscripts/INDEX.asp
3. Codices Electronici Sangallenses: http://www.cesg.unifr.ch/en/
4. Medieval Manuscripts in Dutch Collections: http://www.mmdc.nl/static/site/index.html
5. IRHT MEDIUM: http://www.irht.cnrs.fr/ressources/medium_frame.htm
6. Digital Scriptorium (US): http://www.scriptorium.columbia.edu/
7. CERL Portal: http://cerl.epc.ub.uu.se/sportal/
8. ENRICH/Manuscriptorium: http://enrich.manuscriptorium.com/
9. Europeana: http://europeana.eu/portal/
10. Dante, Divina Commedia: http://www.danteonline.it/italiano/codici_indice.htm
11. Chrétien de Troyes, Le Chevalier de la Charrette: http://lancelot.baylor.edu/
12. Denis Muzerelle, Vocabulaire codicologique: http://vocabulaire.irht.cnrs.fr/vocab.htm
13. In Principio (Brepols)
14. International Medieval Bibliography (Brepols): subject thesaurus
15. International Medieval Bibliography (Brepols): author lists
16. Medieval Manuscripts in Dutch Collections: list of authors: http://www.mmdc.nl/static/media/2/17/authors.pdf
17. Personennamen des Mittelalters (online as part of Personennamendatei): http://www.d-nb.de/eng/standardisierung/normdateien/pnd.htm
18. Europa Sacra (Brepols)
19. Fasti Ecclesiae Anglicanae 1066-1300: http://www.british-history.ac.uk/subject.aspx?subject=2&gid=39
20. CERL Thesaurus: http://cerl.sub.uni-goettingen.de/ct/
21. Scriptorium indexes: http://www.scriptorium.be/en/frameset2.htm
22. International Medieval Bibliography (Brepols): manuscripts index
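
Illustrative sketch

The "RDF databases" building-block described under Semantic Web technologies can be pictured as a collection of subject, predicate, object statements in which every object is named by a URI. The Python fragment below is a minimal in-memory sketch of that idea and nothing more. The example.org namespace, the property names and the manuscript and place identifiers are all hypothetical. An actual service would presumably hold such statements in a triple store and query them with SPARQL rather than with this hand-written matcher.

```python
from typing import Iterator, List, Optional, Tuple

# A triple is a (subject, predicate, object) statement.
Triple = Tuple[str, str, str]

# Hypothetical namespace and property URIs, invented for this illustration only.
EX = "http://example.org/"
ORIGIN = EX + "prop/placeOfOrigin"
LANGUAGE = EX + "prop/language"

# Hypothetical statements about three manuscripts.
triples: List[Triple] = [
    (EX + "ms/1", ORIGIN, EX + "place/A"),
    (EX + "ms/1", LANGUAGE, "la"),
    (EX + "ms/2", ORIGIN, EX + "place/A"),
    (EX + "ms/3", ORIGIN, EX + "place/B"),
]


def match(
    store: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in store:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


def manuscripts_from(store: List[Triple], place_uri: str) -> List[str]:
    """Return the URIs of manuscripts whose place of origin is place_uri."""
    return sorted(subject for subject, _, _ in match(store, p=ORIGIN, o=place_uri))


if __name__ == "__main__":
    for uri in manuscripts_from(triples, EX + "place/A"):
        print(uri)
```

Run as a script, this prints the URIs of the two hypothetical manuscripts recorded as originating in place A. Because the place is itself identified by a URI rather than by a name in one language, the same pattern would find the manuscripts whichever language their descriptions were written in.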