The Closed World of Databases meets the Open World of the Semantic Web
Thursday 12th October 2006, National e-Science Centre, Edinburgh

Data Webs: new visions for research data on the Web

David Shotton
Image BioInformatics Research Group
Oxford e-Research Centre and Department of Zoology, University of Oxford, UK
e-mail: david.shotton@zoo.ox.ac.uk

Outline
- The central problem: how best to integrate distributed research data
- The World Wide Web as we know it
- The Deep Web
- Database integration activities in practice
- Semantic Web tools for database access
- Data webs – a lightweight Semantic Web approach
- ImageWeb – a data web for research images
- The ImageWeb Consortium
- Requirements analysis for image integration using data webs
- Conclusion

The Web provides documents for humans to read
The World Wide Web is familiar as an environment in which document publication is cheap and easy, and linking between documents is trivial
It works because of two fundamental technologies:
- the Hypertext Transfer Protocol (HTTP) for data packet distribution
- the Hypertext Markup Language (HTML) for Web page formatting
Search engines make the content of (most) Web pages available to us
But the Web is inflexible, since HTML conveys no meaning about the text it marks up
Differences in data presentation formats make collating information from multiple web pages hard for humans, and well-nigh impossible for machines

The 'Hidden' or 'Deep' Web
Google indexes only a small proportion of the information available on the Web
Most data are stored in relational databases that are opaque to search spiders
Instead, these data sources produce their results dynamically, in response to direct requests
- externally, using a bespoke browser interface or a Web Services interface
- internally, using SQL
In the context of this meeting, such data should not be regarded as 'legacy' data – relational databases will be with us for decades to come, with new ones continually being developed
Repeated daily access to the content of many different databases is very important for people working in a large number of application areas, including
- the life sciences and pharmaceuticals – bioinformatics 'in silico' research
- e-Government
- financial institutions
. . . and there are lots of databases!

Database access
Databases differ from one another in both structure and semantics
- what is an "employee ID" in one database is a "payroll number" in another
There is a general lack of standards for formats and data types
Using complex database interfaces effectively is not easy for a newcomer, often demanding a high degree of familiarity and tacit knowledge
As a result, people tend to limit their use to a small number of key databases
Database structures and interfaces are subject to modification and upgrading without warning, jeopardising 'screen-scraping' methods for automated data harvesting
Cross-searching or integration between these distributed heterogeneous information resources is currently well-nigh impossible

Database integration – the conventional approach
[Diagram: separate databases, RDF data and RSS feeds on the World Wide Web, each accessed independently by a research user]
- Separate resources are searched independently and sequentially
- Information is downloaded as required
- 'Data integration' is usually limited to cutting and pasting into a Word document!
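The "employee ID" versus "payroll number" problem above can be sketched in a few lines: if each source declares how its local field names map onto a shared vocabulary, records from heterogeneous schemas become collatable. This is a minimal illustrative sketch only; the source names, field names and shared term are all invented for the example.

```python
# A minimal sketch of reconciling schema heterogeneity: two databases use
# different column names ("employee_id" vs "payroll_number") for the same
# concept, and a per-source field mapping normalises records onto shared
# terms. All names here are illustrative, not taken from any real system.

FIELD_MAPPINGS = {
    "db_hr":      {"employee_id":    "staffIdentifier"},
    "db_payroll": {"payroll_number": "staffIdentifier"},
}

def normalise(source, record):
    """Rename source-specific fields to the shared vocabulary."""
    mapping = FIELD_MAPPINGS[source]
    return {mapping.get(k, k): v for k, v in record.items()}

a = normalise("db_hr",      {"employee_id": "E-1042", "name": "Ada"})
b = normalise("db_payroll", {"payroll_number": "E-1042", "dept": "Zoology"})

# Once both records speak the same vocabulary, they can be collated on it
merged = {**a, **b}
print(merged["staffIdentifier"])
```

Writing such mappings by hand for every pair of databases does not scale, which is exactly why the declarative, ontology-backed approaches described below are needed.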
The Semantic Web
Tim Berners-Lee's vision of "the Web of data"
The Semantic Web extends the Web by providing a data representation with
- both syntactic consistency and a semantic framework
- enabling both interoperability and computational inferencing
It involves three technologies, each resting hierarchically on the previous one:
- the eXtensible Markup Language (XML) for tag definition, with XML Schema providing syntactic structure
- the Resource Description Framework (RDF) for making simple logical statements (subject-predicate-object) describing the relationships between entities, with RDF Schema providing semantic structure
- the Web Ontology Language (OWL) to encode the supporting ontologies that provide semantic definitions of the entities

The Semantic Web's Killer App
To realise the vision of the "Web of Data", there is a growing need for RDF applications that can access the content of huge live heterogeneous distributed non-RDF relational databases, making them cross-searchable and interoperable
Indeed, database integration will be the Semantic Web's killer app
It is simply not feasible to replicate every existing database in RDF
Rather, various SQL ⇋ RDF 'bridges' are being developed
We favour a combination of
- SPARQL, the W3C query language for RDF, and
- Chris Bizer's developments of D2R (Database to RDF) applications

SPARQL – the query language for RDF
RDF triples can come from a variety of sources
- directly from an RDF document
- inferred from other RDF triples
- as RDF expressions of data stored in other formats, such as XML or relational databases
SPARQL is a standardized protocol and query language for such RDF data
- W3C Candidate Recommendation, 6 April 2006
- http://www.w3.org/TR/rdf-sparql-query/
- largely the work of Eric Prud'hommeaux (W3C) and Andy Seaborne (Semantic Web Group, Hewlett Packard Research Laboratories, Bristol)
SPARQL provides facilities to:
- extract information in the form of URIs, blank nodes, and plain and typed literals
- extract RDF subgraphs
- construct new RDF graphs based on information in the queried graphs

Chris Bizer's work: 1 – D2RQ
D2RQ permits non-RDF relational databases to be treated as read-only virtual RDF graphs
- http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2rq/index.htm
- http://sites.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/spec/
The D2RQ Mapping Language is a declarative language for describing the relation between relational database schemata and OWL / RDFS ontologies
- It generates an RDF graph for the relationships between the data objects in each database, using the table names as class names and the column names as property names
- This permits OWL and RDFS inferencing over the content of the database

Chris Bizer's work: 2 – D2R Server
"A tool for publishing relational databases on the Semantic Web"
- http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2r-server/index.html
The server enables RDF Semantic Web browsers such as PiggyBank to navigate the content of non-RDF databases as well as RDF resources
It also permits HTML browsers to navigate the content of non-RDF databases, in this case by generating XHTML representations
Furthermore, using D2RQ mappings, it permits applications to query the database directly using the SPARQL query language

Biological data – universals and particulars
Biological results may represent 'universal truths', such as the sequence of a particular gene
- These form bounded data sets
- The data need only be discovered once
- Such information is typically published in a large global bioinformatics database
Research data can also be 'particulars' rather than 'universals', for example individual assay results, microscopy images and wildlife photos
- These data form unbounded data sets
- Data collection will never be complete
- Such data are not widely available online

Where are research images published?
Most are never published – in this digital age, this is a waste of research resources
Some appear as figures in published journal articles, or as 'supplementary data'
Others are housed in distributed specialist databases
- e.g. our new BBSRC-funded Drosophila Testis Gene Expression Database (http://www.fly-ted.org), which is built on our BioImage Database system (http://www.bioimage.org)
Yet others are held in institutional repositories or specialist research collections

Revising our original database population model
Image database population typically involves specific acts of data submission, whereby users provide metadata describing their images and a thumbnail image to assist user browsing
The exact location of the original high-resolution digital image file is less important
- It can be submitted to the database with the metadata
- Alternatively, if more convenient, it can sit on a secure server elsewhere, to which it can be hyperlinked
But wait a minute – why is it even necessary to submit the metadata?
Why not just harvest metadata from the Web?
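The harvesting alternative can be pictured in miniature: rather than waiting for submissions, a central index pulls the metadata that providers have already published alongside their images. The provider "endpoints" below are simulated as in-memory dictionaries, and every URL and record is invented purely for illustration.

```python
# A toy sketch of harvesting in place of submission: image metadata that
# providers have already published is pulled into one central index.
# Real harvesting would fetch over HTTP; here the provider endpoints are
# simulated as dictionaries, and all names and URLs are illustrative.

PROVIDERS = {
    "http://example.org/repo-a/metadata": [
        {"image": "http://example.org/repo-a/img/1", "keyword": "testis"},
    ],
    "http://example.org/repo-b/metadata": [
        {"image": "http://example.org/repo-b/img/7", "keyword": "mouse"},
        {"image": "http://example.org/repo-b/img/8", "keyword": "testis"},
    ],
}

def harvest(providers):
    """Gather published metadata records into one central index by keyword."""
    index = {}
    for endpoint, records in providers.items():
        for record in records:
            index.setdefault(record["keyword"], []).append(record["image"])
    return index

registry = harvest(PROVIDERS)
print(registry["testis"])  # images from both repositories, no submission step
```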
With this thought, the concept of the data web was born

Data webs – a lightweight approach to data integration
The data web is a novel concept for digital information storage and integration using Semantic Web tools
The data are NOT submitted to a central database, but are simply published by the data providers, with appropriate metadata, on their own Web servers
Then, separately for each data web serving a particular knowledge domain, lightweight software tools are used to harvest, marshal and index metadata describing the distributed data into a central ontology-enabled data web registry
By requiring only core metadata, conforming to a specific minimalist data web ontology, each data web overcomes the problems caused by syntactic and semantic differences in data presentation between providers, and makes collating selected information from multiple web sites possible for machines
Data webs represent a step towards Berners-Lee's vision of the World Wide Web as a 'Web of Data'
Data webs are particularly suitable for integrating data that represent research 'particulars'

The traditional database philosophy
In a database, the data are central, surrounded by submission, query and access protocols
[Diagram: data at the centre, ringed by metadata/schema, protocols, and submission, query and access interfaces]

. . . and the data web philosophy
A data web can be thought of as an 'inside out' database
[Diagram: distributed data at the periphery, with query at the centre]

The data web – metadata collection and composition
[Diagram: databases, RDF data and RSS feeds on the World Wide Web feeding metadata, via conversion modules where necessary, into the data web registry, where it is marshalled against ontologies]
Metadata acquisition is
- either direct, as RDF
- or after adaptation, either 'in house' or using a conversion module hosted elsewhere, whereby the relational data are mapped to RDF

Role of the central metadata registry
The data web registry acts first as a data marshal, gathering, ordering and integrating the metadata from across the web into a single searchable RDF graph
- Remember: with RDF, integration comes for free!
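"Integration comes for free" because an RDF graph is simply a set of subject-predicate-object triples, so combining metadata from different providers is set union, with no schema-alignment step. A minimal sketch, with triples as Python tuples and all URIs and property names invented for illustration:

```python
# RDF graphs are sets of (subject, predicate, object) triples, so merging
# metadata harvested from two providers is just set union.
# The URIs and property names below are invented for illustration.

graph_a = {
    ("http://example.org/img/1", "dc:title",   "Testis section, slide 3"),
    ("http://example.org/img/1", "dc:creator", "http://example.org/people/js"),
}
graph_b = {
    ("http://example.org/img/1", "dc:subject", "Drosophila melanogaster"),
    ("http://example.org/img/2", "dc:title",   "Wildlife photo"),
}

merged = graph_a | graph_b  # integration is set union; no alignment needed

def match(graph, s=None, p=None, o=None):
    """Rudimentary triple-pattern match (None is a wildcard), the basic
    operation that SPARQL graph patterns generalise."""
    return {t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

# Everything known about image 1, drawn from both sources at once
print(match(merged, s="http://example.org/img/1"))
```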
It then provides an integrated cross-searchable access point to all the data in the data web, with both human and programmatic access
The data web registry adds value by providing interoperability and customizable search interfaces, with a rigorous semantic underpinning
The primary data holders benefit from increased user traffic to their sites, while at the same time being able to maintain normal copyright and access control
The primary metadata are never controlled by the data web registry, but are freely available on the Web for use by other, presently unforeseen, applications, including novel data mining, integration and analysis services

The data web – metadata exposure and user query
[Diagram: the data web registry exposes the marshalled metadata to a data web user]
Metadata exposure permits user queries over the entire data web

The data web model – user referral to source data
[Diagram: the data web registry refers the data web user back to the original sources on the World Wide Web]
Users are then referred to the original sources of data matching their queries

Data web advantages
A data web of this type will have all the advantages of the World Wide Web itself:
- distributed data
- freedom, decentralization and low cost of publication
- lack of centralized control
- built-in evolvability and scalability
Data webs are Open – Open – Open – Open
- support for open access data publication
- use of open source software components
- open standards for software and metadata development
- an Open World ("missing isn't broken") data philosophy

How does a data web differ from Google?
A data web will provide for the selected data what Web search engines such as Google do for conventional Web pages, but with the following advantages:
- It permits database information hidden in the Deep Web to be accessed
- It involves specific targeting to a particular knowledge domain, thus achieving a significantly higher signal-to-noise ratio
- It provides integration of information with ontological underpinning, semantic coherence, and truth maintenance
- It permits programmatic access, enabling further services to be built on top of one or more data webs

Google Images search results
Conventional 'Google-like' searching for images by means of exact keyword matching (here for "mouse") gives results which are rather unpredictable!
Hopefully, with some ontological semantic underpinning, search results will be more accurate

The ImageWeb Project
Purpose: to integrate and make cross-searchable research images held by publishers, in institutional repositories, and in specialist research collections, which currently exist in isolated data silos
ImageWeb should involve minimum effort on the part of the publishers and repository managers, who must be able to use their existing relational databases, XML metadata schemas or RSS feeds
It requires harvesting of thumbnails and basic metadata describing the images
Metadata from all the participating sites will be cross-searchable at the registry
ImageWeb will permit owners of research image collections to publish their images in a way that can easily integrate with other research image collections
ImageWeb will enable publishers' web sites to become a more integral part of day-to-day research, and published images to be used more fully than at present

ImageWeb – an (imperfect) real-world analogy
The local newspaper property section contains thumbnail images and basic metadata about houses for sale – equivalent to the ImageWeb registry
[Image: newspaper property advertisement, showing a thumbnail image and minimal metadata]
Users searching this central 'registry' pick out what they like, and then . . .
. . . go round to the estate agent's office for full details!

The ImageWeb Consortium
Image BioInformatics Research Group, University of Oxford
Leading commercial publishers
- Nature and Oxford University Press
Leading Open Access publishers
- The Public Library of Science and BioMed Central
University institutional repositories
- Universities of Cambridge, Imperial College, Oxford and Southampton
Other stakeholders: British Library, CCLRC, UKOLN, ILRT, CrossRef, SPARC
Professional biologists and academic biological image collections

The first step – requirements analysis
"Defining Image Access: Requirements for interoperable discovery and delivery of image data stored in DSpace, EPrints and Fedora-based institutional repositories using a data web approach"
A six-month requirements analysis project funded by the JISC, as part of the Discovery to Delivery strand of their Repositories and Preservation Programme
It involves my Image Bioinformatics Research Group with the following partners:
- repository partners at Cambridge, Imperial College, Oxford and Southampton
- UKOLN, Digital Repositories Programme Support Team
- CCLRC e-Science Centre
Its deliverable will be a report
- detailing the findings and conclusions from our investigations
- recommending best practices for adoption to enhance image interoperability between institutional repositories
- providing repository implementation guidelines for the creation of data webs
- identifying open source software systems that provide data web functionality

Conclusion – the future of Web publishing
With the advent of the Semantic Web, the possibility exists for centralized databases to give way to a new paradigm in which everyone publishes their own research results
We are entering the age of distributed personal data publication
- Most research data will in future not be submitted to centralized databases
- Rather, data will be published locally by individuals, institutional repositories or journals, complete with semantically rich metadata
- This form of distributed publication is most appropriate for 'particulars' data such as biological research images, where submission to a central database does not scale
Use of lightweight Semantic Web technologies will enable isolated distributed repositories and databases to be unified into cross-searchable data webs

Acknowledgements
Chris Catton
Graham Klyne
N.B. We are looking for a new researcher to join the Oxford Image Bioinformatics Research Group this autumn
Enquiries and CVs to david.shotton@zoo.ox.ac.uk

The end