The Closed World of Databases meets the
Open World of the Semantic Web
Thursday 12th October 2006
National e-Science Centre, Edinburgh
Data Webs:
new visions for research data on the Web
David Shotton
Image BioInformatics Research Group
Oxford e-Research Centre and
Department of Zoology
University of Oxford, UK
e-mail: david.shotton@zoo.ox.ac.uk
Outline
The central problem: How best to integrate distributed research data
• The World Wide Web as we know it
• The Deep Web
• Database integration activities in practice
• Semantic Web tools for database access
• Data webs – a lightweight Semantic Web approach
• ImageWeb – a data web for research images
• The ImageWeb Consortium
• Requirements analysis for image integration using data webs
• Conclusion
The Web provides documents for humans to read
• The World Wide Web is familiar as an environment in which document publication is cheap and easy, and linking between documents is trivial
• It works because of two fundamental technologies:
  - Hypertext Transfer Protocol (HTTP) for data packet distribution
  - Hypertext Markup Language (HTML) for Web page formatting
• Search engines make the content of (most) Web pages available to us
• But the Web is inflexible, since HTML conveys no meaning about the text it marks up
• Differences in data presentation formats make collating information from multiple web pages hard for humans, and well nigh impossible for machines
The ‘Hidden’ or ‘Deep’ Web
• Google indexes only a small proportion of the information available on the Web
• Most data are stored in relational databases that are opaque to search spiders
• Instead, these data sources produce their results dynamically, in response to direct requests
  - Externally, using a bespoke browser interface or a Web Services interface
  - Internally, using SQL
• In the context of this meeting, such data should not be regarded as ‘legacy’ data – relational databases will be with us for future decades, with new ones being developed
• Repeated daily access to the content of many different databases is very important for people working in a large number of application areas, including
  - the life sciences and pharmaceuticals – bioinformatics ‘in silico’ research
  - e-Government
  - financial institutions
. . . and there are lots of databases!
Database access
• Databases differ from one another in both structure and semantics
  - What is an “employee ID” in one database is a “payroll number” in another
• There is a general lack of standards for formats and data types
• Using complex database interfaces effectively is not easy for a newcomer, often demanding a high degree of familiarity and tacit knowledge
• As a result, people tend to limit their use to a small number of key databases
• Database structures and interfaces are subject to modification and upgrading without warning, jeopardising ‘screen-scraping’ methods for automated data harvesting
• Cross-searching or integration between these distributed heterogeneous information resources is currently well-nigh impossible
Database integration – the conventional approach
[Diagram: a research user accesses distributed resources – databases, RDF data and an RSS feed – individually across the World Wide Web]
• Data integration?
  - Separate resources are searched independently and sequentially
  - Information is downloaded as required
  - ‘Data integration’ is usually limited to cutting and pasting into a Word document!
The Semantic Web
• Tim Berners-Lee’s vision of “the Web of data”
• The Semantic Web extends the Web by providing a data representation with
  - both syntactic consistency and a semantic framework
  - enabling both interoperability and computational inferencing
• It involves three technologies, each resting hierarchically on the previous one:
  - The eXtensible Markup Language (XML) for tag definition, with XML Schema providing syntactical structure
  - The Resource Description Framework (RDF) for making simple logical statements (subject-predicate-object) describing the relationships between entities, with RDF Schema providing semantic structure
  - The Web Ontology Language OWL to encode the supporting ontologies that provide semantic definitions of the entities
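The subject-predicate-object model can be illustrated with a minimal sketch; the graph, the example.org URIs and the describe helper below are hypothetical illustrations, not real vocabulary terms:

```python
# A minimal sketch of the RDF data model: every statement is a
# (subject, predicate, object) triple, and a graph is a set of them.
# All URIs here are invented examples.

graph = {
    ("http://example.org/image42", "http://example.org/depicts", "http://example.org/geneX"),
    ("http://example.org/image42", "http://example.org/creator", "Jane Smith"),
    ("http://example.org/geneX", "http://example.org/organism", "Drosophila melanogaster"),
}

def describe(graph, subject):
    """Return all (predicate, object) pairs asserted about a subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)

for predicate, obj in describe(graph, "http://example.org/image42"):
    print(predicate, "->", obj)
```

Because each statement has the same shape, statements about one entity made by different providers slot into the same graph without any schema negotiation.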
The Semantic Web’s Killer App
• To realise the vision of the “Web of Data”, there is a growing need for RDF applications that can access the content of huge live heterogeneous distributed non-RDF relational databases, making them cross-searchable and interoperable
• Indeed, database integration will be the Semantic Web’s killer app
• It is simply not feasible to replicate each existing database into RDF
• Rather, various SQL ⇋ RDF ‘bridges’ are being developed
• We favour a combination of
  - SPARQL, the W3C query language for RDF, and
  - Chris Bizer’s developments of D2R (Database to RDF) applications
SPARQL – the query language for RDF
• RDF triples can come from a variety of sources
  - directly from an RDF document
  - inferred from other RDF triples
  - as RDF expressions of data stored in other formats, such as XML or relational databases
• SPARQL is a standardized protocol and query language for such RDF data
  - W3C Candidate Recommendation, 6 April 2006
  - http://www.w3.org/TR/rdf-sparql-query/
  - Largely the work of Eric Prud'hommeaux (W3C) and Andy Seaborne (Semantic Web Group, Hewlett Packard Research Laboratories, Bristol)
• SPARQL provides facilities to:
  - extract information in the form of URIs, blank nodes, plain and typed literals
  - extract RDF subgraphs
  - construct new RDF graphs based on information in the queried graphs
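At its core, a SPARQL query matches a graph pattern containing variables against the RDF triples. As a rough illustration of what a single triple pattern does, here is a minimal Python sketch; the triples and the match function are invented for the example, not part of SPARQL or any real query engine:

```python
# Sketch of a SPARQL basic graph pattern: components starting with '?'
# are variables that bind to matching triple components; everything else
# must match exactly. Real SPARQL joins many such patterns.

triples = [
    ("img1", "depicts", "mouse"),
    ("img2", "depicts", "fly"),
    ("img1", "creator", "Alice"),
]

def match(triples, pattern):
    """Yield one variable-binding dict per triple matching the pattern."""
    for triple in triples:
        binding = {}
        for pat, val in zip(pattern, triple):
            if pat.startswith("?"):
                binding[pat] = val
            elif pat != val:
                break  # constant mismatch: this triple does not match
        else:
            yield binding

# Roughly: SELECT ?img WHERE { ?img depicts "mouse" }
results = list(match(triples, ("?img", "depicts", "mouse")))
print(results)  # [{'?img': 'img1'}]
```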
Chris Bizer’s work: 1 - D2RQ
• D2RQ permits non-RDF relational databases to be treated as read-only virtual RDF graphs
  - http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2rq/index.htm
  - http://sites.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/spec/
• The D2RQ Mapping Language is a declarative mapping language for describing the relation between relational database schemata and OWL / RDFS ontologies
  - It generates an RDF graph for the relationships between the data objects in each database, using the table names as class names, and the column names as property names
  - This permits OWL and RDFS inferencing over the content of the database
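The default mapping idea – table names become class names, column names become property names – can be sketched as follows. This is an illustrative simplification, not the actual D2RQ Mapping Language: the row_to_triples helper, the prefixes and the example.org URIs are all invented for the example:

```python
# Sketch of a D2RQ-style default mapping from one relational row to
# RDF-style triples: the table supplies the class, each column supplies
# a property, and the primary key mints the subject URI.
# (Hypothetical helper; real D2RQ mappings are declarative and richer.)

def row_to_triples(table, primary_key, row):
    """Map one relational row (a dict) to a list of (s, p, o) triples."""
    subject = f"http://example.org/{table}/{row[primary_key]}"
    triples = [(subject, "rdf:type", f"ex:{table}")]  # table name -> class
    for column, value in row.items():
        if column != primary_key:
            triples.append((subject, f"ex:{table}#{column}", value))  # column -> property
    return triples

row = {"id": 7, "title": "Testis section", "stain": "DAPI"}
for t in row_to_triples("Image", "id", row):
    print(t)
```

Once rows are virtualised as triples like these, the same SPARQL machinery used for native RDF can query the relational content unchanged.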
Chris Bizer’s work: 2 - D2R Server
• “A tool for publishing relational databases on the Semantic Web”
  - http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2r-server/index.html
• The server enables RDF Semantic Web browsers such as PiggyBank to navigate the content of non-RDF databases as well as RDF resources
• It also permits HTML browsers to navigate the content of non-RDF databases, in this case by generating XHTML representations
• Furthermore, using D2RQ mappings, it permits applications to query the database directly using the SPARQL query language
Biological data – Universals and particulars
• Biological results may represent ‘universal truths’, such as the sequence of a particular gene
  - These form bounded data sets
  - The data need only be discovered once
  - Such information is typically published in a large global bioinformatics database
• Research data can also be ‘particulars’ rather than ‘universals’, for example individual assay results, microscopy images and wildlife photos
  - These data form unbounded data sets
  - Data collection will never be complete
  - Such data are not widely available on line
Where are research images published?
• Most are never published – in this digital age, this is a waste of research resources
• Some appear as figures in published journal articles, or as ‘supplementary data’
• Others are housed in distributed specialist databases
  - e.g. our new BBSRC-funded Drosophila Testis Gene Expression Database (http://www.fly-ted.org), which is built on our BioImage Database system (http://www.bioimage.org)
• Yet others are held in institutional repositories or specialist research collections
Revising our original database population model
• Image database population typically involves specific acts of data submission, whereby users provide metadata describing their images and a thumbnail image to assist user browsing
• The exact location of the original high-resolution digital image file is less important
  - It can be submitted to the database with the metadata
  - Alternatively, if more convenient, it can sit on a secure server elsewhere, to which it can be hyperlinked
• But wait a minute – why is it even necessary to submit the metadata?
• Why not just harvest metadata from the Web?
• With this thought, the concept of the data web was born
Data webs – a lightweight approach to data integration
The data web is a novel concept for digital information storage and integration using Semantic Web tools
• The data are NOT submitted to a central database, but are simply published by the data providers with appropriate metadata on their own Web servers
• Then, separately for each data web serving a particular knowledge domain, lightweight software tools are used to harvest, marshal and index metadata describing the distributed data into a central ontology-enabled data web registry
• By requiring only core metadata, conforming to a specific minimalist data web ontology, each data web overcomes the problems caused by syntactic and semantic differences in data presentation between providers, and makes collating selected information from multiple web sites possible for machines
• Data webs represent a step towards Berners-Lee’s vision of the World Wide Web as a ‘Web of Data’
• Data webs are particularly suitable for integrating data that represent research ‘particulars’
The traditional database philosophy
• In a database, the data are central, surrounded by submission, query and access protocols
[Diagram: data and metadata/schema at the centre, ringed by submission, query and access protocols]
. . . and the data web philosophy
• A data web can be thought of as an ‘inside out’ database
[Diagram: an ‘inside out’ database, with query at the centre]
The data web – metadata collection and composition
[Diagram: distributed databases, RDF data and an RSS feed on the World Wide Web feed metadata – via conversion modules where needed – into an ontology-supported data web registry for data marshalling]
• Metadata acquisition
  - Either direct, as RDF
  - Or after adaptation, either ‘in house’ or using a conversion module hosted elsewhere, whereby the relational data is mapped to RDF
Role of the central metadata registry
• The data web registry acts first as a data marshal, gathering, ordering and integrating the metadata from across the web into a single searchable RDF graph
  - Remember: With RDF, integration comes for free!
• It then provides an integrated cross-searchable access point to all the data in the data web, with both human and programmatic access
• The data web registry adds value by providing interoperability and customizable search interfaces, with a rigorous semantic underpinning
• The primary data holders benefit by increased user traffic to their sites, while at the same time being able to maintain normal copyright and access control
• The primary metadata are never controlled by the data web registry, but are freely available on the Web for use by other presently unforeseen applications, including novel data mining, integration and analysis services
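The “integration comes for free” point can be made concrete: because every provider’s metadata reduces to the same triple structure, marshalling is simply set union, and a cross-search runs over the merged graph. A minimal sketch with hypothetical provider data:

```python
# Marshalling a data web registry: RDF graphs from independent providers
# merge by plain set union, with no schema alignment step.
# All identifiers below are invented examples.

provider_a = {("img1", "depicts", "fly testis"), ("img1", "creator", "Lab A")}
provider_b = {("img2", "depicts", "fly testis"), ("img2", "creator", "Lab B")}

registry = provider_a | provider_b  # the marshalled registry graph

# One query now cross-searches both providers:
matches = sorted(s for s, p, o in registry if p == "depicts" and o == "fly testis")
print(matches)  # ['img1', 'img2']
```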
The data web – metadata exposure and user query
[Diagram: the data web registry exposes the marshalled metadata to a data web user, above the distributed databases, RDF data and RSS feed on the World Wide Web]
• Metadata exposure permits user queries over the entire data web
The data web model – user referral to source data
[Diagram: the data web registry refers the data web user back to the matching source databases, RDF data and RSS feed on the World Wide Web]
• Users are then referred to the original sources of data matching their queries
Data web advantages
• A data web of this type will have all the advantages of the World Wide Web itself:
  - Distributed data
  - Freedom, decentralization and low cost of publication
  - Lack of centralized control
  - Built-in evolvability and scalability
• Data webs are Open – Open – Open – Open
  - Support for open access data publication
  - Use of open source software components
  - Open standards for software and metadata development
  - An Open World (“missing isn’t broken”) data philosophy
How does a data web differ from Google?
• A data web will provide for the selected data what Web search engines such as Google do for conventional Web pages, but with the following advantages:
  - It permits database information hidden in the Deep Web to be accessed
  - It involves specific targeting to a particular knowledge domain, thus achieving a significantly higher signal-to-noise ratio
  - It provides integration of information with ontological underpinning, semantic coherence, and truth maintenance
  - It permits programmatic access, enabling further services to be built on top of one or more data webs
Google Images search results
• Conventional ‘Google-like’ searching for images by means of exact keyword matching (here for “mouse”) gives results which are rather unpredictable!
• Hopefully, with some ontological semantic underpinning, search results will be more accurate
The ImageWeb Project
• Purpose: To integrate and make cross-searchable research images held by publishers, in institutional repositories, and forming specialist research collections, which currently exist in isolated data silos
• ImageWeb should involve minimum effort on the part of the publishers and repository managers, who must be able to use their existing relational databases, XML metadata schemas or RSS feeds
• It requires harvesting of thumbnails and basic metadata describing the images
• Metadata from all the participating sites will be cross-searchable at the registry
• ImageWeb will permit owners of research image collections to publish their images in a way that can easily integrate with other research image collections
• ImageWeb will enable publishers’ web sites to become a more integral part of day-to-day research, and published images to be used more fully than at present
ImageWeb – an (imperfect) real world analogy
• The local newspaper property section contains thumbnail images and basic metadata about houses for sale – equivalent to the ImageWeb registry
[Illustration: a property advert, with its thumbnail image and minimal metadata labelled]
• Users searching this central ‘registry’ pick out what they like, and then . . . go round to the estate agent’s office for full details!
The ImageWeb Consortium
• Image BioInformatics Research Group, University of Oxford
• Leading commercial publishers
  - Nature and Oxford University Press
• Leading Open Access publishers
  - The Public Library of Science and BioMed Central
• University institutional repositories
  - Universities of Cambridge, Imperial College, Oxford and Southampton
• Other stakeholders: British Library, CCLRC, UKOLN, ILRT, CrossRef, SPARC
• Professional biologists and academic biological image collections
The first step – requirements analysis
• “Defining Image Access: Requirements for interoperable discovery and delivery of image data stored in DSpace, EPrints and Fedora-based institutional repositories using a data web approach”
• A six-month requirements analysis project funded by the JISC, as part of the Discovery to Delivery strand of their Repositories and Preservation Programme
• It involves my Image Bioinformatics Research Group with the following partners
  - Repository partners at Cambridge, Imperial College, Oxford and Southampton
  - UKOLN, Digital Repositories Programme Support Team
  - CCLRC e-Science Centre
• Its deliverable will be a report
  - detailing the findings and conclusions from our investigations
  - recommending best practices for adoption to enhance image interoperability between institutional repositories
  - providing repository implementation guidelines for the creation of data webs
  - identifying open source software systems that provide data web functionality
Conclusion – the future of Web publishing
• With the advent of the Semantic Web, the possibility exists for centralized databases to give way to a new paradigm where everyone publishes their own research results
• We are entering the age of distributed personal data publication
  - Most research data will in future not be submitted to centralized databases
  - Rather, data will be published locally by individuals, institutional repositories or journals, complete with semantically rich metadata
  - This form of distributed publication is most appropriate for ‘particulars’ data such as biological research images, where submission to a central database does not scale
• Use of lightweight Semantic Web technologies will enable isolated distributed repositories and databases to be unified into cross-searchable data webs
Acknowledgements
Chris Catton
Graham Klyne
N.B. We are looking for a new researcher to join the Oxford Image Bioinformatics Research Group this autumn. Enquiries and CVs to david.shotton@zoo.ox.ac.uk.
The end