“Star Trek” and Linked Data Project.

Background Paper:
Cochrane Linked Data Project: From “Star Trek” to the
present
November 2012
Prepared by
Chris Mavergames, Director of Web Development, Cochrane Web Team
Lorne Becker, Cochrane Web Team and Cochrane Innovations
For the attention of
Anyone who’s interested.
Structure
1. Background
2. User stories and user research
3. The demonstrator
4. Future plans
5. Further reading and viewing
6. Glossary
1. Background
In May 2011, members of the Web Team, the Cochrane Editorial Unit (CEU), the IMS Team and
Wiley, together with consultants from Ontoba, held the first meeting to explore the use of
semantic web technologies. The aims were to enable more dynamic use of Cochrane content
within The Cochrane Library, to allow Cochrane to connect to the “web of data”, and to forge
potential partnerships with those working in this new online and technological context.
Originally dubbed the “Star Trek” project, due to its futuristic thinking, the project
progressed throughout the remainder of 2011 and into 2012, mainly focusing on showing
proof-of-concept through an initial set of use cases. Then, from March 2012, the project
entered its second phase and became, officially, the Cochrane Linked Data project.
The original thinking and impetus behind this project grew out of two things: developments on
the web in the area of linking data and injecting “meaning” and structure into content, and
reports from our various users, via research done by Wiley and others, that they would like
to see other views of our content. “Thinking outside the container of the Review”, the
full-text PDF presentation that is our current standard, thus became the task. This
required “interrogating” our current RevMan XML structure and looking at how this
structure could be improved or augmented to support doing interesting things with the
content, such as providing new ways to browse and search it and re-packaging it for
various users in various contexts.
Cochrane Reviews are great, but…
There are problems that limit their use by some people:
● It is difficult to wade through all of the text
● It is difficult to understand the figures, terminology, and other bits of the Review
● It is hard to compare interventions without reading multiple Reviews
● Moving from studies in CENTRAL to the Reviews that included those studies is difficult
● It can be difficult to find the Review you seek
Linked data, semantic web and Cochrane: The basics
The linked data approach allows the possibility for a machine (i.e. a computer program) to
“read” (really query) a web page or set of pages and return specific portions of interest to
the user. For example, a semantic web standard called “GoodRelations” uses linked data
markup to enrich search results so that product details, including photos, price, user
reviews and ratings, can be extracted and presented to help the user make a purchasing
decision. Another example is the display of recipes in Google and other search engines,
which is also enhanced by linked data markup. Google “New York Cheesecake recipe” and the
results show a photo, a rating and a preview for each recipe.
But…
Machines aren’t good at reading web pages because…
● Data on the web is meant for human consumption
● Machines need the data to be structured
● Once structured, information can be more easily shared within datasets and across
web pages
Fortunately, Cochrane Reviews are structured – but we still need to teach the machines how
to read them, where to find data within them and how the data is related. The web is
moving from a web of documents to a “web of data”. Right now, the links on web pages are
between documents but the data and content within web pages and in databases is largely
devoid of any “meaning”. The semantic web and linked data are a way to move toward a
web of data that allows for more meaningful connections between things. See the “Further
reading and viewing” section for more info.
Cochrane Semantic Model
The semantic web technology stack (http://en.wikipedia.org/wiki/File:Semantic-web-stack.png)
uses ontologies (semantic models) to describe a domain. For example, Cochrane
Reviews can be described using OWL, the Web Ontology Language, and RDFS, RDF Schema,
to map the classes and relationships of the various components. So, a Review includes a
number of studies and each study may have, for example, a risk of bias assessment in a
Review. Once these concepts and relationships are made explicit, a machine can then
“understand” the underlying content. The ontology is used together with data in RDF
(Resource Description Framework) format, a simple data model that uses “triples” to store
information that can be queried against a given ontology or set of ontologies that
describe the data.
Here is a simple example:
RDF stores data in triples: Subject -> Property -> Object
This is the way humans think as well, in sentences.
<Gerd Antes> hasRole <Director German Ctr>
<Director German Ctr> worksIn <Freiburg, Germany>
<Gerd Antes> worksIn <Freiburg, Germany>
So, given the first 2 statements and a rule in the ontology saying that a person works
wherever their role is based, the machine could infer the 3rd statement.
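The inference step can be sketched in a few lines of Python. This is only an illustration of the idea, not how a triple store such as OWLiM implements reasoning: the triples are plain tuples, and the rule ("X hasRole R" and "R worksIn P" implies "X worksIn P") is an assumption made for this example rather than something taken from the Cochrane ontology.

```python
# Triples are (subject, property, object) tuples, as in the example above.
facts = {
    ("Gerd Antes", "hasRole", "Director German Ctr"),
    ("Director German Ctr", "worksIn", "Freiburg, Germany"),
}


def infer_works_in(triples):
    """Apply one assumed rule: X hasRole R and R worksIn P => X worksIn P."""
    inferred = set()
    for person, prop, role in triples:
        if prop != "hasRole":
            continue
        for subject, prop2, place in triples:
            if prop2 == "worksIn" and subject == role:
                inferred.add((person, "worksIn", place))
    # Only return statements that were not already stated explicitly.
    return inferred - set(triples)


new_facts = infer_works_in(facts)
print(new_facts)
```

In a real semantic web stack the rule would be declared in the ontology and applied by the reasoner, so no such loop would be written by hand.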
We have created an ontology, a semantic model, for Cochrane Reviews and studies. The latest
version can be found here: http://data.cochrane.org/ontologies/review/. It is still a
work-in-progress and needs to be evaluated and tested to be sure the inferences it makes are
consistent with Cochrane methods and that it can fulfill the use cases and thus the needs of
our various end-users.
2. User stories and user research
Projects already conducted by Wiley and Cochrane have indicated that end users would like
to find and view Cochrane Content in a variety of different ways, and developers of new
products for The Cochrane Library would like to be able to select and manipulate sections of
Cochrane Reviews for repackaging into new products. The use of improved XML structure
and semantic web technologies could facilitate the delivery of “dynamic Cochrane content”.
From this research and other thinking within the Linked Data Project and within the RevMan
Advisory Committee (RAC) and other groups within Cochrane, we have developed lists of
“user stories” that inform larger sets of use cases. Applying industry techniques learned
from our consultants, Ontoba, we have used various rubrics and tools to arrive at and
describe these user stories and use cases.
‘So that…’ phrases
One way to capture user stories is to use the “So that…” framework to describe what people
want to do with your content, on your website, etc. You translate desired features into the
form: “As a xxx, I want to be able to yyy, so that I can zzz.” Here are some examples from
the Linked Data Project:
1. As a ‘XXX’, I want to see all the information about a study in CRS, so that...
● ‘Clinician’: I can see if the paper is relevant to my clinical question, before reading
the full report.
● ‘Systematic reviewer’: I can screen the paper to see if it is relevant to my review,
without having to read the full report.
● ‘Anybody’: I can easily compare the characteristics of studies, as the CRS
format is common across entries.
2. As a ‘XXX’, I want to see all risk of bias analysis conducted on a study, so that...
● ‘Clinician’: I can see if the study is biased, and the results trustworthy.
● ‘Clinician’: I can see if there are differing opinions on the biases in the study from
different authors, and this may help me reach my own conclusions about whether
or not I think the study is biased.
● ‘Systematic Reviewer’: I can identify whether someone has already done the work
of assessing the risk of bias of a study, and this may save me time. I could use
the information as a starting point and amend if I think it is needed for my own
review, or I could use the information after I have performed my own
assessments to see how they differ.
From groups of user stories, we are able to build out use cases that can inform potential
prototypes of functionality for use on our websites. One example is the idea of an “Asthma
Super Centre”, or browser of the evidence on Asthma. Another one we’ve been working on is a
CENTRAL demonstrator that shows the power of linking between studies and Reviews and
the information in Reviews about studies they evaluate.
User stories and use cases
For the current linked data project, we have been focusing on two sorts of user stories. One
is the idea of an “Asthma Super Centre”, or browser of the Cochrane evidence on Asthma
that would address our perception that users would like to find and view Cochrane Content
in a variety of different ways. The generic user story for this section has been the following:
As a reader of Cochrane Reviews, I would like to:
1. Filter reviews by selected parameters to show me the subset most relevant to me
2. Display selected portions of those reviews in a format that works for me
3. Link out to selected content (both Cochrane & non-Cochrane) that would
enhance the usefulness of the review material
The second focus for the linked data project has been a CRS-CDSR demonstrator that
explores the potential for linking between studies and Reviews and the information in
Reviews about studies they evaluate. The primary user story that we have been addressing
in this section is:
As a Cochrane author who has identified a single trial report that is relevant to my review, I
would like to:
1. See what other published reports from the same trial have been identified in the
“studified” data in CRS
2. See which other Cochrane Reviews have this as one of their included studies
3. See the Risk of Bias appraisals of this study from those other Reviews
…so that I can improve my review by using the work that others in the Collaboration have
already done.
3. The demonstrator
As part of phase 2 of the Cochrane Linked Data Project, the Web Team, CEU and Ontoba
have created a demonstrator site in which we can build out these initial use cases and where
we can have a “sandbox” for demonstrating the power of using linked data with Cochrane
and other external content. At present, the demonstrator only includes a subset of the
asthma reviews produced by the Airways group. The demonstrator is at
http://demonstrator.dev.cochrane.org and has functionality that relates to both the Asthma
Super Centre and the CRS/CDSR user stories.
1. Searching Reviews by drug name. Currently, there is no cross-indexing against
variant names of drugs in Cochrane Reviews. We have linked to Drugbank
(http://www.drugbank.ca/) which includes most of the variants of drug names
including the different brand names and generic names used in different countries.
We have created a “semantic search” that allows users to type any name for an
asthma medication and find the relevant Cochrane Reviews. See:
http://demonstrator.dev.cochrane.org/interventions. This functionality would greatly
improve the discoverability of Cochrane content in The Cochrane Library as now, for
example, if you search for “Prozac” you get zero results, but if you search for
“fluoxetine” you get 30 results.
2. Displaying selected portions of reviews. Clicking on any title on the “List of
Reviews” page in the demonstrator (http://demonstrator.dev.cochrane.org/reviews)
takes you to a custom view of that review that we have created by including sections
of the review suggested during the Strategic Discussion in Paris. This capability of
showing selected portions of a review, and rearranging their order could allow us in
future to devise different “views” for different user groups, to allow users to
customize their own Cochrane view by selecting the specific components and their
order, or to compare reviews by looking at components from 2 or more Reviews side
by side.
3. Linking out to selected content. In addition to linking to Drugbank as noted
above, we have linked to SIDER, a linked data set that includes information on side
effects from FDA label information (see http://sideeffects.embl.de/drugs/2153/ for
an example).
4. Finding which Cochrane Reviews have included a particular study. Each
review page in the demonstrator includes the list of included studies from the review,
with a link to a specific study page for each item on the list. Each study page
includes a list of all of the reviews (in our limited set) that have included that study
using the unique study identifier from CRS and the links that CRS provides between
studies and Reviews (see
http://demonstrator.dev.cochrane.org/studies/revman/002304061509242379-STDO_x0027_Byrne-2005 for an example).
5. See what other published reports from the same trial have been identified
in the “studified” data in CRS. Once again using the links with CRS, each study
page in the demonstrator includes a list of all published reports from the study that
have been identified by Cochrane collaborators and either used in reviews or studified
in CRS.
6. See the Risk of Bias appraisals of a single study from different Reviews. This
information is also included on each study page in the demonstrator. In some cases
(as in the O’Byrne example above), there is good agreement. Some other examples
have more variation (see
http://demonstrator.dev.cochrane.org/studies/revman/949204060709442762-STDKoopmans-2006).
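The drug-name search described in item 1 above can be sketched as a two-step lookup: resolve whatever name the user types to a generic drug name, then find the Reviews indexed under that generic name. The synonym table below is a small hand-written stand-in for the name variants that Drugbank provides, and the review titles are placeholders, so this is an illustration of the idea rather than the demonstrator's actual code.

```python
# Hypothetical synonym table standing in for the name variants in Drugbank.
# Keys are lower-cased names a user might type; values are generic drug names.
SYNONYMS = {
    "prozac": "fluoxetine",
    "fluoxetine": "fluoxetine",
    "ventolin": "salbutamol",
    "albuterol": "salbutamol",
    "salbutamol": "salbutamol",
}

# Hypothetical index from generic drug name to review titles (placeholders only).
REVIEWS_BY_DRUG = {
    "fluoxetine": ["Example review A", "Example review B"],
    "salbutamol": ["Example review C"],
}


def find_reviews(query):
    """Resolve any known name variant to its generic name, then look up reviews."""
    generic = SYNONYMS.get(query.strip().lower())
    if generic is None:
        return []
    return REVIEWS_BY_DRUG.get(generic, [])


print(find_reviews("Prozac"))
```

With this kind of lookup, a search for a brand name returns the same Reviews as a search for the generic name, which is the behaviour the semantic search aims for.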
While the above examples are simple, they demonstrate proof-of-concept for this approach
and, critically, the data in the “triple store” beneath this website is completely dynamic.
There are only ca. 40 Reviews on Asthma in there now, but if we were to put all Cochrane
Reviews and their related studies in the linked data repository, the results of these
queries would update automatically.
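A minimal sketch of what “dynamic” means here, assuming a toy in-memory list in place of the real triple store: the query is written once as a pattern over triples, and its answer changes as soon as more data is loaded. The property name includesStudy and the identifiers are placeholders invented for this example.

```python
def match(triples, subject=None, prop=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (prop is None or p == prop)
        and (obj is None or o == obj)
    ]


def reviews_including(triples, study):
    """List the reviews that include the given study, sorted for stable output."""
    return sorted(s for (s, _, _) in match(triples, prop="includesStudy", obj=study))


# Placeholder identifiers, not real Cochrane records.
store = [
    ("review:1", "includesStudy", "study:A"),
    ("review:2", "includesStudy", "study:B"),
]
before = reviews_including(store, "study:A")

# Loading more data changes the answer; the query itself is untouched.
store.append(("review:3", "includesStudy", "study:A"))
after = reviews_including(store, "study:A")
print(before, after)
```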
The technology behind the demonstrator
Demonstrator.dev.cochrane.org uses the Drupal open-source content management system
(CMS), the same system used to produce 130+ of the websites for The Cochrane
Collaboration. Drupal “plays nicely” with the semantic web stack: it offers an RDFx module
and a very powerful module called SPARQL Views, which allows SPARQL queries to be
constructed within the core Drupal Views system. Our triple store (linked data repository)
software, OWLiM, runs in the background on a server at a canonical data.cochrane.org
address. On top of it, we use Drupal and its RDF and SPARQL modules to quickly build working
prototype websites that can be styled using Drupal’s built-in theming and templating system.
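To give a feel for the kind of query that sits behind a study page, here is a hedged sketch that assembles a SPARQL query as text in Python. The prefix and every property name (ex:includesStudy, ex:hasRiskOfBiasAssessment, ex:aboutStudy, ex:judgement) are hypothetical placeholders; the real terms would come from the ontology at http://data.cochrane.org/ontologies/review/, and in the demonstrator the query is built through SPARQL Views rather than by hand.

```python
# All prefix and property names below are hypothetical placeholders, not terms
# from the actual Cochrane ontology.
QUERY_TEMPLATE = """
PREFIX ex: <http://example.org/cochrane#>
SELECT ?review ?judgement
WHERE {{
  ?review ex:includesStudy <{study_uri}> .
  ?review ex:hasRiskOfBiasAssessment ?assessment .
  ?assessment ex:aboutStudy <{study_uri}> ;
              ex:judgement ?judgement .
}}
"""


def risk_of_bias_query(study_uri):
    """Fill the study URI into the template.

    No escaping is done here; a real system should validate the URI first.
    """
    return QUERY_TEMPLATE.format(study_uri=study_uri)


query = risk_of_bias_query("http://example.org/study/A")
print(query)
```

The resulting text could be sent to any SPARQL endpoint; only the study URI changes from one study page to the next.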
4. Future plans
Our experience with the linked data project to date has convinced us that it has potential to
become an “enabling technology” for the Collaboration that could allow us to do more with
our data. However, there are a number of issues that should be explored as we decide on
how best to integrate linked data within the Cochrane IT structure. These include:
● Potential additional user stories
● The technical architecture, including implications for the IMS, Web Team, CRS and our
publisher of increased use of linked data
● Adding structure and standardization to Cochrane Reviews
Potential User Stories
Our success in realizing the relatively limited goals of the Linked Data Project to date has
encouraged us to look at additional user stories that might be addressed using this approach
– including several items on the RevMan wish list. For example, RevMan case # 119285,
which says "Provide easy access from RevMan to relevant sections of other reviews using the
same studies via CRS. E.g. if you were completing the RoB table for a study you could easily
see how other authors have assessed the risk of bias for that study", is very similar to the
CRS/CDSR user story that we have been working on. Case 122027 calls for "Interaction
between RevMan & CRS" without offering specific details.
Some of the user stories that might be explored using this approach include the following:
As a Managing Editor, or as a Cochrane Review user wishing to keep up with a specific area
of content, I want to see the date of publication for a subset of reviews (e.g. the set
included in an overview, article, guideline, etc) so that I can see if any have been updated
since I last looked at them. This came up as a specific request from an ME, but could easily
apply to writers of Cochrane overviews, guidelines, book chapters, etc.
Case 122010 - "Enable authors to generate a visual graphic highlighting each treatment
being compared in their review. Each node would be a treatment, each line at least one RCT
with numbers corresponding to the number of RCTs."
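The data behind such a graphic is a count of trials per pair of treatments. A minimal sketch, assuming each RCT is available as the set of treatments it compares (the trial data here are invented placeholders):

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the set of treatments compared in each RCT of a review.
trials = [
    {"Drug A", "Placebo"},
    {"Drug A", "Placebo"},
    {"Drug A", "Drug B"},
    {"Drug A", "Drug B", "Placebo"},  # a three-arm trial contributes three edges
]


def comparison_edges(trials):
    """Count, for each pair of treatments, how many RCTs compare them directly."""
    edges = Counter()
    for arms in trials:
        # Sorting gives each pair a single canonical order.
        for pair in combinations(sorted(arms), 2):
            edges[pair] += 1
    return edges


edges = comparison_edges(trials)
for (a, b), n in sorted(edges.items()):
    print(f"{a} -- {b}: {n} RCT(s)")
```

Each key is a line in the graphic and each count is the number to print on that line; drawing the nodes and lines is left to whatever charting tool is used.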
Case 121023 - "Calculator for estimating power to overthrow current primary outcome. For
example, for a very potent intervention with high precision it may need a study total of
around 15,000 people with a neutral result to drag the findings back to being null. Should it
be a weak finding a trial of 100 may substantially change the result."
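One way such a calculator might work is sketched below. It rests on assumptions that are not in the case text: a fixed-effect inverse-variance meta-analysis, “overthrow” read as the 95% confidence interval of the pooled estimate coming to include the null value of 0, and a single new trial whose own estimate is exactly 0. The effect sizes are invented for illustration. Translating the required weight into a number of participants depends on the outcome type and is not attempted here.

```python
import math


def pooled_fixed_effect(effects, variances):
    """Inverse-variance fixed-effect pooling; returns (estimate, standard error)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * y for w, y in zip(weights, effects)) / total
    return estimate, math.sqrt(1.0 / total)


def neutral_weight_needed(effects, variances, z=1.96):
    """Smallest inverse-variance weight of a trial with effect 0 that makes the
    pooled confidence interval include 0 (returns 0.0 if it already does)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    weighted_sum = sum(w * y for w, y in zip(weights, effects))
    # CI touches 0 when |S| / (W + w) = z / sqrt(W + w), i.e. W + w = (S / z) ** 2.
    return max(0.0, (weighted_sum / z) ** 2 - total)


# Hypothetical log odds ratios and their variances, not taken from any Review.
effects = [-0.40, -0.30, -0.50]
variances = [0.04, 0.05, 0.08]

extra = neutral_weight_needed(effects, variances)
print(f"existing total weight: {sum(1.0 / v for v in variances):.1f}")
print(f"weight needed from a neutral trial: {extra:.1f}")
```

A precise, consistent body of evidence needs a neutral trial carrying more weight than everything already pooled, while a weak finding can be overturned by a small one, which is the contrast the case describes.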
Technical architecture
This refers to how we would actually go about building all this out in reality within, alongside
or otherwise in relation to our current systems, workflows and dataflows. An industry standard
in implementing semantic web and linked data technologies is to “not blow up the company”:
innovate alongside existing tools and technologies to create a metadata store that better
describes the content, while leaving the existing content store(s) alone. However, we might
want to innovate in the authoring process and/or other parts of the content production
process as well. This is all still to be determined and will be discussed at the Linked Data
Project meeting in London from 4-6 December 2012.
Paul Wilton from Ontoba drew up a possible technical architecture diagram to provide us
with an example of one way we could consider:
Adding structure and standardization to Cochrane Reviews
The fact that Cochrane Reviews are very structured has been critical to the success of the
linked data project to date. However, this structure could be greatly improved by coding
some key elements in a standard way across reviews. For example, the only way to
determine which interventions have been included in a Review is to parse the text in the title
of each forest plot. A standardized way of coding the Intervention (I) and the Comparison (C) for each analysis would
improve the power and precision of linked data queries of CDSR. Ideally, all elements of the
Population, Interventions, Comparisons and Outcomes covered in the Cochrane Reviews
would be coded using some standard taxonomy.
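The difference between parsing titles and querying coded fields can be illustrated with a short sketch. The record layout, the "I versus C" title wording and the drug names are assumptions made for this example and do not reflect the RevMan data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedAnalysis:
    """One analysis with its Intervention and Comparison coded explicitly."""
    review_id: str
    intervention: str  # term from a (hypothetical) standard taxonomy
    comparison: str


def parse_title(title):
    """Fragile fallback: guess I and C from a free-text forest plot title."""
    parts = title.split(" versus ")
    if len(parts) != 2:
        return None  # title does not follow the assumed "I versus C" wording
    return parts[0].strip(), parts[1].strip()


# Placeholder titles: the second one defeats the text parser.
print(parse_title("Drug A versus placebo"))
print(parse_title("Drug A vs. placebo"))

analyses = [
    CodedAnalysis("review:1", "Drug A", "Placebo"),
    CodedAnalysis("review:2", "Drug B", "Drug A"),
]


def reviews_with_intervention(analyses, term):
    """Exact match on the coded field, with no text parsing required."""
    return sorted({a.review_id for a in analyses if a.intervention == term})


print(reviews_with_intervention(analyses, "Drug A"))
```

With coded fields, the query can also tell an intervention apart from a comparator, which title parsing cannot do reliably.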
Unfortunately, there is no currently existing taxonomy that adequately addresses this need,
although several widely used taxonomies could partially address our requirements. One
approach to this problem would be for the Collaboration to build on the various CRG topic
lists to develop a Cochrane taxonomy which would not be identical to any individual
taxonomy, but would mirror some specific portions of a handful of key taxonomies in a way
that will allow meaningful linkages to them.
The taxonomy could be built gradually by working with individual CRGs. The process has
already been initiated with the Airways group as part of the Cochrane linked-data project.
The CEU browse list would gradually evolve from its current structure to the new taxonomy.
As each CRG completed its section of the taxonomy, the relevant section of the CRG browse
would be replaced. The eventual result would be that the CEU browse would be completely
replaced by the new taxonomy, and each review would have only a single set of topics.
5. Further reading and viewing
Here are some presentations, videos and articles to provide further background on both
linked data and the semantic web as well as the work so far in Cochrane in the “Star Trek”
and Linked Data Project.
Presentations
Linked Data and Cochrane Reviews: A Report from the “Star Trek” Crew
Plenary talk by Chris Mavergames from Madrid Colloquium, October 2011
http://www.slideshare.net/mavergames/linked-data-and-cochrane-reviews-12936733
Sustainability and Cochrane Reviews: How Technology can Help
Plenary talk by Chris Mavergames from UK Contributors’ Meeting in Loughborough, March
2012
http://www.slideshare.net/mavergames/sustainability-and-cochrane-reviews-how-technology-can-help-12207716
Web 3.0: The Semantic Web
http://www.slideshare.net/HatemMahmoud/web-30-the-semantic-web
Videos
Linked Data and the Web of Data
https://www.youtube.com/watch?v=GKfJ5onP5SQ
Intro to the Semantic Web
https://www.youtube.com/watch?v=OGg8A2zfWKg
The Semantic Web of Data
Tim Berners-Lee, inventor of the World Wide Web
https://www.youtube.com/watch?v=HeUrEh-nqtU
6. Glossary
Here is a glossary of terms related to linked data as well as a few related to Cochrane.
API
Application Programming Interface – allows different pieces of software to communicate.
CENTRAL
Cochrane Central Register of Controlled Trials (CENTRAL)
Controlled vocabulary
Most commonly known in indexing and cataloguing, controlled vocabularies use pre-defined,
specific and agreed-upon sets of terms for use in taxonomies, thesauri and other systems to
tag and organize content and data.
Drupal
An open-source Content Management System (CMS) – see http://drupal.org. The Cochrane
Web Team uses Drupal for the 130+ websites it manages.
Drupal themes
Layouts and designs for Drupal-based websites.
Drupal Views
A module in Drupal that is basically a GUI (Graphical User Interface) for querying the
database (MySQL) behind Drupal for displaying content on a website in more or less any
form you like.
GoodRelations
From http://www.heppnetz.de/projects/goodrelations/, “GoodRelations is the most powerful
vocabulary for publishing all of the details of your products and services in a way friendly to
search engines, mobile applications, and browser extensions. By adding a bit of extra code
to your Web content, you make sure that potential customers realize all the great features
and services and the benefits of doing business with you, because their computers can
extract and present this information with ease.”
Linked data
Part of the movement known as the Semantic Web or Web 3.0, linked data refers to a set of
concepts and standards for connecting data on the web and across data silos. See:
http://linkeddata.org/.
Linked Life Data
“A semantic data integration platform for the biomedical domain” - see
http://linkedlifedata.com. It includes the Unified Medical Language System (UMLS) which
includes SNOMED CT as well as Drugbank, both used in the Linked Data Project
demonstrator site at demonstrator.dev.cochrane.org
Metadata
Put simply, “data about data”. Data that describes your content.
Ontology
“An ontology is a specification of a conceptualization.” Ontologies in the semantic web are
used to describe a domain, including the classes, properties and relationships between
things.
OWL
The Web Ontology Language. See: http://www.w3.org/TR/owl-features/.
OWLiM
Semantic repository software, or “triple store”, currently used in the Cochrane Linked Data
Project. See: http://www.ontotext.com/owlim.
RDF
Resource Description Framework. A data model for storing data in “triples”. See:
http://en.wikipedia.org/wiki/Resource_Description_Framework.
RDFS
RDF Schema language. http://en.wikipedia.org/wiki/RDF_Schema
Semantic Web
See videos above!
SNOMED CT
Systematized Nomenclature of Medicine -- Clinical Terms. A controlled vocabulary of medical
terms. See: http://en.wikipedia.org/wiki/SNOMED_CT.
SPARQL
SPARQL Protocol and RDF Query Language. The query language for querying data in RDF
format. See: http://en.wikipedia.org/wiki/SPARQL.
SPARQL Views
A Drupal module that integrates the SPARQL query language with the Views module to
create displays of content on a website.
Taxonomy
A less formal way of creating a system to organize content. Note: there is substantial debate
about the difference between ontologies, taxonomies, controlled vocabularies and thesauri!
Triples
RDF triples. In the RDF data model, data is stored as triples with a subject – predicate –
object. There are multiple serializations for RDF including RDF/XML, Turtle and N3.
Triple store
From Wikipedia: A triplestore is a purpose-built database for the storage and retrieval of
triples, a triple being a data entity composed of subject-predicate-object, like "Bob is 35" or
"Bob knows Fred". See: http://en.wikipedia.org/wiki/Triplestore.