Background Paper: Cochrane Linked Data Project: From “Star Trek” to the present

November 2012

Prepared by
Chris Mavergames, Director of Web Development, Cochrane Web Team
Lorne Becker, Cochrane Web Team and Cochrane Innovations

For the attention of
Anyone who’s interested.

Structure
1. Background
2. User stories and user research
3. The demonstrator
4. Future plans
5. Further reading and viewing
6. Glossary

1. Background

In May 2011, members of the Web Team, the Cochrane Editorial Unit (CEU), the IMS Team, Wiley and consultants from Ontoba held a first meeting to explore the use of semantic web technologies. The aims were to enable more dynamic use of Cochrane content within The Cochrane Library, to allow Cochrane to connect to the “web of data”, and to forge potential partnerships with others working in this new online and technological context. Originally dubbed the “Star Trek” project because of its futuristic thinking, the project progressed through the remainder of 2011 and into 2012, focusing mainly on showing proof of concept through an initial set of use cases. In March 2012 the project entered its second phase and officially became the Cochrane Linked Data Project.

The impetus for the project came from two sources: developments on the web in linking data and injecting “meaning” and structure into content, and reports from our users, via research done by Wiley and others, that they would like to see other views of our content. The task thus became “thinking outside the container of the Review”, the full-text PDF presentation that is our current standard. This required “interrogating” our current RevMan XML structure and looking at how it could be improved or augmented to support doing interesting things with the content, such as providing new ways to browse and search it and repackaging it for various users in various contexts.
Cochrane Reviews are great, but…

There are problems that limit their use by some people:
● It is difficult to wade through all of the text
● It is difficult to understand the figures, terminology and other parts of the Review
● It is hard to compare interventions without reading multiple Reviews
● Moving from a study in CENTRAL to the Reviews that included that study is difficult
● It can be difficult to find the Review you seek

Linked data, semantic web and Cochrane: The basics

The linked data approach makes it possible for a machine (i.e. a computer program) to “read” (really, query) a web page or set of pages and return the specific portions of interest to the user. For example, a semantic web standard called “GoodRelations” uses linked data markup to enrich search results, so that product details such as photos, price, user reviews and ratings can be extracted and presented to help users make their purchasing decision. The display of recipes in Google and other search engines is also being enhanced by linked data markup: Google “New York Cheesecake recipe” and a photo, rating and preview appear with the results.

[Figure: Google search results for “New York Cheesecake recipe”, showing a photo, rating and preview]

But… Machines aren’t good at reading web pages because…
● Data on the web is meant for human consumption
● Machines need the data to be structured
● Once structured, information can be more easily shared within datasets and across web pages

Fortunately, Cochrane Reviews are structured, but we still need to teach the machines how to read them, where to find data within them and how the data is related.

The web is moving from a web of documents to a “web of data”. Right now, the links on web pages are between documents, but the data and content within web pages and in databases are largely devoid of any “meaning”. The semantic web and linked data are a way to move toward a web of data that allows for more meaningful connections between things.
See the “Further reading and viewing” section for more information.

Cochrane Semantic Model

The semantic web technology stack (http://en.wikipedia.org/wiki/File:Semantic-webstack.png) uses ontologies (semantic models) to describe a domain. For example, Cochrane Reviews can be described using OWL (the Web Ontology Language) and RDFS (RDF Schema) to map the classes and relationships of their various components. A Review includes a number of studies, and each study may have, for example, a risk of bias assessment in that Review. Once these concepts and relationships are made explicit, a machine can “understand” the underlying content.

An ontology is used together with data in RDF (Resource Description Framework) format, a simple data model that stores information as “triples” that can be queried against the ontology, or set of ontologies, describing the data. Here is a simple example.

RDF stores data in triples: Subject -> Property -> Object

This is also the way humans think, in sentences.

<Gerd Antes> hasRole <Director German Ctr>
<Director German Ctr> worksIn <Freiburg, Germany>
<Gerd Antes> worksIn <Freiburg, Germany>

So, given the first two statements (and an ontology that relates the two properties), the machine could infer the third.

We have created an ontology, a semantic model, for Cochrane Reviews and studies. The latest version can be found at http://data.cochrane.org/ontologies/review/. It is still a work in progress and needs to be evaluated and tested to be sure that the inferences it makes are consistent with Cochrane methods and that it can fulfil the use cases, and thus the needs, of our various end users.

2. User stories and user research

Projects already conducted by Wiley and Cochrane have indicated that end users would like to find and view Cochrane content in a variety of different ways, and that developers of new products for The Cochrane Library would like to be able to select and manipulate sections of Cochrane Reviews for repackaging into new products.
The use of improved XML structure and semantic web technologies could facilitate the delivery of “dynamic Cochrane content”.

From this research, and from other thinking within the Linked Data Project, the RevMan Advisory Committee (RAC) and other groups within Cochrane, we have developed lists of “user stories” that inform larger sets of use cases. Using industry techniques learned from our consultants, Ontoba, we have used various rubrics and tools to arrive at and describe these user stories and use cases.

‘So that…’ phrases

One way to capture user stories is to use the “So that…” framework to describe what people want to do with your content, on your website, etc. You translate desired features into the form: “As a xxx, I want to be able to yyy, so that I can zzz.” Here are some examples from the Linked Data Project:

1. As a ‘XXX’, I want to see all the information about a study in CRS, so that...
● ‘Clinician’: I can see if the paper is relevant to my clinical question before reading the full report.
● ‘Systematic reviewer’: I can screen the paper to see if it is relevant to my review without having to read the full report.
● ‘Anybody’: I can easily compare the characteristics of studies, as the CRS format is common across entries.

2. As a ‘XXX’, I want to see all risk of bias analysis conducted on a study, so that...
● ‘Clinician’: I can see if the study is biased and whether the results are trustworthy.
● ‘Clinician’: I can see if different authors hold differing opinions on the biases in the study, which may help me reach my own conclusion about whether or not the study is biased.
● ‘Systematic reviewer’: I can identify whether someone has already done the work of assessing the risk of bias of a study, which may save me time. I could use the information as a starting point and amend it if needed for my own review, or I could use it after I have performed my own assessments to see how they differ.
From groups of user stories, we are able to build out use cases that can inform potential prototypes of functionality for use on our websites. One example is the idea of an “Asthma Super Centre”, or browser of the evidence on asthma. Another one we have been working on is a CENTRAL demonstrator that shows the power of linking between studies and Reviews and the information in Reviews about the studies they evaluate.

User stories and use cases

For the current Linked Data Project, we have been focusing on two sorts of user stories. One is the idea of an “Asthma Super Centre”, or browser of the Cochrane evidence on asthma, which would address our perception that users would like to find and view Cochrane content in a variety of different ways. The generic user story for this section has been the following:

As a reader of Cochrane Reviews, I would like to:
1. Filter reviews by selected parameters to show me the subset most relevant to me
2. Display selected portions of those reviews in a format that works for me
3. Link out to selected content (both Cochrane and non-Cochrane) that would enhance the usefulness of the review material

The second focus for the Linked Data Project has been a CRS-CDSR demonstrator that explores the potential for linking between studies and Reviews and the information in Reviews about the studies they evaluate. The primary user story that we have been addressing in this section is:

As a Cochrane author who has identified a single trial report that is relevant to my review, I would like to:
1. See what other published reports from the same trial have been identified in the “studified” data in CRS
2. See which other Cochrane Reviews have this as one of their included studies
3. See the Risk of Bias appraisals of this study from those other Reviews
…so that I can improve my review by using the work that others in the Collaboration have already done.

3. The demonstrator

As part of phase 2 of the Cochrane Linked Data Project, the Web Team, CEU and Ontoba have created a demonstrator site in which we can build out these initial use cases and which gives us a “sandbox” for demonstrating the power of using linked data with Cochrane and other external content. At present, the demonstrator includes only a subset of the asthma reviews produced by the Airways group. The demonstrator is at http://demonstrator.dev.cochrane.org and has functionality that relates to both the Asthma Super Centre and the CRS/CDSR user stories.

1. Searching Reviews by drug name. Currently, there is no cross-indexing against variant names of drugs in Cochrane Reviews. We have linked to Drugbank (http://www.drugbank.ca/), which includes most of the variants of drug names, including the different brand names and generic names used in different countries. We have created a “semantic search” that allows users to type any name for an asthma medication and find the relevant Cochrane Reviews. See: http://demonstrator.dev.cochrane.org/interventions. This functionality would greatly improve the discoverability of Cochrane content in The Cochrane Library, where at present, for example, a search for “Prozac” returns zero results but a search for “fluoxetine” returns 30 results.

2. Displaying selected portions of reviews. Clicking on any title on the “List of Reviews” page in the demonstrator (http://demonstrator.dev.cochrane.org/reviews) takes you to a custom view of that review, which we have created by including the sections of the review suggested during the Strategic Discussion in Paris. This capability of showing selected portions of a review and rearranging their order could in future allow us to devise different “views” for different user groups, to let users customize their own Cochrane view by selecting specific components and their order, or to compare reviews by looking at components from two or more Reviews side by side.

3. Linking out to selected content. In addition to linking to Drugbank as noted above, we have linked to SIDER, a linked data set that includes information on side effects from FDA label information (see http://sideeffects.embl.de/drugs/2153/ for an example).

4. Finding which Cochrane Reviews have included a particular study. Each review page in the demonstrator includes the list of included studies from the review, with a link to a specific study page for each item on the list. Each study page includes a list of all of the reviews (in our limited set) that have included that study, using the unique study identifier from CRS and the links that CRS provides between studies and Reviews (see http://demonstrator.dev.cochrane.org/studies/revman/002304061509242379-STDO_x0027_Byrne-2005 for an example).

5. Seeing what other published reports from the same trial have been identified in the “studified” data in CRS. Once again using the links with CRS, each study page in the demonstrator includes a list of all published reports from the study that have been identified by Cochrane collaborators and either used in reviews or studified in CRS.

6. Seeing the Risk of Bias appraisals of a single study from different Reviews. This information is also included on each study page in the demonstrator. In some cases (as in the O’Byrne example above), there is good agreement. Some other examples show more variation (see http://demonstrator.dev.cochrane.org/studies/revman/949204060709442762-STDKoopmans-2006).

While the above examples are simple, they demonstrate the proof of concept of this approach and, critically, the data in the “triple store” beneath this website is completely dynamic. There are only ca. 40 Reviews on asthma in there now, but if we were to put all Cochrane Reviews and their related studies in the linked data repository, the queries would update automatically.
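The lookups described in the list above (finding Reviews by any name of a drug, and finding the Reviews that include a given study) can be sketched as pattern matches over triples. The following is a minimal illustration in Python, not the demonstrator's actual implementation, which uses SPARQL queries against OWLiM. The `TripleStore` class, the property names (`hasName`, `hasIntervention`, `includesStudy`) and all identifiers such as `review:R1` are hypothetical and chosen only for this example.

```python
from typing import Iterator, Optional

Triple = tuple[str, str, str]


class TripleStore:
    """A tiny in-memory stand-in for a real triple store (hypothetical)."""

    def __init__(self) -> None:
        self._triples: set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add((subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield triples that agree with the given terms; None is a wildcard."""
        for s, p, o in self._triples:
            if subject in (None, s) and predicate in (None, p) and obj in (None, o):
                yield (s, p, o)


def reviews_for_drug_name(store: TripleStore, name: str) -> list[str]:
    """Resolve any known name (brand or generic) to a drug, then list its Reviews."""
    drugs = {s for s, _, _ in store.match(predicate="hasName", obj=name.lower())}
    reviews = {
        s
        for drug in drugs
        for s, _, _ in store.match(predicate="hasIntervention", obj=drug)
    }
    return sorted(reviews)


def reviews_including_study(store: TripleStore, study: str) -> list[str]:
    """List every Review that names the study among its included studies."""
    return sorted(s for s, _, _ in store.match(predicate="includesStudy", obj=study))


# Hypothetical data: one drug with two names, two Reviews sharing one study.
store = TripleStore()
store.add("drug:fluoxetine", "hasName", "fluoxetine")
store.add("drug:fluoxetine", "hasName", "prozac")
store.add("review:R1", "hasIntervention", "drug:fluoxetine")
store.add("review:R1", "includesStudy", "study:S1")
store.add("review:R2", "includesStudy", "study:S1")

print(reviews_for_drug_name(store, "Prozac"))
print(reviews_including_study(store, "study:S1"))
```

Because the functions query the store each time they are called, adding further triples (for example a third Review that includes `study:S1`) changes the results without any change to the query code, which is the sense in which the queries "update automatically".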
The technology behind the demonstrator

demonstrator.dev.cochrane.org uses the Drupal open-source content management system (CMS), the same system used to produce 130+ of the websites for The Cochrane Collaboration. Drupal “plays nicely” with the semantic web stack: it has an RDFx module and a very powerful module called SPARQL Views, which allows SPARQL queries to be constructed within the core Drupal Views system. With our triple store (linked data repository) software, OWLiM, running in the background on a server at a canonical data.cochrane.org address, we use Drupal and its RDF and SPARQL modules to quickly build working prototype websites that can be styled rapidly using Drupal’s built-in theming and templating system.

4. Future plans

Our experience with the Linked Data Project to date has convinced us that it has the potential to become an “enabling technology” for the Collaboration that could allow us to do more with our data. However, there are a number of issues that should be explored as we decide how best to integrate linked data within the Cochrane IT structure. These include:
● Potential additional user stories
● The technical architecture, including the implications of increased use of linked data for the IMS, Web Team, CRS and our publisher
● Adding structure and standardization to Cochrane Reviews

Potential User Stories

Our success in realizing the relatively limited goals of the Linked Data Project to date has encouraged us to look at additional user stories that might be addressed using this approach, including several items on the RevMan wish list. For example, RevMan case # 119285 which says "Provide easy access from RevMan to relevant sections of other reviews using the same studies via CRS. E.g.
if you were completing the RoB table for a study you could easily see how other authors have assessed the risk of bias for that study" is very similar to the CRS/CDSR user story that we have been working on, and case 122027 calls for "Interaction between RevMan & CRS" without offering specific details.

Some of the user stories that might be explored using this approach include the following:
● As a Managing Editor (ME), or as a Cochrane Review user wishing to keep up with a specific area of content, I want to see the date of publication for a subset of reviews (e.g. the set included in an overview, article, guideline, etc.) so that I can see if any have been updated since I last looked at them. This came up as a specific request from an ME, but could easily apply to writers of Cochrane overviews, guidelines, book chapters, etc.
● Case 122010 - "Enable authors to generate a visual graphic highlighting each treatment being compared in their review. Each node would be a treatment, each line at least one RCT with numbers corresponding to the number of RCTs."
● Case 121023 - "Calculator for estimating power to overthrow current primary outcome. For example, for a very potent intervention with high precision it may need a study total of around 15,000 people with a neutral result to drag the findings back to being null. Should it be a weak finding a trial of 100 may substantially change the result."

Technical architecture

This refers to how we would actually build all this out, whether within, alongside or otherwise in relation to our current systems, workflows and dataflows. A common industry approach to implementing semantic web and linked data technologies is to “not blow up the company”: innovate alongside existing tools and technologies to create a metadata store that better describes the content, but leave the existing content store(s) alone. However, we might want to innovate in the authoring process and/or other parts of the content production process as well.
This is all still to be determined and will be discussed at the Linked Data Project meeting in London from 4-6 December 2012. Paul Wilton from Ontoba drew up a possible technical architecture diagram to provide us with an example of one approach we could consider:

[Figure: possible technical architecture diagram drawn up by Paul Wilton, Ontoba]

Adding structure and standardization to Cochrane Reviews

The fact that Cochrane Reviews are very structured has been critical to the success of the Linked Data Project to date. However, this structure could be greatly improved by coding some key elements in a standard way across reviews. For example, the only way to determine which interventions have been included in a Review is to parse the text in the title of each forest plot. A standardized way of coding the Intervention (I) and the Comparison (C) for each analysis would improve the power and precision of linked data queries of CDSR.

Ideally, all elements of the Population, Interventions, Comparisons and Outcomes covered in Cochrane Reviews would be coded using some standard taxonomy. Unfortunately, no existing taxonomy adequately addresses this need, although several widely used taxonomies could partially address our requirements.

One approach to this problem would be for the Collaboration to build on the various CRG topic lists to develop a Cochrane taxonomy. This would not be identical to any individual taxonomy, but would mirror specific portions of a handful of key taxonomies in a way that allows meaningful linkages to them. The taxonomy could be built gradually by working with individual CRGs. The process has already been initiated with the Airways group as part of the Cochrane Linked Data Project. The CEU browse list would gradually evolve from its current structure to the new taxonomy. As each CRG completed its section of the taxonomy, the relevant section of the CRG browse would be replaced.
The eventual result would be that the CEU browse would be completely replaced by the new taxonomy, and each review would have only a single set of topics.

5. Further reading and viewing

Here are some presentations, videos and articles that provide further background on linked data and the semantic web, as well as on the work so far in Cochrane in the “Star Trek” and Linked Data Project.

Presentations

Linked Data and Cochrane Reviews: A Report from the “Star Trek” Crew
Plenary talk by Chris Mavergames from Madrid Colloquium, October 2011
http://www.slideshare.net/mavergames/linked-data-and-cochrane-reviews-12936733

Sustainability and Cochrane Reviews: How Technology can Help
Plenary talk by Chris Mavergames from UK Contributors’ Meeting in Loughborough, March 2012
http://www.slideshare.net/mavergames/sustainability-and-cochrane-reviews-howtechnology-can-help-12207716

Web 3.0: The Semantic Web
http://www.slideshare.net/HatemMahmoud/web-30-the-semantic-web

Videos

Linked Data and the Web of Data
https://www.youtube.com/watch?v=GKfJ5onP5SQ

Intro to the Semantic Web
https://www.youtube.com/watch?v=OGg8A2zfWKg

The Semantic Web of Data
Tim Berners-Lee, inventor of the World Wide Web
https://www.youtube.com/watch?v=HeUrEh-nqtU

6. Glossary

Here is a glossary of terms related to linked data, as well as a few related to Cochrane.

API
Application Programming Interface. Allows different pieces of software to communicate.

CENTRAL
Cochrane Central Register of Controlled Trials (CENTRAL).

Controlled vocabulary
Most commonly known in indexing and cataloguing, controlled vocabularies use pre-defined, specific and agreed-upon sets of terms for use in taxonomies, thesauri and other systems to tag and organize content and data.

Drupal
An open-source Content Management System (CMS). See: http://drupal.org. The Cochrane Web Team uses Drupal for the 130+ websites it manages.

Drupal themes
Layouts and designs for Drupal-based websites.
Drupal Views
A module in Drupal that is basically a GUI (Graphical User Interface) for querying the database (MySQL) behind Drupal in order to display content on a website in more or less any form you like.

GoodRelations
From http://www.heppnetz.de/projects/goodrelations/: “GoodRelations is the most powerful vocabulary for publishing all of the details of your products and services in a way friendly to search engines, mobile applications, and browser extensions. By adding a bit of extra code to your Web content, you make sure that potential customers realize all the great features and services and the benefits of doing business with you, because their computers can extract and present this information with ease.”

Linked data
Part of the movement known as the Semantic Web or Web 3.0, linked data refers to a set of concepts and standards for connecting data on the web and across data silos. See: http://linkeddata.org/.

Linked Life Data
“A semantic data integration platform for the biomedical domain”. See: http://linkedlifedata.com. It includes the Unified Medical Language System (UMLS), which includes SNOMED CT, as well as Drugbank; both are used in the Linked Data Project demonstrator site at demonstrator.dev.cochrane.org.

Metadata
Put simply, “data about data”. Data that describes your content.

Ontology
“An ontology is a specification of a conceptualization.” Ontologies in the semantic web are used to describe a domain, including the classes, properties and relationships between things.

OWL
The Web Ontology Language. See: http://www.w3.org/TR/owl-features/.

OWLiM
Semantic repository software, or “triple store”, currently used in the Cochrane Linked Data Project. See: http://www.ontotext.com/owlim.

RDF
Resource Description Framework. A data model for storing data in “triples”. See: http://en.wikipedia.org/wiki/Resource_Description_Framework.

RDFS
RDF Schema language. See: http://en.wikipedia.org/wiki/RDF_Schema.

Semantic Web
See videos above!
SNOMED CT
Systematized Nomenclature of Medicine, Clinical Terms. A controlled vocabulary of medical terms. See: http://en.wikipedia.org/wiki/SNOMED_CT.

SPARQL
SPARQL Protocol and RDF Query Language. The query language for querying data in RDF format. See: http://en.wikipedia.org/wiki/SPARQL.

SPARQL Views
A Drupal module that integrates the SPARQL query language with the Views module to create displays of content on a website.

Taxonomy
A less formal way of creating a system to organize content. Note: there is substantial debate about the difference between ontologies, taxonomies, controlled vocabularies and thesauri!

Triples
RDF triples. In the RDF data model, data is stored as triples, each with a subject, a predicate and an object. There are multiple serializations for RDF, including RDF/XML, Turtle and N3.

Triple store
From Wikipedia: A triplestore is a purpose-built database for the storage and retrieval of triples, a triple being a data entity composed of subject-predicate-object, like "Bob is 35" or "Bob knows Fred". See: http://en.wikipedia.org/wiki/Triplestore.
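
As a hedged illustration of the “Triples” and “Triple store” entries, the sketch below stores the Gerd Antes example from the Background section as subject-predicate-object tuples and applies a single inference rule: if X hasRole R and R worksIn P, then X worksIn P. The function name `infer_works_in` and the rule itself are assumptions made for this example; the paper does not say how the Cochrane ontology expresses such a rule. In OWL 2 a rule of this shape can be written as a property chain axiom.

```python
Triple = tuple[str, str, str]


def infer_works_in(triples: set[Triple]) -> set[Triple]:
    """Return the input triples plus any worksIn facts implied by the rule:
    (X hasRole R) and (R worksIn P)  =>  (X worksIn P).
    This one hard-coded rule is a hypothetical stand-in for an ontology axiom.
    """
    inferred = set(triples)
    for person, first_predicate, role in triples:
        if first_predicate != "hasRole":
            continue
        for subject, second_predicate, place in triples:
            if second_predicate == "worksIn" and subject == role:
                inferred.add((person, "worksIn", place))
    return inferred


# The two asserted statements from the example in the Background section.
asserted = {
    ("Gerd Antes", "hasRole", "Director German Ctr"),
    ("Director German Ctr", "worksIn", "Freiburg, Germany"),
}

closure = infer_works_in(asserted)
for subject, predicate, obj in sorted(closure):
    print(f"<{subject}> {predicate} <{obj}>")
```

The third statement is never asserted directly; it appears in `closure` only because the rule links the two asserted triples through the shared role.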