Study Discovery in Support of the Data Without Boundaries Initiative, the NIH Data Documentation Index and Infonomics

Jay Greenfield
Booz Allen Hamilton
DDI 2014 iAssist Sprint
Toronto, ON

Agenda
• Introduce three initiatives that a DDI 4 Discovery functional view needs to support
  – Data Without Boundaries (DwB)
  – NIH Data Discovery Index (NIH DDI)
  – The Infonomics Use Case
• In this context, consider some SDMX-based, GSIM and DDI Dublin Core-based information objects with which a DDI 4 Discovery view may need to be aligned
• In view of these information objects, consider the completeness of DISCO

USE CASES

The DwB and NIH DDI Use Cases
• In both DwB and NIH DDI, aggregate datasets are a subject for discovery together with micro datasets
  – The DwB Metadata Model includes elements from both DDI 3 and SDMX, with the idea of using aggregate data to “provide context for searches for microdata”
  – Likewise, NIH DDI seeks to spawn a pilot project that “would work with interested journals (such as PLoS, BMC, or Nature Genetics) to require that every table and figure links out to original data and software”

Infonomics, Citation and the NIH DDI
[Figure: From GSIM 1.1: Represented and Instance Variables]
• GSIM has introduced the represented variable
• It is akin to constructs and common data elements (CDEs), whereas instance variables are actual measures
• NIH DDI has suggested that we attach citations to constructs and datasets because “citations are a metric that can be used by NIH and the academic communities to assess scholarly activity”
• Such “assessments” are central to infonomics, which seeks to find and define metrics that can be used in the valuation of information

MEET THE INFORMATION OBJECTS

The RDF Data Cube Vocabulary
[Figure: RDF Data Cube Vocabulary diagram; labels: Dimension (three), Measure]
[Figure: RDF Data Cube Vocabulary diagram; label: Slice]

Represented variables and infonomics
[Figure: represented variable "has Citation"]
Citations, when associated with represented variables (CDEs), enable resource valuation or,
again, infonomics

• A represented variable can have many citations
• Citations conform to Dublin Core and cover 15 domains as well as keywords from thesauri like MeSH
• Using MeSH enables programmatic search for articles in PubMed
• By comparing and compiling the citations, evaluations of represented variables and datasets can be undertaken in support of reviews by governance groups including NIH and OMB

• In DDI, Dublin Core (DC) is expressed in XML
• Natively, DC is specified in DC UML and DC RDF/XML
• Using DC RDF/XML and a standard RDF query engine, it is possible to observe and analyze relationships between citations both within and between represented variables
Possible Partner: Metadata Technology

• The MeSH vocabulary is used for indexing journal article citations hosted by PubMed
• PubMed hosts more than 23 million citations for biomedical literature from MEDLINE, life science journals, and online books
• PubMed supports both human searchers at its portal and software agents by way of Entrez
• PubMed indexes citations using both MeSH headings (Medical Subject Headings) and MeSH subheadings

DISCO COMPLETENESS

In DDI 4 might we want to revisit the DISCO discovery view?
• Including more elements from the RDF Data Cube Vocabulary (the qb namespace in DISCO) can lend additional specificity to search:
  – In which studies was a specific analysis undertaken and reported?
  – How comparable was the micro data that went into these analyses?
• Including GSIM represented variables and connecting elements from the Dublin Core RDF Citation Vocabulary to represented variables and datasets opens the way to an ecosystem of crawlers:
  – Software agents can search citation databases for new publications
  – Other data resources might be linked in
    • They might include “existing domain-specific repositories, institutional data repositories, or other resources including commercial clouds”

Could there be more than one DISCO?
• Dublin Core motivates its Dublin Core Application Profiles (DCAP) with this introduction:
  – When it comes to metadata, one size does not fit all. In fact, one size often does not even fit many. The metadata needs of particular communities and applications are very diverse. The result is a great proliferation of metadata formats, even across applications that have metadata needs in common.
  – The Dublin Core Metadata Initiative has addressed this by providing a framework for designing a Dublin Core Application Profile (DCAP). A DCAP defines metadata records which meet specific application needs while providing semantic interoperability with other applications on the basis of globally defined vocabularies and models.
• In line with this vision, Dublin Core introduces the Singapore Framework in its DCAP guidelines document

The Singapore Framework
[Figure: The Singapore Framework]
• The Singapore Framework is a standard, not an information model
• Perhaps the middle layer, “Domain standards”, might be analogous to a DDI 4 Discovery package
• Then, in place of DISCO, there might be multiple application profiles or, again, views
• In this context, imagine that DDI 4 might publish at least two such “official” ones
• If you had your druthers, what would these two profiles be?
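
Illustration: a citation crawler for one represented variable

As a rough illustration of the crawler idea raised under DISCO completeness (software agents searching citation databases for new publications), the sketch below builds a MeSH-restricted PubMed query through the Entrez ESearch service and picks out article IDs not yet attached to a represented variable. It is a minimal sketch, not a proposed DDI 4 design. The represented variable record, its MeSH terms and its PMIDs are hypothetical, and the response is a canned string in the general shape ESearch returns for retmode=json, so nothing is fetched over the network.

```python
import json
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (the programmatic side of Entrez).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_mesh_search_url(mesh_terms, retmax=20):
    """Build an ESearch URL that ANDs together a list of MeSH headings."""
    # The [MeSH Terms] field tag restricts each term to the MeSH index.
    term = " AND ".join(f'"{t}"[MeSH Terms]' for t in mesh_terms)
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


def new_pmids(esearch_json, known_pmids):
    """Return PMIDs from an ESearch JSON response that are not yet known."""
    found = json.loads(esearch_json)["esearchresult"]["idlist"]
    known = set(known_pmids)
    return [pmid for pmid in found if pmid not in known]


# Hypothetical represented variable with MeSH keywords and attached citations.
represented_variable = {
    "name": "systolic_blood_pressure",
    "mesh_terms": ["Blood Pressure", "Hypertension"],
    "known_pmids": ["11111111"],
}

url = build_mesh_search_url(represented_variable["mesh_terms"])

# Canned response standing in for a live call; a real crawler would fetch
# `url` (for example with urllib.request.urlopen) and pass the body along.
canned = '{"esearchresult": {"count": "2", "idlist": ["11111111", "22222222"]}}'
fresh = new_pmids(canned, represented_variable["known_pmids"])
```

Here `fresh` holds the candidate new citations for the represented variable. Counting or comparing such lists across variables is one simple way the citation-based "assessments" discussed earlier might be fed.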