Study Discovery in Support of
the Data Without Boundaries Initiative,
the NIH Data Discovery Index
and Infonomics
Jay Greenfield
Booz Allen Hamilton
DDI 2014 iAssist Sprint
Toronto, ON
Agenda
• Introduce three initiatives that a DDI 4 Discovery
functional view needs to support
– Data Without Boundaries (DwB)
– NIH Data Discovery Index (NIH DDI)
– The Infonomics Use Case
• In this context consider some SDMX-based, GSIM
and DDI Dublin Core-based information objects
with which a DDI 4 Discovery view may need to
be aligned
• In view of these information objects consider the
completeness of DISCO
USE CASES
The DwB and NIH DDI Use Cases
• In both DwB and NIH DDI, aggregate datasets are
a subject for discovery together with micro
datasets
– The DwB Metadata Model includes elements
from both DDI 3 and SDMX, with the idea of using aggregate
data to “provide context for searches for microdata”
– Likewise, NIH DDI seeks to spawn a pilot project that
“would work with interested journals (such as PLoS,
BMC, or Nature Genetics) to require that every table
and figure links out to original data and software”
Infonomics, Citation and the NIH DDI
From GSIM 1.1: Represented and Instance Variables
• GSIM has introduced the represented variable
• It is akin to constructs and common data
elements (CDEs), whereas instance variables are actual
measures
• NIH DDI has suggested that we attach citations to
constructs and datasets because “citations are a
metric that can be used by NIH and the academic
communities to assess scholarly activity”
• Such “assessments” are central to infonomics,
which seeks to find and define metrics that can
be used in the valuation of information
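The relationship described above can be sketched as a small data structure. This is an illustrative sketch only, not GSIM's formal model: the class names, field names, the example variable, and the placeholder identifiers are all assumptions made for the example. It shows a represented variable carrying citations, an instance variable pointing back to it, and a simple citation count as one possible valuation metric.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Citation:
    identifier: str  # placeholder identifier, not a real publication
    title: str


@dataclass
class RepresentedVariable:
    """A construct or common data element (hypothetical shape)."""
    name: str
    citations: List[Citation] = field(default_factory=list)


@dataclass
class InstanceVariable:
    """An actual measure in a dataset, tied to a represented variable."""
    name: str
    dataset: str
    represents: RepresentedVariable


def citation_count(variable: RepresentedVariable) -> int:
    """Count distinct citations; one candidate metric among many."""
    return len({c.identifier for c in variable.citations})


# Hypothetical example data.
bmi = RepresentedVariable("Body mass index")
bmi.citations.append(Citation("example-id-1", "Hypothetical article A"))
bmi.citations.append(Citation("example-id-2", "Hypothetical article B"))
bmi_wave1 = InstanceVariable("BMI_W1", "hypothetical-survey-wave-1", bmi)

print(citation_count(bmi_wave1.represents))
```

Because the citations hang off the represented variable rather than the instance variable, every dataset that measures the same construct shares one citation record.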
MEET THE INFORMATION OBJECTS
The RDF Data Cube Vocabulary
[Figure: a data cube labeled with three Dimensions and one Measure]
[Figure: a Slice of the data cube]
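As a rough illustration of the terms in these figures, the sketch below models observations that are located by dimensions and carry one measure, and then takes a slice by fixing some dimension values. It uses plain Python structures rather than RDF; the dimension names, the measure name, and the numbers are invented for the example and do not come from any real dataset.

```python
# Hypothetical cube structure: three dimensions and one measure.
DIMENSIONS = ("area", "period", "sex")
MEASURE = "value"

# Each observation gives a value for every dimension plus the measure.
observations = [
    {"area": "A", "period": "2012", "sex": "F", "value": 10.0},
    {"area": "A", "period": "2012", "sex": "M", "value": 12.0},
    {"area": "B", "period": "2012", "sex": "F", "value": 7.0},
    {"area": "B", "period": "2013", "sex": "F", "value": 8.0},
]


def slice_cube(cube, **fixed):
    """Return the observations whose dimensions match the fixed values.

    Fixing a subset of the dimensions and letting the rest vary is the
    idea behind a slice.
    """
    unknown = set(fixed) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"not a dimension: {sorted(unknown)}")
    return [
        obs for obs in cube
        if all(obs[dim] == val for dim, val in fixed.items())
    ]


# Fix period and sex; area is the dimension left free in this slice.
slice_2012_f = slice_cube(observations, period="2012", sex="F")
for obs in slice_2012_f:
    print(obs["area"], obs[MEASURE])
```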
Represented variables and infonomics
[Figure: a represented variable has a Citation]
Citations, when associated with represented variables (CDEs),
enable resource valuation or, again, infonomics
• A represented variable can have many citations
• Citations conform to Dublin Core and cover its 15
elements as well as keywords from thesauri like
MeSH
• Using MeSH enables programmatic search for
articles in PubMed
• By comparing and compiling the citations,
evaluations of represented variables and datasets
can be undertaken in support of reviews by
governance groups including NIH and OMB
• In DDI, Dublin Core (DC) is expressed in XML
• Natively, DC is specified in DC UML and DC RDF/XML
• Using DC RDF/XML and a standard RDF query engine,
it is possible to observe and analyze relationships
between citations both within and between represented
variables
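To make the kind of analysis mentioned above concrete, here is a minimal sketch that stands in for an RDF query engine with a set of plain (subject, predicate, object) tuples and a wildcard pattern matcher. It is not SPARQL and not a real triple store. The `ex:` identifiers and the `ex:hasCitation` predicate are hypothetical; `dc:subject` is used only as a label for the Dublin Core subject element.

```python
DC_SUBJECT = "dc:subject"
HAS_CITATION = "ex:hasCitation"  # hypothetical predicate for this sketch

# Hypothetical triples: two represented variables and their citations.
triples = {
    ("ex:rv/bmi", HAS_CITATION, "ex:cit/1"),
    ("ex:rv/bmi", HAS_CITATION, "ex:cit/2"),
    ("ex:rv/waist", HAS_CITATION, "ex:cit/2"),
    ("ex:rv/waist", HAS_CITATION, "ex:cit/3"),
    ("ex:cit/1", DC_SUBJECT, "Obesity"),
    ("ex:cit/2", DC_SUBJECT, "Obesity"),
    ("ex:cit/3", DC_SUBJECT, "Waist Circumference"),
}


def match(subject=None, predicate=None, obj=None):
    """Yield triples matching the pattern; None acts as a wildcard."""
    for s, p, o in triples:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


def citations_of(variable):
    return {o for _, _, o in match(variable, HAS_CITATION)}


def shared_citations(var_a, var_b):
    """Relationship between variables: citations they have in common."""
    return citations_of(var_a) & citations_of(var_b)


def subjects_of(variable):
    """Relationship within a variable: subjects across its citations."""
    return {
        o
        for citation in citations_of(variable)
        for _, _, o in match(citation, DC_SUBJECT)
    }


print(shared_citations("ex:rv/bmi", "ex:rv/waist"))
print(sorted(subjects_of("ex:rv/waist")))
```

With a real RDF store, the two helper functions would correspond to graph pattern queries over the same triples.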
Possible Partner: Metadata Technology
• The MeSH vocabulary is used for indexing the journal
article citations hosted by PubMed
• PubMed hosts more than 23 million citations for
biomedical literature from MEDLINE, life science
journals, and online books
• PubMed supports both human searchers at its
portal and software agents by way of Entrez
• PubMed indexes citations using both MeSH
headings (Medical Subject Headings) and MeSH subheadings
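A software agent would typically reach PubMed through the Entrez E-utilities. The sketch below only builds an ESearch request URL for a MeSH heading and does not send it. The function name `mesh_search_url` and the example heading are illustrative assumptions; the endpoint and the `db`, `term`, `retmode`, and `retmax` parameters are those of the public ESearch utility.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def mesh_search_url(mesh_heading, retmax=20):
    """Build (but do not send) an ESearch URL for one MeSH heading."""
    params = {
        "db": "pubmed",
        # The [MeSH Terms] field tag restricts the match to MeSH indexing.
        "term": f'"{mesh_heading}"[MeSH Terms]',
        "retmode": "json",
        "retmax": str(retmax),
    }
    return f"{ESEARCH}?{urlencode(params)}"


url = mesh_search_url("Obesity", retmax=5)
print(url)

# Round-trip check: the encoded query decodes back to the intended term.
query = parse_qs(urlsplit(url).query)
print(query["term"][0])
```

A crawler could fetch this URL on a schedule and attach any newly returned PubMed identifiers to the relevant represented variable.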
DISCO COMPLETENESS
In DDI 4 might we want to revisit the
DISCO discovery view?
• Including more elements from the RDF Data Cube
Vocabulary (the qb namespace in DISCO) can lend
additional specificity to search:
– In which studies was a specific analysis undertaken
and reported?
– How comparable was the micro data that went into
these analyses?
• Including GSIM represented variables and
connecting elements from the Dublin Core
RDF Citation Vocabulary to represented variables
and datasets opens the way to an ecosystem of
crawlers:
– Software agents can search citation databases for new
publications
– Other data resources might be linked in
• They might include “existing domain-specific repositories,
institutional data repositories, or other resources including
commercial clouds”
Could there be more than one DISCO?
• Dublin Core motivates its Dublin Core Application
Profiles (DCAP) with this introduction:
– When it comes to metadata, one size does not fit all.
In fact, one size often does not even fit many. The
metadata needs of particular communities and
applications are very diverse. The result is a great
proliferation of metadata formats, even across
applications that have metadata needs in common.
– The Dublin Core Metadata Initiative has addressed
this by providing a framework for designing a Dublin
Core Application Profile (DCAP). A DCAP defines
metadata records which meet specific application
needs while providing semantic interoperability with
other applications on the basis of globally defined
vocabularies and models.
• In line with this vision, in its DCAP guidelines
document, Dublin Core introduces the Singapore
Framework
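One way to picture an application profile is as a set of constraints that a metadata record either meets or does not. The sketch below is a loose illustration of that idea, not the DCAP specification: the `ApplicationProfile` class, the two profile names, and the sample record are all hypothetical, and the `dc:` keys simply label Dublin Core elements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApplicationProfile:
    """Hypothetical profile: which properties a record must or may use."""
    name: str
    required: frozenset
    optional: frozenset = frozenset()

    def problems(self, record):
        """List the ways a record fails this profile (empty if it fits)."""
        keys = set(record)
        missing = sorted(self.required - keys)
        unexpected = sorted(keys - self.required - self.optional)
        return ([f"missing {m}" for m in missing]
                + [f"unexpected {u}" for u in unexpected])


# Two invented profiles drawing on the same shared vocabulary.
study_profile = ApplicationProfile(
    name="study-level discovery",
    required=frozenset({"dc:title", "dc:creator", "dc:identifier"}),
)
variable_profile = ApplicationProfile(
    name="variable-level discovery",
    required=frozenset({"dc:title", "dc:subject", "dc:identifier"}),
    optional=frozenset({"dc:creator"}),
)

record = {
    "dc:title": "Hypothetical study",
    "dc:creator": "Hypothetical author",
    "dc:identifier": "example-id-1",
}

print(study_profile.problems(record))
print(variable_profile.problems(record))
```

The same record satisfies one profile and falls short of the other, while both profiles stay interoperable because they draw on the same globally defined element names.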
Could there be more than one DISCO?
The Singapore Framework
• The Singapore Framework is a standard, not an
information model
• Perhaps the middle layer “Domain standards”
might be analogous to a DDI 4 Discovery package
• Then, in place of DISCO, there might be multiple
application profiles or, again, views
• In this context imagine that DDI 4 might publish
at least two such “official” ones
• If you had your druthers, what would these two
profiles be?