RDA_WGDataCitation_UseCaseDetails_UKDS

RDA WG Data Citation – Use Case Details
Use Case Name:
UK Data Service collection (ESRC data centre)
Institution:
University of Essex, UK Data Service http://www.ukdataservice.ac.uk/ (data)
www.data-archive.ac.uk (corporate)
Domain:
The UK Data Archive leads the Economic and Social Research Council’s (ESRC) flagship
national UK Data Service which is a comprehensive resource to support researchers,
teachers and policymakers who depend on high-quality social and economic data. We
provide a single point of access to a wide range of secondary data including large-scale
government surveys, international macrodata, business microdata, qualitative studies
and census data from 1971 to 2011.
Data Characteristics:
Our data are held as ‘collections’, each of which corresponds to a study, a unique piece of
fieldwork, or a data compilation (e.g. a digitisation project). Examples include
cross-sectional surveys, longitudinal surveys over time, a qualitative research project that
yields a set of interviews and images, or a historical database. We have over 6500
collections, and some are major UK survey series or censuses dating back to the 1960s.
We prepare a single catalogue record for each ‘collection’ which has about 30 elements
and the record is compliant with the Data Documentation Initiative (DDI) and also
pushed out as OAI.
Type:
Numeric, structured, unstructured text, images, audio.
Storage:
Metadata are stored as XML files and data are held in preservation-friendly formats on
a hierarchical preservation system.
Access:
Data are accessed primarily through web download via the catalogue as zip bundles in
specific user-oriented formats per collection. UK Federation authentication is required
for most but some are completely open. Standard formats are spss, stata and rtf, and
some mp3/jpg. We have a few online data browsing systems that enable direct
search, browsing and graphing of data held on servers, some with Shibboleth
authentication.
Current citation approach:
Each record is assigned a DataCite DOI when published (format is of the type:
10.5255/UKDA-SN-3314-1). We have a distinct methodology for assigning and versioning
DOIs. Basically we distinguish between low and high impact changes, with high impact
changes prompting a new DOI (with an increment -1, -2 etc.). The DOI resolves to a jump
page which lists the history of changes (see 10.5255/UKDA-SN-7037-3). Access is only
provided for the most current version, as often changes have been made due to errors, or
updates that make the older versions inadvisable to use. Also we have no requests for
older versions, as most users are looking for the most up to date information. Our low
impact changes, which do not prompt a change of DOI, include correcting typos or other
small changes in labels. Higher impact changes include addition or removal of a variable
or significant new documentation. We use the APA citation format: Office for National
Statistics. Social Survey Division and Northern Ireland Statistics and Research Agency.
Central Survey Unit, Quarterly Labour Force Survey, January - March, 2012 [computer
file]. 3rd Edition. Colchester, Essex: UK Data Archive [distributor], November 2013. SN:
7037, http://dx.doi.org/10.5255/UKDA-SN-7037-3
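The versioning rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DOI pattern is inferred from the example DOIs in this document, and the change-type labels and function names are hypothetical, not part of any UK Data Service system.

```python
import re
from typing import NamedTuple

# Pattern inferred from the example DOIs in this document (an assumption).
DOI_PATTERN = re.compile(r"^(10\.5255/UKDA-SN-)(\d+)-(\d+)$")

# Hypothetical labels for the kinds of change named in the text.
LOW_IMPACT = {"typo_fix", "label_edit"}
HIGH_IMPACT = {"variable_added", "variable_removed", "new_documentation"}


class VersionedDoi(NamedTuple):
    prefix: str
    study_number: int
    version: int

    def __str__(self) -> str:
        return f"{self.prefix}{self.study_number}-{self.version}"


def parse_doi(doi: str) -> VersionedDoi:
    """Split a DOI of the form 10.5255/UKDA-SN-<study>-<version> into parts."""
    match = DOI_PATTERN.match(doi)
    if match is None:
        raise ValueError(f"unexpected DOI format: {doi!r}")
    prefix, study_number, version = match.groups()
    return VersionedDoi(prefix, int(study_number), int(version))


def doi_after_change(doi: str, change_type: str) -> str:
    """Return the DOI that applies after a change of the given type."""
    parsed = parse_doi(doi)
    if change_type in HIGH_IMPACT:
        # High impact changes prompt a new DOI with an incremented suffix.
        return str(parsed._replace(version=parsed.version + 1))
    if change_type in LOW_IMPACT:
        # Low impact changes keep the existing DOI.
        return doi
    raise ValueError(f"unknown change type: {change_type!r}")
```

For example, calling `doi_after_change` on a DOI ending in `-3` with `"variable_added"` yields the same DOI ending in `-4`, while `"typo_fix"` leaves it unchanged.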
We also run an Eprints data repository (http://reshare-integration.ukdataservice.ac.uk/)
for longer tail research data, that assigns DOIs at the point of publishing but we have not
yet agreed on how to show changes in versions, or whether we allow access to older
versions.
Ideal way of citing:
While we think we do have the ideal solution for our curated collection, we do wish to
look at how these relate to citation of data accessed via our online data access systems,
which are often versions of data in different formats (often more open, e.g. XML). We still
need to investigate persistent citation for subsets of our numeric data created on the fly
via our online access systems, such as our survey browser and analysis tool, Nesstar (see
later under subsets).
Other aspects:
None at the moment
Contact info:
Louise Corti (corti at essex dot ac dot uk)
Details:
The data model:
For data that are downloaded, we create zip bundles from a directory structure organised
by type of file, so that data are distinguished and bundled by file type, mostly SPSS,
STATA, RTF and EXCEL. The XML files in our online systems make use of the DDI,
QuDEX and TEI metadata schemas. http://data-archive.ac.uk/curate/standardstools/metadata
Versioning/timestamping:
Current or planned
Data are timestamped. Dynamic data are treated as a series of editions; each addition of
data replaces and supersedes the previous published dataset. For some of our collections, a
versioned DOI is used with a full version history of changes made available.
Dynamics:
e.g. how much data, how much added in which time intervals, any corrections/updates or
just additions
Most data are one-off, and we only use DOIs for dynamic data, which account for about
750 of our 7000 collections. Some of our longitudinal datasets that occur in sweeps of annual
fieldwork are updated as new editions when the data are incrementally merged into a single
file (see 10.5255/UKDA-SN-6614-5). In other cases, where we have quarterly updates, the data
are updated as new editions every 3 months. We also initiate a versioned DOI when we have a
version change, e.g. changes made to existing data rather than the addition of a new wave of
data to the existing set.
Screenshots:
Interface/workbench that researchers are using to create subsets
The DOI field is clearly visible in our DDI collection-level catalogue record. It resolves to a
landing page with a version history.
Example of subsets:
How they were created, what they look like, to get a feeling of what/how researchers would
like to use the data and cite it.
We don’t manually create DOIs for subsets, except in the small number of cases where we
create special teaching datasets, which are cut-down datasets with a restricted number of
variables available. These get a new DOI.
For our online data access tools we use two approaches.
1. Our text browsing system QualiBank enables citation of parts of a collection and of a
text object (http://discover.ukdataservice.ac.uk/QualiBank). This system enables searching and
browsing of qualitative data, interviews and open-ended questions. We have introduced
object and paragraph level citation, again using the APA citation format. We are not using
DOIs here, as each object comes from a higher level UK Data Service collection (already using
the DOI method above), but we will consider assigning object-level DOIs in this system. We
have structured metadata for each citation, which makes use of system-level GUIDs. When
paragraphs are selected from a web page displaying text, the GUIDs are aggregated into a
new citation object stored in a live citation database. We use a limited citation metadata
schema for this.
Each text fragment has a persistent guid (prefixed with "q-" in the QualiBank system).
When the user selects one or more fragments in the UI, these fragment guids (and the
parent text document guid) are assembled by JavaScript into an HTTP GET. This invokes
an XQuery on the citation XML database via a RESTful API and passes the text fragment
guids plus the parent text document guid as parameters in the URL. The XQuery:
a. Creates a persistent citation identifier and concatenates a citation URL.
b. Looks up the DOI of the parent dataset of the text document guid.
c. Looks up other DDI2.5 metadata associated with the dataset.
d. Concatenates readable citation text using the above values.
e. Inserts an xml citation record into the database (including all of the original text
fragment guids).
f. Returns a JSON response to the UI, including citation text for the user to cite. This
includes the DOI of the dataset and a citation URL to enable a user to return to
and highlight the relevant text fragments later.
Note that the DOI itself is *not* generated by this process; it is retrieved from elsewhere
as a pre-existing identifier for the parent dataset. Nor is the DOI the same as the citation
identifier (in this system anyway).
2. For our numeric data online system, Nesstar, we do not use DataCite DOIs. The URI is
persistent and can refer to subsets. The following link subsets a single variable (sex) for
the whole dataset, prior to download.
http://nesstar.ukdataservice.ac.uk/webview/index.jsp?v=2&s=V3&analysismode=table&study=http%3A%2F%2F155.245.69.3%3A80%2Fobj%2FfStudy%2F7140&gs=%2CVG%2CVG1%2C%2C&var1=http%3A%2F%2Fdanessukds.essex.ac.uk%3A80%2Fobj%2FfVariable%2F7140_V3&mode=download&top=yes
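To see what the Nesstar link encodes, its query string can be decoded with Python's standard library. The sketch below assumes the link is a single URL that was only wrapped across lines in this document; the function name and the keys of the returned dictionary are illustrative choices, not Nesstar terminology.

```python
from urllib.parse import parse_qs, urlsplit

# The example link from the text, rejoined into one string (assumes the
# breaks in the source were line wraps only).
NESSTAR_URL = (
    "http://nesstar.ukdataservice.ac.uk/webview/index.jsp"
    "?v=2&s=V3&analysismode=table"
    "&study=http%3A%2F%2F155.245.69.3%3A80%2Fobj%2FfStudy%2F7140"
    "&gs=%2CVG%2CVG1%2C%2C"
    "&var1=http%3A%2F%2Fdanessukds.essex.ac.uk%3A80%2Fobj%2FfVariable%2F7140_V3"
    "&mode=download&top=yes"
)


def describe_subset(url):
    """Decode the study and variable references carried in the query string."""
    # parse_qs percent-decodes values and returns a list per parameter.
    params = parse_qs(urlsplit(url).query)
    study_uri = params["study"][0]
    variable_uri = params["var1"][0]
    return {
        "study_uri": study_uri,
        "study_id": study_uri.rsplit("/", 1)[-1],
        "variable_uri": variable_uri,
        "variable_id": variable_uri.rsplit("/", 1)[-1],
        "mode": params["mode"][0],
    }


subset = describe_subset(NESSTAR_URL)
```

Here `subset["study_id"]` is `"7140"` and `subset["variable_id"]` is `"7140_V3"`, which shows that the link identifies one variable within one study rather than the whole collection.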
We are still considering how useful it is to enable citations of subsets at this level or to refer
to the whole data collection.