RDA WG Data Citation – Use Case Details

Use Case Name: UK Data Service collection (ESRC data centre)

Institution: University of Essex, UK Data Service. http://www.ukdataservice.ac.uk/ (data), www.data-archive.ac.uk (corporate)

Domain: The UK Data Archive leads the Economic and Social Research Council’s (ESRC) flagship national UK Data Service, a comprehensive resource to support researchers, teachers and policymakers who depend on high-quality social and economic data. We provide a single point of access to a wide range of secondary data including large-scale government surveys, international macrodata, business microdata, qualitative studies and census data from 1971 to 2011.

Data Characteristics: Our data are held as ‘collections’, each of which corresponds to a study, a unique piece of fieldwork, or a data compilation (e.g. a digitisation project). Examples include cross-sectional surveys, longitudinal surveys, a qualitative research project that yields a set of interviews and images, or a historical database. We have over 6500 collections, and some are major UK survey series or censuses dating back to the 1960s. We prepare a single catalogue record for each ‘collection’. The record has about 30 elements, is compliant with the Data Documentation Initiative (DDI), and is also pushed out via OAI.

Type: Numeric, structured, unstructured text, images, audio.

Storage: Metadata are stored as XML files and data are held in preservation-friendly formats on a hierarchical preservation system.

Access: Data are accessed primarily through web download via the catalogue, as zip bundles in specific user-oriented formats per collection. UK Federation authentication is required for most collections, but some are completely open. Standard formats are SPSS, Stata and RTF, plus some MP3/JPG. We have a few online data browsing systems that enable direct search, browsing and graphing of data held on servers, some with Shibboleth authentication.
Current citation approach: Each record is assigned a DataCite DOI when published (format is of the type: 10.5255/UKDA-SN-3314-1). We have a distinct methodology for assigning and versioning DOIs. We distinguish between low and high impact changes, with high impact changes prompting a new DOI (with an incremented suffix: -1, -2 etc.). The DOI resolves to a jump page which lists the history of changes (see 10.5255/UKDA-SN-7037-3). Access is only provided for the most current version, as changes have often been made to correct errors or to apply updates that make the older versions inadvisable to use. We also receive no requests for older versions, as most users are looking for the most up-to-date information. Low impact changes, which do not prompt a change of DOI, include correcting typos or other small changes in labels. High impact changes include addition or removal of a variable or significant new documentation.

We use the APA citation format:

Office for National Statistics. Social Survey Division and Northern Ireland Statistics and Research Agency. Central Survey Unit, Quarterly Labour Force Survey, January - March, 2012 [computer file]. 3rd Edition. Colchester, Essex: UK Data Archive [distributor], November 2013. SN: 7037, http://dx.doi.org/10.5255/UKDA-SN-7037-3

We also run an EPrints data repository (http://reshare-integration.ukdataservice.ac.uk/) for longer tail research data, which assigns DOIs at the point of publishing, but we have not yet agreed on how to show changes in versions, or whether to allow access to older versions.

Ideal way of citing: While we think we have the ideal solution for our curated collection, we wish to look at how it relates to citation of data accessed via our online data access systems, which often hold versions of the data in different formats (often more open, e.g. XML).
We still need to investigate persistent citation for subsets of our numeric data created on the fly via our online access systems, such as our survey browser and analysis tool, Nesstar (see later under subsets).

Other aspects: None at the moment

Contact info: Louise Corti (corti at essex dot ac dot uk)

Details:

The data model: For data that are downloaded, we enable zip bundles to be created from a directory structure for different types of files, with data distinguished and bundled by file type, mostly SPSS, Stata, RTF and Excel. The XML files in our online systems make use of the DDI, QuDEx and TEI metadata schemas. http://data-archive.ac.uk/curate/standardstools/metadata

Versioning/timestamping: Current or planned
Data are timestamped. Dynamic data are treated as a series of editions; each addition of data replaces and supersedes the previously published dataset. For some of our collections, a versioned DOI is used, with a full version history of changes made available.

Dynamics: e.g. how much data, how much added in which time intervals, any corrections/updates or just additions
Most data are one-off, and we only use DOIs for dynamic data, which account for about 750 of some 7000 collections. Some of our longitudinal datasets that are collected in annual sweeps of fieldwork are updated as new editions when the data are incrementally merged into a single file (see 10.5255/UKDA-SN-6614-5). In other cases, where we have quarterly updates, new editions are released every 3 months. We also initiate a versioned DOI when there is a version change, e.g. changes made to existing data, as opposed to adding a new wave of data to the existing set.

Screenshots: Interface/workbench that researchers are using to create subsets
The DOI is clearly visible in our DDI collection-level catalogue record. It refers to a landing page with a version history.

Example of subsets: How they were created, what they look like, to get a feeling of what/how researchers would like to use the data and cite it.
We don’t manually create DOIs for subsets, except in the small number of cases of specially created teaching datasets, which are cut-down datasets with a restricted amount of content available. These get a new DOI.

For our online data access tools we use two approaches.

1. Our text browsing system, QualiBank, enables citation of parts of a collection and of a text object (http://discover.ukdataservice.ac.uk/QualiBank). This system enables searching and browsing of qualitative data, interviews and open-ended questions. We have introduced object and paragraph level citation, again using the APA citation format. We are not using DOIs here, as each object comes from a higher level UK Data Service collection (already using the DOI method above), but we will consider assigning object level DOIs in this system. We have structured metadata for each citation, which makes use of system level GUIDs. When paragraphs are selected from a web page displaying text, the GUIDs are aggregated into a new citation object stored in a live citation database. We use a limited citation metadata schema for this.

Each text fragment has a persistent GUID (prefixed with "q-" in the QualiBank system). When the user selects one or more fragments in the UI, these fragment GUIDs (and the parent text document GUID) are assembled by JavaScript into an HTTP GET. This invokes an XQuery on the citation XML database via a RESTful API and passes the text fragment GUIDs plus the parent text document GUID as parameters in the URL. The XQuery:
a. Creates a persistent citation identifier and concatenates a citation URL
b. Looks up the DOI of the parent dataset of the text document GUID
c. Looks up other DDI 2.5 metadata associated with the dataset
d. Concatenates readable citation text using the above values
e. Inserts an XML citation record into the database (including all of the original text fragment GUIDs)
f. Returns a JSON response to the UI, including citation text for the user to cite. This includes the DOI of the dataset and a citation URL to enable the user to return to and highlight the relevant text fragments later.

Note that the DOI itself is *not* generated by this process; it is retrieved from elsewhere as a pre-existing identifier for the parent dataset. Nor is the DOI the same as the citation identifier (in this system, at least).

2. For our numeric data online system, Nesstar, we do not use DataCite DOIs. The URI is persistent and can refer to subsets. This link subsets a single variable (sex) for the whole dataset, prior to download:

http://nesstar.ukdataservice.ac.uk/webview/index.jsp?v=2&s=V3&analysismode=table&study=http%3A%2F%2F155.245.69.3%3A80%2Fobj%2FfStudy%2F7140&gs=%2CVG%2CVG1%2C%2C&var1=http%3A%2F%2Fdanessukds.essex.ac.uk%3A80%2Fobj%2FfVariable%2F7140_V3&mode=download&top=yes

We are still considering how useful it is to enable citation of subsets at this level, as opposed to referring to the whole data collection.
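
To make the QualiBank citation steps (a to f, under approach 1 above) more concrete, here is a minimal sketch in Python. It is not the production implementation, which the text describes as an XQuery running against an XML citation database behind a RESTful API. Everything named here is an assumption made for illustration: the in-memory dictionaries DATASETS and CITATION_DB, the base URL, the "c-" identifier prefix, the placeholder DOI and metadata, and the layout of the citation text.

```python
import uuid

# Assumed base URL for citation links; the real service URL is not given in the text.
CITATION_BASE_URL = "https://example.org/citation/"

# Hypothetical stand-in for the catalogue: parent text document GUID -> dataset metadata.
# The DOI and metadata values are placeholders, not real records.
DATASETS = {
    "q-doc-1": {
        "doi": "10.1234/EXAMPLE-1",
        "creator": "Example, A.",
        "title": "Example interview collection",
        "year": "2014",
    }
}

# Hypothetical stand-in for the live citation database.
CITATION_DB = {}


def create_citation(document_guid, fragment_guids):
    """Follow steps a-f for one selection of text fragments."""
    if not fragment_guids:
        raise ValueError("at least one text fragment must be selected")

    # a. Create a persistent citation identifier and build the citation URL.
    citation_id = "c-" + uuid.uuid4().hex
    citation_url = CITATION_BASE_URL + citation_id

    # b./c. Look up the pre-existing DOI and other metadata of the parent dataset.
    #       The DOI is retrieved here, never generated.
    dataset = DATASETS[document_guid]

    # d. Concatenate readable citation text from the values above.
    citation_text = "{creator} ({year}). {title}. doi:{doi}. Cited extract: {url}".format(
        url=citation_url, **dataset
    )

    # e. Store a citation record that keeps all of the original fragment GUIDs.
    CITATION_DB[citation_id] = {
        "document_guid": document_guid,
        "fragment_guids": list(fragment_guids),
        "doi": dataset["doi"],
        "citation_text": citation_text,
    }

    # f. Return a JSON-style response for the UI.
    return {
        "citation_id": citation_id,
        "citation_url": citation_url,
        "citation_text": citation_text,
        "doi": dataset["doi"],
    }


response = create_citation("q-doc-1", ["q-frag-a", "q-frag-b"])
print(response["citation_text"])
```

The sketch keeps the two identifiers separate, as the text stresses: the DOI identifies the parent dataset and is only looked up, while the citation identifier is minted per selection and is what allows the stored fragment GUIDs to be retrieved and highlighted later.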