Best practices for sharing and archiving datasets

advertisement
SUMMARY
Best Practices for Sharing and Archiving Datasets
22 September 2011
The following summarizes best practices for datasets contributed to the Polar Data Catalogue
(PDC, www.polardata.ca). Extended best practices are available in the document “Polar Data
Catalogue Best Practices for Sharing and Archiving Datasets” available on the PDC website.
1. Providing Metadata
 This is the first step for datasets. Metadata are published online in the PDC in FGDC
format. Information on how to create metadata is available at www.polardata.ca.
 Title of the metadata record should be representative of the accompanying dataset.
 It may be necessary to update metadata records after preparation of datasets for
submission to the PDC.
2. Assigning Descriptive Metadata Titles and Data File Names
 Titles and names should be as clear and descriptive as possible and unique for each
metadata record and data file. Remember that this information may be accessed by
people unfamiliar with the project.
 Data file names may contain acronyms such as project, study site, etc.
 It is recommended that file names be included in the header rows of the data file itself.
 File names should contain only numbers, letters, dashes, and underscores – no spaces
or special characters. In PDC, file names should not be more than 150 characters in
length and should include data file creation date or version number.
 File Type or Extensions *.txt and *.csv are preferred for tabular data (but *.xls may be
acceptable - avoid *.xlsx as it is not backward-compatible).
 Example of good metadata and data file names:
Metadata record: HPLC Pigment Analysis of the Phytoplankton Community in
Franklin Bay
Data file: CASES_HPLC_Franklin_20090914.xls
(Note: The CCIN Reference Number will be automatically pre-pended to the file name upon file
submission to the PDC. In this case, the final file name would be
CCIN226_CASES_HPLC_Franklin_20090914.xls)
3. Using Consistent and Stable File Formats for Tabular and Image Data
 A consistent format that can be read well into the future and is independent of changes
in applications is preferred.
 TEXT or ASCII files have the best longevity. Microsoft Excel is also now widely used
and is acceptable but not preferred; Excel files can easily be converted into text (*.txt)
files.
 Consistent file format should be used for all data files belonging to the same project.
 Figures and analyses should be reported in companion documentation.
 The first row should contain descriptors that link the data file to the dataset/metadata
(e.g., data file name, dataset/metadata title, author, date the data within the file were
last modified, and companion file names).
 Non-proprietary file formats are preferred for image (raster) data.
1


Data that are provided in a proprietary software format must include documentation of
the software specifications.
Common formats such as JPG, PNG, mpeg, wmv and avi are preferred for photo and
video files.
4. Defining the Contents of Data Files
 Parameter names, units, and coded fields of datasets must be defined. These can be
placed in a table in a companion document. Examples are provided in the full PDC
Best Practices document.
5. Using Consistent Data Organization
 The most frequent data file organization consists of a matrix in which each row
represents a complete record, and the columns represent all the parameters that make
up the record. Examples are provided in the full PDC Best Practices document.
 Similar information should be kept together.
 Files should be organized by data type when appropriate.
6. Performing Basic Quality Assurance
 Data quality is the responsibility of the researcher.
 File format, missing values, coordinates, documentation, etc. should be checked.
 Projection values for image vector and raster data should be verified.
7. Providing Dataset Documentation
 Accompanying documents (and metadata records) need to be written for a user who is
unfamiliar with the project, sites, methods, or observations. Documentation files
should be provided in non-proprietary formats and should include examples of data
included in the dataset.
 The dataset documentation template below is provided below to facilitate completion
of documentation. The template is also available in the “Polar Data Catalogue Best
Practices for Sharing and Archiving Datasets” document and on the PDC web site.
For assistance with ArcticNet metadata and data, please contact Josée Michaud, Data
Manager at ArcticNet, 418-656-2411, pdc@arcticnet.ulaval.ca
For assistance with IPY metadata and data, please contact your project’s Data Assembly
Centre Network coordinator:
 Julie Friddell, Polar Data Catalogue, 519-888-4567 x32689, julie.friddell@uwaterloo.ca
 Chuck Humphrey, University of Alberta, 780-492-9216, chuck.humphrey@ualberta.ca
 Lindsay Johnston, University of Alberta, 780-492-5946, lindsay.johnston@ualberta.ca
 Amber Leahey, Scholars Portal, 416-978-6177, amber.leahey@utoronto.ca
 Stephen Marks, Scholars Portal, 416-946-0300, stephen@scholarsportal.info
 Mathieu Ouellet, DFO, 613-990-8570, mathieu.ouellet@dfo-mpo.gc.ca
2
README file template
This page may be saved as a TEXT file named “README_{Metadata/Dataset
title}_{Today’s date}.txt” and submitted along with the data files. The outline below should
be completed with information relevant to the submitted dataset:
Mandatory information:
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
File names, directory structure (for complex datasets), and brief description of each file
or file type:
Definitions of acronyms, site abbreviations, or other project-specific designations used
in the data file names or documentation files, if applicable:
Definitions of special codes, variable classes, GIS coverage attributes, etc. used in the
data files themselves, including codes for missing data values, if applicable:
Description of the parameters (column headings in the data files) and units of measure
for each parameter/variable:
Uncertainty, precision, and accuracy of measurements, if known:
Environmental conditions, if appropriate (e.g., cloud cover, atmospheric influences,
etc.):
Method(s) for processing data, if data other than raw data are being contributed:
Standards or calibrations that were used:
Specialized software (including version number) used to prepare and/or needed to read
the dataset, if applicable:
Quality assurance and quality control that have been applied, if applicable:
Known problems that limit the data's use or other caveats (e.g., uncertainty, sampling
problems, blanks, QC samples):
Date dataset was last modified:
Related or ancillary datasets outside of this dataset, if applicable:
Optional information:
14.
15.
16.
Methodology for sample treatment and/or analysis, if applicable:
Example records for each data file (or file type):
Files names of other documentation that are being submitted along with the data and
that would be helpful to a secondary data user, such as pertinent field notes or other
companion files, publications, etc.:
3
Download