Responsible Data Use and Local Data Management
Ruth Duerr
National Snow and Ice Data Center
Responsible Data Use and Local Data Management; Presented 8 Nov 2011, APECS Webinar

Overview
• Responsible Data Use
  • Fair access and use
  • Data restrictions
  • Citation and credit
  • Providing feedback
• Local Data Management
  • File names
  • Directory structures
  • Backing up your data
  • Data formats
  • Documentation and metadata

Responsible Data Use
(or what you should do if you find yourself re-using someone else’s data)

Your Responsibilities as a Data User
• Determining the suitability of data for your purposes
• Following applicable data access and use policies
• Giving credit to archives and data creators
• Providing the data source with feedback about any errors or limitations you discover in the data

Just because it is “good” data doesn’t mean that it is right for your project!
Corollary: Just because it isn’t right for your project doesn’t mean that it is “bad” data!

Hints for Determining Data Suitability
• Read any papers, documentation, and metadata provided; they are there for a reason!
  • See http://nsidc.org/data/mod10a1v5.html for an example of a fairly well-documented data set
• If you still have questions, assess what support is available and, if it is acceptable, ask!
  • See http://nsidc.org/data/g02199.html for an example of a poorly documented data set with an extremely low level of available support
• Be aware that, because of documentation and support limitations, the best data for your purposes may not be available to or usable by you

A few words about data access and use
• The trend in many disciplines is towards greater data sharing, but…
• Norms vary by discipline (and country); for example, you may need to
  • Submit an application for access
  • Sign a data transfer and usage agreement
  • Travel to the repository to obtain access
• Moreover, there are legitimate reasons for restricting access, for example:
  • To protect the confidentiality of human subjects
  • To protect the rights of local and traditional knowledge holders
  • To protect information that, if released, may cause harm (e.g., location of endangered species, sacred sites, etc.) [1]
• It is your responsibility to understand and follow the norms for the data you are using
[1] See the IPY Data Policy at classic.ipy.org/Subcommittees/final_ipy_data_policy.pdf

Would you share your data if you didn’t know that you were going to be given credit for your work?
So cite the data you use!

Data Citation – Now
• Currently data citation standards and requirements vary
  1. From journal to journal
  2. From repository to repository
  3. From discipline to discipline
  4. Sometimes from author to author
• Do your best to honor these existing norms
• What might a data citation look like?
  • Zwally, H.J., R. Schutz, C. Bentley, J. Bufton, T. Herring, J. Minster, J. Spinhirne, and R. Thomas. 2003. GLAS/ICESat L1A Global Altimetry Data V018, 15 October to 18 November 2003. National Snow and Ice Data Center.
    Data set accessed 2011-07-21 at doi:10.3334/NSIDC/gla01.

Data Citation – In the Near Future
• DataCite and other groups are working to make data citation a normal part of the scientific process
• For example, as of this year Thomson Reuters Web of Science and Web of Knowledge include published data sets (i.e., those that have a DOI)

Why provide feedback?
• Prevent other users from repeating your mistakes
• Improve the data or their documentation
• Better science, perhaps even new results, papers, and collaborators

A few words about providing feedback
• Feedback to a PI
  • Your reasons for using someone else’s data are likely different from their reasons for acquiring the data in the first place
  • So they probably weren’t thinking of your needs when they acquired, documented, and released the data
  • Yet if they thought their data would be useful to a community, they will probably be eager to help
  • Diplomacy and tact may be called for (especially if you really think you’ve found an error, not just a documentation problem)
• Feedback to a data center is almost always welcome

Local Data Management
(or managing your own data)

The 5 P’s matter!
(prior planning prevents poor performance)

Local Data Management
• File names
• Directory structures
• Backing up your data
• Data formats
• Documentation and metadata

Dilbert’s file naming convention

Assign descriptive file names
• File names should be unique and reflect the file contents
• Bad file names
  • Mydata
  • 2001_data
• A better file name might be
  • bigfoot_agro_2000_gpp.tif
    • BigFoot is the project name
    • Agro is the field site name
    • 2000 is the calendar year
    • GPP represents Gross Primary Productivity data
    • tif is the file type (GeoTIFF)
• But only if you document the naming convention!

Organize files logically
• Make sure your file system is logical and efficient

Biodiversity
  Lake
    Experiments
      Biodiv_H20_heatExp_2005_2008.csv
      Biodiv_H20_predatorExp_2001_2003.csv
      …
    Field work
      Biodiv_H20_planktonCount_start2001_active.csv
      Biodiv_H20_chla_profiles_2003.csv
      …
  Grassland

Courtesy of S. Hampton, UC-Santa Barbara

Back Up Your Data!!!

Why?
You think it's easy to recover data from
• A broken DVD?
• A burned-up memory stick?
• A drowned laptop?
• A crashed hard drive?

Backing up your data files
• Create backup copies often
  • Ideally keep three copies: the original, one on-site (external), and one off-site
• Base the backup frequency on need and risk
  • Higher-value data should be backed up more often
  • Sensor data collected at high frequency should be backed up more frequently
• Ensure that all backup copies are identical to the original files
  • Use checksums or file comparisons

Test your backups
• Automatically test backup copies of files frequently to ensure they are viable
  • Media degrade over time
  • Test copies using checksums or file comparisons
• Be certain that you can recover from a data loss
  • Periodically test your ability to restore information (at least once a year)
  • Simulate an actual loss by trying to recover solely from the backed-up copies

Data Formats – Best Practices
• Don’t use a proprietary format!
  • These have a short shelf life and will probably become unreadable after a few years
• Don’t invent your own format!
  • No one but you will have the tools to read it
• Use open source, well-documented, community-based standard formats wherever possible, especially if they are self-describing

Self-describing data formats
• Information describing the data contents of the file is embedded within the data file itself:
  • Names for various fields
  • Data types: standardized, portable, machine independent
  • Pointers to various fields, making it efficient to extract the particular fields you want without reading the entire file
  • Attributes and flags related to the primary fields, with extra information such as units, fill values, etc.
• These formats come with a standard API and portable data access libraries in a variety of languages
• There are tools that can open and work with arbitrary files, using the embedded descriptions to interpret the data

Some example self-describing formats
• HDF – Hierarchical Data Format
  • HDF4 and HDF5 versions are in use today
  • A NASA variant called HDF-EOS is used within the Earth Observing System program
• NetCDF – Network Common Data Form
  • Widely used by agencies including NASA and NOAA
  • Climate and Forecast (CF) metadata conventions help standardize how data are described in NetCDF files

Documentation (metadata)

Poor data practice results in loss of information
[Figure: information content over time. Labels: time of publication; specific details; general details; retirement or career change; accident; death. (Michener et al. 1997)]

Don't you think it would be more efficient
• If you didn't have to remember
  • the name of that file?
  • and the directory where you put it?
  • the units those measurements were taken in?
  • which sample site was which?
  • etc.

Making Your Research Easier and Cheaper
Write it down!

Needed Documentation
Who: Data Set Creator and Contact
Where: Geographic Extent and Location of Data Set Coverage
What: Title of Data Set and Keywords Describing the Data Set
When: Temporal Coverage of the Data Set
Why: Description and Purpose of the Data Set
How: How the Data Set was Created and How to Access the Data

Documentation Best Practices
• Document your conventions as they’re established
• Revise documentation contemporaneously, not “after the fact”
• This work is also the basis for end-user or reviewer documentation
• What should you document? Everything!
  • Data import, manipulation, QC procedures, special flags and encoding
  • Naming conventions, layouts, headings, units, and abbreviations
    • Does TEMP mean “temporary,” “air temperature at time of observation,” or something else?
  • Formulae and constants

References and Resources
• Michener, W.K., J.W. Brunt, J.J. Helly, T.B. Kirchner, and S.G. Stafford. 1997. “Nongeospatial metadata for the ecological sciences.” Ecological Applications 7(1):330-342.
• Data management training materials in development are available at http://wiki.esipfed.org/index.php/Data_Management_Course_Outline
• A short list of data management related resources available on the web can be found at http://wiki.esipfed.org/index.php/Data_Management_Resources
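
Example: checking a backup with checksums
The backup slides recommend using checksums or file comparisons to confirm that backup copies are identical to the originals. The sketch below is one possible way to do that in Python, not a prescribed tool. It assumes the backup mirrors the original directory layout; the function names sha256_of and verify_backup, and the choice of SHA-256, are illustrative assumptions and do not come from the presentation.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(original_dir: Path, backup_dir: Path) -> List[str]:
    """List files that are missing from, or differ in, the backup.

    Hypothetical helper: assumes backup_dir mirrors original_dir.
    """
    problems = []
    for original in sorted(original_dir.rglob("*")):
        if not original.is_file():
            continue  # skip directories
        relative = original.relative_to(original_dir)
        copy = backup_dir / relative
        if not copy.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(original) != sha256_of(copy):
            problems.append(f"differs: {relative.as_posix()}")
    return problems
```

Calling verify_backup(Path("data"), Path("backup/data")) returns an empty list when every original file has an identical copy; both paths are placeholders. Running a check like this on a schedule is one way to notice media degradation before a restore is needed.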