Adding Value to Oceanographic Data at the British Oceanographic Data Centre
Roy Lowry, Lesley Rickards and Juan Brown
British Oceanographic Data Centre
PV2005 Conference, Edinburgh

Presentation Overview
- The oceanographic data management culture
- Physical security of data at BODC
- Adding value to data through format harmonisation, quality assurance and accessibility
- Adding value to data through metadata enhancement
- Achievements
- Some plans for further enhancement
- Conclusions

Culture
- The marine domain has no respect for national boundaries
- Oceanographic sampling is sparse
  - Research vessels and moorings are small dots on a map covering 67% of the Earth’s surface
  - Satellites give global coverage but only of the surface waters
- Oceanographic measurements are extremely expensive to collect and unrepeatable as the environment is constantly changing
- These factors combine to make sharing data an attractive proposition

Culture
- Consequently, there is a long history of national and international oceanographic data management and curation
- The Intergovernmental Oceanographic Commission (IOC) founded the International Oceanographic Data and Information Exchange (IODE) network in 1961

Culture
- The IODE system forms a worldwide service-oriented network of
  - DNAs (Designated National Agencies)
  - NODCs (National Oceanographic Data Centres)
  - RNODCs (Responsible National Oceanographic Data Centres)
  - WDCs (World Data Centres – Oceanography)
- During the past 40 years, IOC Member States have established over 60 oceanographic data centres in as many countries
- This network has been able to collect, control the quality of, and archive millions of ocean observations, and makes the data available to Member States.

Culture
- BODC is the UK’s NODC in the IODE network
- Formed as the Marine Information Advisory Service Data Banking Section in 1979
- Established as BODC in 1989

Culture
- Data curation is high on the agenda in other environmental science domains in the UK
- BODC is one part of a network of designated data centres maintained by the Natural Environment Research Council (NERC)

Culture
- BODC is NERC’s designated data centre for marine science
- Responsible for assuring the long-term stewardship of NERC-funded oceanographic data
- Currently has 35 staff hosted by the Proudman Oceanographic Laboratory on the Liverpool University campus

Physical Security
- Three parts to BODC’s physical security strategy for data
  - Accession system
  - Backup system
  - Media preservation

Accession System
- Incoming data transferred to BODC Archive
- Files copied as is, with lowest common denominator (usually flat ASCII) conversions where appropriate
- Metadata record describing what was submitted, when and from whom

Backup System
- Comprehensive backup of file systems and Oracle database to DLT
- Managed combination of full and incremental backups
- Procedures to guard against backup media degradation
- Tapes stored in fire safes at two locations on campus and one external location

Media Preservation
- BODC has received data on
  - Paper (cards, listings and tape)
  - Magnetic tape (7-track, 9-track, QIC, DLT)
  - Floppy disks of many sorts (8”, 5.25”, 3.5”, 3”)
  - CD-ROM and DVD-ROM
  - Optical disks
  - FTP and e-mail

Media Preservation
- Data from all incoming media are transferred to the BODC archive
- Archive physical media have been
  - 9-track magnetic tape
  - Phase-change optical disk
  - Magneto-optical disk
- Currently use spinning disk (backed up to DLT) and a mirrored tape robot mass store
- Four physical media migrations of
  exponentially increasing data volumes

Adding Value to Data
- ‘High’ volume data are harmonised into a single format (NetCDF subset) from over 300 formats submitted
- ‘Low’ volume data are integrated into a unified schema in an Oracle database
- Data are visually inspected as graphical presentations by scientifically skilled staff
- Problem data are marked by manipulating quality control flags tagging each data value

Adding Value to Data
- What do these procedures achieve?
  - Syntactic harmonisation and data integration ease data synthesis in a poorly standardised data culture
  - Flagging facilitates data recycling
    - Exposes ‘hidden’ data quality issues, facilitating reuse with no reference to the data originator
    - Allows data to be used for purposes other than those for which it was collected – e.g. outliers are tolerable when determining means, but not extremes

Adding Value to Data
- Lost (or hidden) data have no value
- NERC’s marine data are available through BODC for reuse to:
  - The UK marine science community
  - The IOC IODE data centre network
  - The European SEA-SEARCH community
- Further enhancement through database federation under development
  - Across NERC through NERC DataGrid
  - Across European data centres through SeaDataNet

Metadata Enhancement
- BODC’s data stewardship mission is to facilitate data recycling on a decadal timescale with no need for consultation with the originator
- This is essential to provide reliable baselines for the quantification of Global Change
- …and it can only be achieved through metadata enhancement

Metadata Enhancement
- Metadata content
  - BODC metadata content model specifies minimum requirement for description of oceanographic data
  - Model is populated by collation of information from multiple sources, including resolution of contradictory information
  - Missing information is vigorously pursued

Metadata
Enhancement
- Metadata content
  - Result is a guaranteed minimum standard for metadata content
  - This minimum standard is exceeded wherever possible by compiling all available information into data documentation

Metadata Enhancement
- Parameter Descriptions
  - Parameter labels used by scientists are often ambiguous, parochial or obscure
    - For example, water column temperature is often labelled as ‘temp’
    - ‘temp’ could also be air temperature or a temporary variable to be thrown away
  - Measurand label semantics are surprisingly volatile – they usually get buried in a notebook or committed to memory and then forgotten

Metadata Enhancement
- Parameter Descriptions
  - BODC maintains a parameter usage vocabulary with over 17,400 entries systematically describing measured phenomena in the oceanographic domain
  - Major investment in the dictionary through the EnParDis project started developing it towards international standard status
  - Term definitions are now automated concatenations of semantic model elements, each constrained by a controlled vocabulary
  - All originator parameter labelling is replaced by terms from this dictionary

Achievements
- Context
  - BODC specialises in observational oceanographic data
  - This is typically low volume, with datasets usually sub-gigabyte
  - However, it is extremely diverse (hundreds or even thousands of parameters per dataset) and is useless without comprehensive metadata
  - Days of work at sea frequently return only a handful of data values
  - Every number therefore has to be considered as a valuable commodity

Achievements
- Data available for recycling and comparative studies include:
  - Over 1,500,000 chemical and biological measurements on water samples covering over 5,000 different parameters dating back to 1988
  - Over 60,000 profiles of temperature and salinity dating back to 1968
  - Over 7,000 site years of sea level data dating back to 1842
  - Over 6,000 current meter records
    dating back to 1967

Plans for the Future
- Problems
  - The BODC content model was designed in the late 1970s and whilst adequate it is ‘light’ by today’s standards
  - Additional information is available in text (XHTML) documents, but software agent access is limited to free-text searching
  - The desire for data interoperability with other organisations is frustrated by technical obstacles such as parameter description diversity

Plans for the Future
- Solutions
  - Augment the BODC metadata model to the standard required to populate the NERC DataGrid metadata model
  - Migrate from plaintext documentation to structured metadata constrained by standardised vocabularies
  - Further develop the parameter descriptions to address issues such as synonyms
  - Develop parameter mapping ontologies to support database interoperability and database federation

Conclusions
- Data curation has been high on the oceanographic agenda for over 40 years
- Data curation is high on NERC’s agenda, producing an enclave where preservation of data is the rule rather than the exception
- BODC has made a significant contribution to the curation of UK oceanographic data
- We aim to continue and to further raise our standards of service in the future
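
Illustrative sketch (not part of the original slides): the replacement of originator parameter labels by controlled vocabulary terms, and the handling of synonyms mentioned under Plans for the Future, might look roughly like the Python below. This is a sketch under stated assumptions and is not BODC's implementation. The vocabulary codes, definitions, synonym table and function names are all hypothetical and are not taken from the BODC parameter dictionary. An ambiguous label such as ‘temp’ is deliberately left unresolved so that it can be pursued with the originator rather than guessed.

```python
# Hypothetical controlled vocabulary. Codes and definitions are invented for
# illustration and are NOT entries from the BODC parameter dictionary.
VOCABULARY = {
    "WATER_TEMP": "Temperature of the water column",
    "AIR_TEMP": "Temperature of the atmosphere",
    "SALINITY": "Salinity of the water column",
}

# Hypothetical synonym table. Each originator label lists every vocabulary
# code it could plausibly mean; more than one code marks an ambiguous label.
SYNONYMS = {
    "temp": ["WATER_TEMP", "AIR_TEMP"],
    "sea temperature": ["WATER_TEMP"],
    "air temp": ["AIR_TEMP"],
    "sal": ["SALINITY"],
    "salinity": ["SALINITY"],
}


def normalise(label):
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(label.strip().lower().split())


def map_label(label):
    """Return (code, candidates) for one originator label.

    code is None unless exactly one vocabulary term matches.
    """
    candidates = SYNONYMS.get(normalise(label), [])
    if len(candidates) == 1:
        return candidates[0], candidates
    return None, candidates


def map_dataset(labels):
    """Split labels into those mapped to one term and those needing follow-up."""
    mapped, unresolved = {}, {}
    for label in labels:
        code, candidates = map_label(label)
        if code is None:
            unresolved[label] = candidates
        else:
            mapped[label] = code
    return mapped, unresolved


if __name__ == "__main__":
    mapped, unresolved = map_dataset(["Sal", " Sea  Temperature ", "temp", "chl"])
    for label, code in mapped.items():
        print(repr(label), "->", code, "(" + VOCABULARY[code] + ")")
    for label, candidates in unresolved.items():
        print(repr(label), "needs follow-up; candidates:", candidates)
```

Returning the candidate list alongside the result keeps ambiguous labels (several candidates) distinct from unknown ones (no candidates), which call for different follow-up.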