Adding Value to Oceanographic Data at the British Oceanographic Data Centre

Roy Lowry, Lesley Rickards and Juan Brown
British Oceanographic Data Centre
PV2005 Conference, Edinburgh
Presentation Overview
• The oceanographic data management culture
• Physical security of data at BODC
• Adding value to data through format harmonisation, quality assurance and accessibility
• Adding value to data through metadata enhancement
• Achievements
• Some plans for further enhancement
• Conclusions
Culture
• The marine domain has no respect for national boundaries
• Oceanographic sampling is sparse
  • Research vessels and moorings are small dots on a map covering about 71% of the Earth’s surface
  • Satellites give global coverage but only of the surface waters
• Oceanographic measurements are extremely expensive to collect and unrepeatable as the environment is constantly changing
• These factors combine to make sharing data an attractive proposition
Culture
• Consequently, there is a long history of national and international oceanographic data management and curation
• The Intergovernmental Oceanographic Commission (IOC) founded the International Oceanographic Data and Information Exchange (IODE) network in 1961
Culture
• The IODE system forms a worldwide, service-oriented network of
  • DNAs (Designated National Agencies)
  • NODCs (National Oceanographic Data Centres), RNODCs (Responsible National Oceanographic Data Centres)
  • WDCs (World Data Centres – Oceanography)
• During the past 40 years, IOC Member States have established over 60 oceanographic data centres in as many countries
• This network has been able to collect, control the quality of, and archive millions of ocean observations, and to make the data available to Member States.
Culture
• BODC is the UK’s NODC in the IODE network
• Formed as the Marine Information Advisory Service Data Banking Section in 1979
• Established as BODC in 1989
Culture
• Data curation is high on the agenda in other environmental science domains in the UK
• BODC is one part of a network of designated data centres maintained by the Natural Environment Research Council (NERC)
Culture
• BODC is NERC’s designated data centre for marine science
• Responsible for assuring the long-term stewardship of NERC-funded oceanographic data
• Currently has 35 staff hosted by the Proudman Oceanographic Laboratory on the Liverpool University campus
Physical Security
Three parts to BODC’s physical security strategy for data
• Accession system
• Backup system
• Media preservation
Accession System
• Incoming data transferred to BODC Archive
• Files copied as is with lowest common denominator (usually flat ASCII) conversions where appropriate
• Metadata record describing what was submitted, when and from whom
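
The accession metadata record can be pictured as a small structured record. The Python sketch below is illustrative only: the class name `AccessionRecord`, its field names, the identifier scheme and the example values are all assumptions, not BODC's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass(frozen=True)
class AccessionRecord:
    """Hypothetical accession record: what was submitted, when, and from whom."""
    accession_id: str   # assumed identifier scheme
    received_on: date   # when the data arrived
    submitted_by: str   # from whom
    description: str    # what was submitted
    original_files: List[str] = field(default_factory=list)  # files copied as is
    ascii_copies: List[str] = field(default_factory=list)    # flat ASCII conversions, if any


# Invented example values, for illustration only.
record = AccessionRecord(
    accession_id="ACC-0001",
    received_on=date(2005, 11, 21),
    submitted_by="Example Laboratory",
    description="CTD profiles from an example cruise",
    original_files=["ctd_cast01.bin"],
    ascii_copies=["ctd_cast01.txt"],
)
```

Keeping the original file list separate from the converted copies reflects the "copied as is" principle: the conversion supplements the submission and never replaces it.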
Backup System
• Comprehensive backup of file systems and Oracle database to DLT
• Managed combination of full and incremental backups
• Procedures to guard against backup media degradation
• Tapes stored in fire safes at two locations on campus and one external location
Media Preservation
BODC have received data on
• Paper (cards, listings and tape)
• Magnetic tape (7-track, 9-track, QIC, DLT)
• Floppy disks of many sorts (8”, 5.25”, 3.5”, 3”)
• CD-ROM and DVD-ROM
• Optical disks
• FTP and e-mail
Media Preservation
• Data from all incoming media are transferred to the BODC archive
• Archive physical media have been
  • 9-track magnetic tape
  • Phase-change optical disk
  • Magneto-optical disk
• Currently use spinning disk (backed up to DLT) and a mirrored tape robot mass store
• Four physical media migrations of exponentially increasing data volumes
Adding Value to Data
• ‘High’ volume data are harmonised into a single format (NetCDF subset) from over 300 formats submitted
• ‘Low’ volume data are integrated into a unified schema in an Oracle database
• Data are visually inspected as graphical presentations by scientifically skilled staff
• Problem data are marked by manipulating quality control flags tagging each data value
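
A minimal Python sketch of per-value flagging follows. The flag codes `"G"` and `"S"` and the `FlaggedValue` class are hypothetical, since the slides do not give BODC's flag scheme, and the simple range test merely stands in for the visual judgement of a skilled operator. The point it illustrates is that a problem value is marked, never altered or deleted.

```python
from dataclasses import dataclass

# Hypothetical flag codes; the real flag scheme is not given in the slides.
GOOD = "G"
SUSPECT = "S"


@dataclass
class FlaggedValue:
    value: float
    flag: str = GOOD


def flag_out_of_range(series, lower, upper):
    """Mark values outside [lower, upper] as suspect, leaving the values untouched."""
    for item in series:
        if not (lower <= item.value <= upper):
            item.flag = SUSPECT
    return series


# Invented temperature readings in degrees Celsius; 99.9 is an obvious spike.
temps = [FlaggedValue(v) for v in (11.2, 11.4, 99.9, 11.3)]
flag_out_of_range(temps, lower=-2.5, upper=40.0)
flags = [t.flag for t in temps]
```

Because the spike keeps its original value alongside its flag, a later user can still decide whether to include or exclude it.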
Adding Value to Data
• What do these procedures achieve?
  • Syntactic harmonisation and data integration ease data synthesis in a poorly standardised data culture
  • Flagging facilitates data recycling
    • Exposes ‘hidden’ data quality issues, facilitating reuse with no reference to the data originator
    • Allows data to be used for purposes other than those for which they were collected; for example, outliers are tolerable when determining means, but not extremes
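
The means-versus-extremes point can be shown with a short Python sketch. The data are invented and the flag codes `"G"` (good) and `"S"` (suspect) are hypothetical; the sketch only shows how a flag lets each user decide, according to purpose, whether a doubtful value matters.

```python
from statistics import mean

# Invented record: 99 good readings of 2.0 m and one flagged spike of 9.7 m.
sea_level_m = [(2.0, "G")] * 99 + [(9.7, "S")]


def values(series, include_suspect):
    """Return the values, optionally dropping those flagged as suspect."""
    return [v for v, flag in series if include_suspect or flag == "G"]


# Over a long record, the mean is only slightly shifted by a single outlier.
mean_all = mean(values(sea_level_m, include_suspect=True))
mean_good = mean(values(sea_level_m, include_suspect=False))

# The maximum, by contrast, is determined entirely by the outlier.
max_all = max(values(sea_level_m, include_suspect=True))
max_good = max(values(sea_level_m, include_suspect=False))
```

A user computing a long-term mean could reasonably ignore the flag, while a user estimating extremes must honour it.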
Adding Value to Data
• Lost (or hidden) data have no value
• NERC’s marine data are available for reuse through BODC to:
  • The UK marine science community
  • The IOC IODE data centre network
  • The European SEA-SEARCH community
• Further enhancement through database federation is under development
  • Across NERC through NERC DataGrid
  • Across European data centres through SeaDataNet
Metadata Enhancement
• BODC’s data stewardship mission is to facilitate data recycling on a decadal timescale with no need for consultation with the originator
• This is essential to provide reliable baselines for the quantification of Global Change
• …and it can only be achieved through metadata enhancement
Metadata Enhancement
Metadata content
• BODC metadata content model specifies minimum requirement for description of oceanographic data
• Model is populated by collation of information from multiple sources, including resolution of contradictory information
• Missing information is vigorously pursued
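
The collation step can be sketched in Python as below. Everything here is an assumption made for illustration: the required field names, the source names and the example values are invented, and in practice contradictions are resolved by a person rather than by code. The sketch only shows how agreed values, contradictions and gaps might be separated.

```python
# Hypothetical minimum content; the real content model is not given in the slides.
REQUIRED_FIELDS = ("platform", "start_date", "instrument")


def collate(sources):
    """Merge metadata from several sources.

    Returns (merged, conflicts, missing): values on which all sources agree,
    fields where sources disagree (left for a person to resolve), and
    required fields that no source supplied.
    """
    seen = {}
    for name, metadata in sources.items():
        for key, value in metadata.items():
            # Record which sources gave which value for each field.
            seen.setdefault(key, {}).setdefault(value, []).append(name)

    merged = {k: next(iter(v)) for k, v in seen.items() if len(v) == 1}
    conflicts = {k: v for k, v in seen.items() if len(v) > 1}
    missing = [f for f in REQUIRED_FIELDS if f not in seen]
    return merged, conflicts, missing


# Invented example: two sources disagree on the start date, none names the instrument.
sources = {
    "cruise_report": {"platform": "RRS Example", "start_date": "1995-03-02"},
    "data_file_header": {"platform": "RRS Example", "start_date": "1995-03-03"},
}
merged, conflicts, missing = collate(sources)
```

Here `conflicts` records which source gave which value, and `missing` lists the information that still has to be pursued.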
Metadata Enhancement
Metadata content
• Result is a guaranteed minimum standard for metadata content
• This minimum standard is exceeded wherever possible by compiling all available information into data documentation
Metadata Enhancement
Parameter Descriptions
• Parameter labels used by scientists are often ambiguous, parochial or obscure. For example, water column temperature is often labelled as ‘temp’
  • ‘temp’ could also mean air temperature or a temporary variable to be thrown away
• Measurand label semantics are surprisingly volatile: they usually get buried in a notebook or committed to memory and then forgotten
Metadata Enhancement
• Parameter Descriptions
  • BODC maintains a parameter usage vocabulary with over 17,400 entries systematically describing measured phenomena in the oceanographic domain
  • Major investment in the dictionary through the EnParDis project started developing it towards international standard status
  • Term definitions are now automated concatenations of semantic model elements, each constrained by a controlled vocabulary
  • All originator parameter labelling is replaced by terms from this dictionary
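
The idea of building term definitions by concatenating vocabulary-constrained elements can be sketched in Python. The element names (`property`, `matrix`, `method`), the vocabulary contents, the dictionary code `TEMP_WC_CTD` and the definition template are all hypothetical; the slides do not describe the actual BODC semantic model.

```python
# Hypothetical controlled vocabularies, one per semantic model element.
VOCAB = {
    "property": {"Temperature", "Salinity"},
    "matrix": {"the water body", "the atmosphere"},
    "method": {"CTD", "reversing thermometer"},
}


def build_definition(property_, matrix, method):
    """Concatenate vocabulary-checked elements into a parameter definition."""
    for element, term in (("property", property_), ("matrix", matrix), ("method", method)):
        if term not in VOCAB[element]:
            raise ValueError(f"{term!r} is not in the {element} vocabulary")
    return f"{property_} of {matrix} by {method}"


# Hypothetical dictionary code keyed to a generated definition.
DICTIONARY = {
    "TEMP_WC_CTD": build_definition("Temperature", "the water body", "CTD"),
}

# Originator labels are replaced by dictionary codes. The mapping is a human
# decision recorded per dataset, because 'temp' alone is ambiguous.
originator_to_code = {"temp": "TEMP_WC_CTD"}
definition = DICTIONARY[originator_to_code["temp"]]
```

Generating definitions from constrained elements keeps the wording uniform across thousands of entries and rejects any element that is not in its vocabulary.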
Achievements
• Context
  • BODC specialises in observational oceanographic data
  • This is typically low volume, with datasets usually sub-gigabyte
  • However, it is extremely diverse (hundreds or even thousands of parameters per dataset) and is useless without comprehensive metadata
  • Days of work at sea frequently return only a handful of data values
  • Every number therefore has to be considered as a valuable commodity
Achievements
• Data available for recycling and comparative studies include:
  • Over 1,500,000 chemical and biological measurements on water samples covering over 5,000 different parameters dating back to 1988
  • Over 60,000 profiles of temperature and salinity dating back to 1968
  • Over 7,000 site years of sea level data dating back to 1842
  • Over 6,000 current meter records dating back to 1967
Plans for the Future
• Problems
  • The BODC content model was designed in the late 1970s and, whilst adequate, it is ‘light’ by today’s standards
  • Additional information is available in text (XHTML) documents, but software agent access is limited to free-text searching
  • The desire for data interoperability with other organisations is frustrated by technical obstacles such as parameter description diversity
Plans for the Future
• Solutions
  • Augment the BODC metadata model to the standard required to populate the NERC DataGrid metadata model
  • Migrate from plaintext documentation to structured metadata constrained by standardised vocabularies
  • Further develop the parameter descriptions to address issues such as synonyms
  • Develop parameter mapping ontologies to support database interoperability and database federation
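
As a rough illustration of what parameter mapping is meant to achieve, the Python sketch below maps local codes from two data centres to a shared concept. It is a flat lookup table, not an ontology, and the centre names, codes and concept labels are all invented for the example.

```python
# Hypothetical local codes from two data centres, each mapped to a shared concept.
MAPPINGS = {
    ("centre_a", "TEMP_WC_CTD"): "sea_water_temperature",
    ("centre_b", "T_CTD"): "sea_water_temperature",
    ("centre_b", "PSAL"): "sea_water_salinity",
}


def equivalent_codes(centre, code):
    """Return (centre, code) pairs elsewhere that map to the same shared concept."""
    concept = MAPPINGS.get((centre, code))
    if concept is None:
        return []
    return sorted(
        key for key, mapped in MAPPINGS.items()
        if mapped == concept and key != (centre, code)
    )


matches = equivalent_codes("centre_a", "TEMP_WC_CTD")
```

A federated query for temperature could use such a mapping to find the matching parameter at each participating centre, whatever label that centre uses locally.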
Conclusions
• Data curation has been high on the oceanographic agenda for over 40 years
• Data curation is high on NERC’s agenda, producing an enclave where preservation of data is the rule rather than the exception
• BODC has made a significant contribution to the curation of UK oceanographic data
• We aim to continue and to further raise our standards of service in the future