From HARM to xxx The ESA’s Proposal for a Standard Archive Format for Europe Gian Maria Pinna, Vincenzo Beruti ESA Stephane Mbaye, Mathias Moucha Gael Consultant Valter Spaventa, Davide Castellazzi ACS S.p.A. PV-2005 21-23 November 2005 LTA Strategy and Rationale The mandate for ESA’s EO missions is to maintain the archive for at least 10 years after the end of the mission The management of its vast amount of heterogeneous datasets poses problems for their long-term preservation The ESA’s EO Ground Segment Department has recognized since many years the need to establish a clear strategy for the management of the long-term archives. LTA Strategy and Rationale This strategy has as its main goals: archive maintenance in order to ensure data integrity archive and data management rationalization data conversion to new technologies in order to reduce cost of operations enhancement of data access standardization of the format in which the datasets are preserved Achievements aim at simplifying the exchange and interoperability of data between ESA and external operators Long-Term Preservation Challenge The data to be preserved for the long term require a special attention, which reflects in costly operations for their exploitation and maintenance: datasets have to be regularly converted into new media technology, to prevent the problems created by their obsolescence being the long term archive normally based on datasets with a very low processing level, higher level products have to be generated by processing systems in the case of distribution of the data holdings directly from the archive, it is normal practice that they are converted or reformatted into a format more oriented to the end-user utilization it is common to have to extract from the archive a portion of the dataset (subsetting) to create a “child” product more and more the data are distributed to the end-users, and exchanged among data holders, in electronic format over network infrastructures (private Intranets, public Internet, academic networks, etc.) Cost of LTA Operations One of the reasons that contribute to the high operations and maintenance costs of the long term archives is the excessive proliferation of diverse and heterogeneous data formats, caused by mainly three reasons: the lack of an agreed standard in the Earth Observation community, reason for which the formats tended to be specific for the sensor(s) each missions carried on board legacies from old ground segments architectures, which tended not to reuse previously developed elements the non-mature status until recently of the information technologies and standards used to describe and package the data, preventing the creation of a unique format able to satisfy at the same time the requirements for the long-term data preservation and their handling in the processing centres HARM (Historical Archive Rationalization and Management) HARM aims at achieving the following main goals: analysis and definition of a new ESA standard long term archive data format SAFE conversion of archived data from the original format into the new format and transcription onto the ESA’s automated long term archive, based on the ESA’s MMFI (Multi Mission Facility Infrastructure) architecture rationalization of the archives by centralization of historical data (non-active mission) and archive of on-going missions into specialized centres optimization of the archives, reducing operator intervention for system maintenance and improvement of archive capacity through removal of redundant data (whenever applicable and due to orbit overlap, duplicated transcriptions etc.) generation of the related metadata and inventory information, to be ingested in the centralized ESA catalogue HARM activities The operational plan for the HARM project is based on four logical phases, some of them overlapping loading of the data from loose media into the AMS in the original data format, in order to avoid waiting for the SAFE format consolidation, minimize the processing applied and maximize the usage of existing hardware (media drivers and stackers) in order to reduce as much as possible the operator’s intervention conversion of the datasets into the SAFE format, using the AMS library (fully automated operations). This phase can be contemporary and concurrent with phase 1 concentration of homogeneous data (same mission, sensor and mode) into specific locations, in SAFE format orbits stitching at the target archives of the datasets already fully converted in SAFE format, with regeneration of browse and metadata for the ESA’s central catalogue. ESA’s goal is to make SAFE: Compliant with ISO 14721:2003 OAIS (Open Archival Information System) reference model Compliant with emerging CCSDS/ISO XFDU (XML Formatted Data Units) packaging format Designed to act as a standard format for archiving and conveying data within ESA Earth Observation archiving facilities Although the primarily goal of SAFE is to handle EO data with processing levels close to the usually called “level 0”, no a-priori limitations exist for the packaging of higher level products SAFE Information Model Product Information Acquisition Period 0..1 Platform/Sensor Identification 1 Product History 1..* SAFE Product Product Data/Metadata Metadata Orbital Information 0..1 EO Data * 1..* Geolocation Information 0..1 Quality/Fixity Information 0..1 Representation Information 1 SAFE Product Information Acquisition period: the acquisition period metadata that provide the time extents of all the data contained in the SAFE product Platform/Sensor identification: the platform identifies the system (satellite/aircraft) that acquired the EO data wrapped by the SAFE product Product History: a processing log collecting the historical information dedicated to the maintenance and the traceability of the product SAFE Product Data/Metadata Orbital information: the reference to the trajectory of the platform that acquired the data. This information may locate one or several orbit paths, the corresponding cycle, track, etc. Geolocation information: the information locating the product footprint on the Earth’s surface, either as a series of tie points or by reference to a world reference system of the acquiring platform. The Geolocation information, may also attach additional information to each localized element, including cloud coverage vote, meteorological information, etc. Quality/Fixity information: information about the quality of an EO dataset. SAFE makes use of techniques (i.e. XPath) that allows the precise location of the corrupted or missing elements up to the bit level Representation information: any data contained in a SAFE product shall be accompanied with its representation information formally and numerically exploitable SAFE Logical Model A SAFE product is a logical tree of “Content Units” forming the so-called “Information Package Map” The root Content Unit has predefined associations to the information applicable to the overall product, i.e. at least the “Acquisition Period”, the “Platform/Sensor Identification” and the “Product History” The structure of the children Content Units is less constrained and depends mainly of the logical view of the wrapped data. In most cases, one Content Unit matches one EO dataset and its accompanying metadata. Several Content Units may, however, share the same metadata. SAFE Logical Model - Acquisition Period - Platform/Sensor Identification - Product History SAFE Product Content Unit SAFE Product Metadata Object 1 * * 1..* SAFE Content Unit 1..* * Data Object 0..1 - Orbital Information Geolocation Information Quality Information Quality/Fixity Information Representation Information SAFE Physical Model Contains: - Information Package Map (Content Units) - Metadata Objects - Acquisition Period - Platform/Sensor Identification - Product History - Orbital Information - Geolocation Information - Quality Information - Quality/Fixity Information Manifest XML file 1 URL SAFE Product URL * Binary or XML files Referenced/external Metadata or Data Objects * XML Schema files Representation information is always provided as external XML Schema documents SAFE Physical Components Manifest file: an XML document conforming the XFDU Manifest file. It contains the definition of the Information Package Map, the wrapped Metadata Objects, the wrapped Data Objects and references to the external files containing the Metadata and Data Objects Binary or XML files: the data or metadata object contents. Currently, only two types of files have been identified, i.e. binary matching MIME octetstream definition and XML documents. Each of these files shall be accompanied with one or more XML Schema document controlling its content XML Schema files: the representation information of the data held by a SAFE product. In order to represent the binary information, SAFE also defines specific markups that annotate the XML Schema documents to provide information on the physical structure (SDF markups). Thanks to these specific annotations, the contents of the binary files are described up to the bit level with a common technique as for XML documents Preservation of Binary Objects Annotated XML Schemas Structured Data File (SDF) markups <xsd:complexType> <xsd:sequence> <xsd:element name="record" type="xs:int" minOccurs="0" maxOccurs="unbounded"> <xsd:annotation> <xsd:appinfo> <sdf:block> <sdf:occurrence query="../header/numRec + 1"/> <sdf:length unit=‘bit’>12</sdf:length> </sdf:block> </xsd:appinfo> </xsd:annotation> </xsd:element> </xsd:sequence> <xsd:complexType/> SDF Markups (1) SDF Markups (2) SAFE Volume Set Identification Volume Title PGSI-GSEG-EOPG-FS-05-0001 Volume 1 Core Specifications PGSI-GSEG-EOPG-FS-05-0002 Volume 2 Recommendation for specializations PGSI-GSEG-EOPG-FS-05-0003 Volume 3 Book 1 ENVISAT ASAR Products PGSI-GSEG-EOPG-FS-05-0004 Volume 3 Book 2 ENVISAT MERIS Products PGSI-GSEG-EOPG-FS-05-0005 Volume 3 Book 3 ENVISAT AATSR Products PGSI-GSEG-EOPG-FS-05-0006 Volume 3 Book 4 ENVISAT DORIS Products PGSI-GSEG-EOPG-FS-05-0007 Volume 3 Book 5 ENVISAT GOMOS Products PGSI-GSEG-EOPG-FS-05-0008 Volume 3 Book 6 ENVISAT MIPAS Products PGSI-GSEG-EOPG-FS-05-0009 Volume 3 Book 7 ENVISAT RA2 Products PGSI-GSEG-EOPG-FS-05-0010 Volume 3 Book 8 ENVISAT MWR Products PGSI-GSEG-EOPG-FS-05-0011 Volume 3 Book 9 ENVISAT SCIAMACHY Products PGSI-GSEG-EOPG-FS-05-0012 Volume 3 Book 10 ENVISAT Auxiliary Data Products SAFE Volume Set Identification Volume Title PGSI-GSEG-EOPG-FS-05-0013 Volume 4 ERS Products PGSI-GSEG-EOPG-FS-05-0014 Volume 5 Landsat Products PGSI-GSEG-EOPG-FS-05-0015 Volume 6 Terra/Aqua MODIS Products PGSI-GSEG-EOPG-FS-05-0016 Volume 7 NOAA AVHRR Products PGSI-GSEG-EOPG-FS-05-0017 Volume 8 SeaStar SeaWiFS Products PGSI-GSEG-EOPG-FS-05-0018 Volume 9 Nimbus CZCS Products PGSI-GSEG-EOPG-FS-05-0019 Volume 10 SPOT HRV(IR) Products PGSI-GSEG-EOPG-FS-05-0020 Volume 11 JERS SAR/OPS Products PGSI-GSEG-EOPG-FS-05-0021 Volume 12 MOS MESSR/VTIR Products PGSI-GSEG-EOPG-FS-05-0022 Volume 13 IRS-P3 MOS Products SAFE I/O API In addition to the Core SAFE specifications and the Specializations, the HARM project is implementing a set of software APIs dedicated to SAFE product users. The main use cases of these software components are: Create a SAFE product Open a SAFE product Remove a SAFE product Move a SAFE product considering or not the referenced objects Validate a SAFE product including or not the external object contents Add/Remove/Browse Content Units of the Information Package Map Add/Remove/Retrieve the Metadata Objects Add/Remove/Retrieve the Data Objects Browse/Select/Extract information from either Binary or XML objects Identify/Sort Quality Information to a specific part of the objects SAFE I/O API XFDU Java API: this API provides the general features common to all XFDU packages. This API has been developed jointly with the SAFE Java API and is presently used in the framework of the activity of the CCSDS Information Packaging and Registries (IPR) Working Group in support to the validation of the XFDU recommendation developed by the group XFDU C++ Wrapper: a C++ wrapper on top of the XFDU Java API. Most of the functionality is preserved from the Java API, apart the capability of browsing the binary contents Data Request Broker (DRB): from which SAFE inherits the capability to access data independently from their formats. DRB supports in particular the SDF markup language used for annotating the XML Schema documents acting as representation information of the SAFE product objects. SAFE Benefits SAFE is mainly devoted to the long-term preservation of data, as its full compliance with the CCSDS/ISO OAIS RM and XFDU standards reveal SAFE permits an easier and more effective migration of data to other future standards SAFE permits an easier reformatting to other formats, including end-user formats for products distribution SAFE will easy the maintenance of the SW that use the data, even historical, thus decreasing the chances of obsolescence and improving their usability SAFE Benefits SAFE being self describing, it greatly facilitates the format maintenance SAFE format documents are built automatically from the XML Schemas, so the maintenance of the SAFE format is more robust with respect to “old-fashion” formats SW that access SAFE-formatted datasets will benefit from the usage of the same SAFE XML Schemas for reading and writing the datasets. This greatly improves the overall cycle of creation/ingestion/quality control/ transformation/distribution of the datasets. http://earth.esa.int/SAFE/ (under construction)