PAWN: A Novel Ingestion Workflow Technology for Scientific Data
Mike Smorul, Joseph JaJa, Yang Wang, Mike McGann, and Fritz McCall
Overall Principles
• Distributed, secure ingestion
• Use of web/grid technologies – platform independent
• Minimal client-side requirements
• Ease of integration with data grid systems
• Designed to satisfy data integrity requirements of scientific collections and digital preservation

Producer
[Diagram labels: Producer data suppliers; Producer Management Interface; Management Server; Data Grid Gateway]
• Provides data to a data grid based on a prior agreement.
• Consists of a management/metadata server and an ingestion client.
• Provides initial arrangement, context, and metadata.
Data Grid – Receiving
[Diagram labels: Producer 1, Producer 2, ..., Producer n; Scheduler; Bitstream Validation Service; Data Grid]
• Receives data from a Producer.
• Validates bitstreams and metadata, and sends an acknowledgement to the Producer.
• Arranges data into collections and specifies optional publishing and preservation policy.
• Publishes bitstreams into the data grid.

Data Grid – Long-Term Stewardship
• Implemented using grid technologies.
• Uses the existing prototype NARA/UMD/SDSC site.
• Automated replication and integrity checking.
• Enforces access control and preservation policy.
Ingestion Workflow
1. Negotiate Submission Agreement.
2. Workflow Initialization and Submission Information Packet (SIP) creation.
3. Transfer of SIPs to Data Grid site.
4. Validation of SIP transfer.
5. Organization of data into collections and transfer into Data Grid.
Submission Agreement
• Create a machine-actionable set of rules describing items.
• The final Submission Agreement is composed of:
  • A METS document for application defaults
  • A METS Constraint document to limit the METS form to the submission parameters
METS Overview
• Provides a framework for linking the structural organization of objects with metadata.
• Using XML namespaces, metadata from various XML schemas can be attached to objects.
  • E.g., Dublin Core, FGDC, etc.
• Extensible for more complex metadata.
• http://www.loc.gov/standards/mets/
Sample METS Document
[Slide callouts: "Metadata Linking"; "Structural Organization" (next to the structMap)]
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<mets xmlns="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/TR/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
  <metsHdr>
    <agent ROLE="CREATOR">
      <name>toaster@hostname</name>
    </agent>
  </metsHdr>
  <fileSec>
    <fileGrp>
      <file ID="5" MIMETYPE="application/octet-stream" SIZE="67624" CREATED="2002-08-21T15:36:05"
            CHECKSUM="2CE7D79E40BD6C6A65A6684B6FD3D08C" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="/nfshomes/toaster/iscsi/GFS-contrib-5.1.tar.gz"/>
      </file>
    </fileGrp>
    <fileGrp>
      <file ID="7" MIMETYPE="application/octet-stream" SIZE="2517" CREATED="2002-09-06T17:06:07"
            CHECKSUM="767185AA022180E701324C592E1C36E3" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="/nfshomes/toaster/iscsi/gfs.out"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div ID="3" LABEL="iscsi">
      <fptr FILEID="5"/>
      <fptr FILEID="7"/>
    </div>
  </structMap>
</mets>
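As a rough illustration of how a client or receiving server might read such a document, the sketch below resolves each fptr in the structMap to its file entry. It is not PAWN code: the function name files_by_div and the trimmed SAMPLE_METS string are assumptions made for this example, and only Python's standard xml.etree.ElementTree module is used.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/TR/xlink"  # namespace URI as used in the sample document

# Trimmed copy of the sample document (header and CREATED attributes omitted).
SAMPLE_METS = """\
<mets xmlns="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/TR/xlink">
  <fileSec>
    <fileGrp>
      <file ID="5" MIMETYPE="application/octet-stream" SIZE="67624"
            CHECKSUM="2CE7D79E40BD6C6A65A6684B6FD3D08C" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple"
                xlink:href="/nfshomes/toaster/iscsi/GFS-contrib-5.1.tar.gz"/>
      </file>
    </fileGrp>
    <fileGrp>
      <file ID="7" MIMETYPE="application/octet-stream" SIZE="2517"
            CHECKSUM="767185AA022180E701324C592E1C36E3" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple"
                xlink:href="/nfshomes/toaster/iscsi/gfs.out"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div ID="3" LABEL="iscsi">
      <fptr FILEID="5"/>
      <fptr FILEID="7"/>
    </div>
  </structMap>
</mets>
"""


def files_by_div(mets_xml):
    """Map each structMap div LABEL to the file entries its fptr elements reference."""
    root = ET.fromstring(mets_xml)

    # Index every <file> in the fileSec by its ID attribute.
    files = {}
    for f in root.iter(f"{{{METS}}}file"):
        flocat = f.find(f"{{{METS}}}FLocat")
        files[f.get("ID")] = {
            "href": flocat.get(f"{{{XLINK}}}href") if flocat is not None else None,
            "size": int(f.get("SIZE")),
            "checksum": f.get("CHECKSUM"),
            "checksum_type": f.get("CHECKSUMTYPE"),
        }

    # Resolve the file pointers that sit directly under each div.
    result = {}
    for div in root.iter(f"{{{METS}}}div"):
        pointers = div.findall(f"{{{METS}}}fptr")
        result[div.get("LABEL")] = [files[p.get("FILEID")] for p in pointers]
    return result


if __name__ == "__main__":
    for label, entries in files_by_div(SAMPLE_METS).items():
        for entry in entries:
            print(label, entry["href"], entry["checksum_type"], entry["checksum"])
```

The fileSec carries the per-file metadata (size, checksum, location), while the structMap only points at it by ID; the lookup table in the sketch mirrors that separation.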
Why METS Constraints?
• METS doesn't provide a way to create machine-interpretable rules describing a collection.
  • E.g., allow only TIFF files in certain structural areas.
• METS profiles allow for developer-interpretable rules, not machine-interpretable ones.
METS Constraints
• Allows structural, metadata, and file constraints.
• Structural constraints:
  • Restrict child divs and restrict pointers to divs, files, and other METS documents.
• File constraints:
  • Restrict files by MIME type or validation tests.
• Metadata constraints:
  • Restrict the allowed metadata schemas.
METS Constraints - Template
<?xml version="1.0" encoding="UTF-8"?>
<mets …. >
  <!-- validation test section, referenced in the constraints document -->
  <amdSec>
    <techMD ID="xmltest">
      <mdWrap MDTYPE="OTHER">
        <xmlData>
          <val:validation NAME="xmltext" DESCRIPTION="Test for valid xml documents" MIMETYPE="text/xml">
            <val:valgrp required="true">
              <val:valtest name="gif" required="true">
                <val:description>generic gif test for any file</val:description>
              </val:valtest>
            </val:valgrp>
          </val:validation>
        </xmlData>
      </mdWrap>
    </techMD>
  </amdSec>
  <!-- base div structure to use for all clients -->
  <structMap>
    <div ID="ID1" LABEL="Research &amp; Development Records">
      <div ID="ID1.1" LABEL="Research &amp; Development Project Records">
        <div ID="ID1.1.1" LABEL="R&amp;D Project Case Files"/>
        <div ID="ID1.1.2" LABEL="R&amp;D Record Series"/>
      </div>
    </div>
  </structMap>
</mets>
METS Constraints - Rules
<?xml version="1.0" encoding="UTF-8"?>
<metsconstraint …>
  <filegrp ID="FILE1" NAME="Text Document">
    <!-- Files can be identified either by MIMETYPE, or TESTID in skeleton METS document or both -->
    <file NAME="html document" MIMETYPE="text/html"/>
    <file TESTID="xmltext" NAME="xml document" MIMETYPE="text/xml"/>
  </filegrp>
  <!-- Apply rules to predefined div's and link to required file/metadata tests above -->
  <divrule DIVID="ID1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
  <divrule DIVID="ID1.1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
  <divrule DIVID="ID1.1.1" RESTRICTMPTR="true">
    <filetype FILEGROUPID="FILE1"/>
  </divrule>
  <divrule DIVID="ID1.1.2" RESTRICTMPTR="true"/>
</metsconstraint>
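To make the idea of machine-interpretable rules concrete, here is a minimal sketch of how a rules document like the one above could be evaluated. It rests on several assumptions that the slides do not confirm: that RESTRICTFTPR="true" means a div may hold no file pointers, that filetype children list the only file groups permitted in a div, and that a div without such children accepts any file. The functions load_rules and file_allowed are hypothetical names, and the elided attributes on the root element are simply left out.

```python
import xml.etree.ElementTree as ET

# Copy of the rules document with the elided root attributes omitted (assumption:
# the elements are read without a namespace for the purposes of this sketch).
CONSTRAINTS = """\
<metsconstraint>
  <filegrp ID="FILE1" NAME="Text Document">
    <file NAME="html document" MIMETYPE="text/html"/>
    <file TESTID="xmltext" NAME="xml document" MIMETYPE="text/xml"/>
  </filegrp>
  <divrule DIVID="ID1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
  <divrule DIVID="ID1.1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
  <divrule DIVID="ID1.1.1" RESTRICTMPTR="true">
    <filetype FILEGROUPID="FILE1"/>
  </divrule>
  <divrule DIVID="ID1.1.2" RESTRICTMPTR="true"/>
</metsconstraint>
"""


def load_rules(xml_text):
    """Return {div ID: {"files_forbidden": bool, "allowed_mime": set of MIME types}}."""
    root = ET.fromstring(xml_text)

    # Each file group contributes the MIME types of the files it lists.
    groups = {
        grp.get("ID"): {f.get("MIMETYPE") for f in grp.findall("file")}
        for grp in root.findall("filegrp")
    }

    rules = {}
    for rule in root.findall("divrule"):
        allowed = set()
        for filetype in rule.findall("filetype"):
            allowed |= groups[filetype.get("FILEGROUPID")]
        rules[rule.get("DIVID")] = {
            # Assumed meaning: no file pointers may be attached to this div.
            "files_forbidden": rule.get("RESTRICTFTPR") == "true",
            "allowed_mime": allowed,
        }
    return rules


def file_allowed(rules, div_id, mime_type):
    """Decide whether a file of the given MIME type may be placed in the div."""
    rule = rules.get(div_id)
    if rule is None:
        return True  # assumption: divs without a rule are unconstrained
    if rule["files_forbidden"]:
        return False
    # Assumption: an empty allow-list means any file type is accepted.
    return not rule["allowed_mime"] or mime_type in rule["allowed_mime"]


if __name__ == "__main__":
    rules = load_rules(CONSTRAINTS)
    for div_id, mime in [("ID1", "text/html"), ("ID1.1.1", "text/html"), ("ID1.1.1", "image/tiff")]:
        print(div_id, mime, file_allowed(rules, div_id, mime))
```

The TESTID link to a validation test in the template document is ignored here; a fuller check would also run the referenced test against the file.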
Initialize Ingestion Workflow
• Instantiate Producer management server to track registered objects.
• Establish a working trust relationship with the Data Grid.
• Issue clients.

Create SIP
• Each client registers objects stored locally with the producer management server.
  • Register file types, validation tests, etc.
  • Client follows the rules in the Submission Agreement.
• Producer-wide agents can arrange registered objects to give a broader context.
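A sketch of what a client might record when registering a local object: it builds a METS-style file entry carrying the size, MD5 checksum, and location, mirroring the attributes in the sample METS document. The function describe_file is a hypothetical helper, not part of PAWN; it takes the file content as bytes to stay self-contained, and it omits the METS namespace on the elements for brevity.

```python
import hashlib
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/TR/xlink"  # namespace URI as used in the sample document
ET.register_namespace("xlink", XLINK)


def describe_file(file_id, path, data, mime_type="application/octet-stream"):
    """Build a METS-style <file> element describing one locally stored object."""
    elem = ET.Element("file", {
        "ID": file_id,
        "MIMETYPE": mime_type,
        "SIZE": str(len(data)),
        # Upper-case hex digest, matching the style of the sample document.
        "CHECKSUM": hashlib.md5(data).hexdigest().upper(),
        "CHECKSUMTYPE": "MD5",
    })
    ET.SubElement(elem, "FLocat", {
        "LOCTYPE": "URL",
        f"{{{XLINK}}}type": "simple",
        f"{{{XLINK}}}href": path,
    })
    return elem


if __name__ == "__main__":
    entry = describe_file("7", "/nfshomes/toaster/iscsi/gfs.out", b"example content")
    print(ET.tostring(entry, encoding="unicode"))
```

Recording the checksum at registration time is what later lets the receiving side confirm that the bitstream it received is the one the producer registered.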
SIP Example
OAIS Information packet
Content Information
· Physical Object
· Representation Information
Preservation Description Information
· Provenance
· Fixity
· Reference
· Context
Descriptive Information
Packaging Information

The submission packet is designed to contain a self-describing set of metadata that is self-validating.
Client Interface
Transfer SIP to Data Grid
• Retrieve previously registered SIP from producer management server.
• Authenticate to data grid.
• Update tracking information with new location of files in data grid.
• Data Grid acknowledges transfer completion to producer management server.
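The tracking steps above can be pictured as a small state record kept by the producer management server. The sketch below is purely illustrative: the class names, the three states, and the grid:// location string are assumptions for this example and do not describe PAWN's actual database schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    REGISTERED = "registered"      # object known to the management server
    TRANSFERRED = "transferred"    # client reports the object was sent to the grid
    ACKNOWLEDGED = "acknowledged"  # data grid confirmed transfer completion


@dataclass
class TrackedObject:
    file_id: str
    location: str
    status: Status = Status.REGISTERED
    previous_locations: list = field(default_factory=list)

    def mark_transferred(self, grid_location):
        """Record the object's new location after it is sent to the data grid."""
        if self.status is not Status.REGISTERED:
            raise ValueError(f"cannot transfer an object in state {self.status.value}")
        self.previous_locations.append(self.location)
        self.location = grid_location
        self.status = Status.TRANSFERRED

    def acknowledge(self):
        """Record the data grid's acknowledgement of the completed transfer."""
        if self.status is not Status.TRANSFERRED:
            raise ValueError(f"cannot acknowledge an object in state {self.status.value}")
        self.status = Status.ACKNOWLEDGED


if __name__ == "__main__":
    obj = TrackedObject("7", "/nfshomes/toaster/iscsi/gfs.out")
    obj.mark_transferred("grid://example-site/iscsi/gfs.out")  # made-up location
    obj.acknowledge()
    print(obj)
```

Rejecting out-of-order transitions keeps the tracking data consistent if an acknowledgement arrives for an object that was never reported as transferred.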

Validation of SIP transfer
• Check incoming SIP against constraints documents.
• Ensure object integrity by verifying checksums/cryptographic digests.
• Validate bitstreams against necessary tests.
• Record validation results.

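The integrity check in this step can be sketched as follows: recompute the digest of the received bitstream and compare it with the CHECKSUM and CHECKSUMTYPE recorded in the SIP. The names ValidationResult and verify_fixity are assumptions for this example rather than PAWN interfaces; the digest is computed with Python's standard hashlib module.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ValidationResult:
    file_id: str
    ok: bool
    detail: str


def verify_fixity(file_id, data, expected_checksum, checksum_type="MD5", expected_size=None):
    """Compare a received bitstream against the size and digest recorded in the SIP."""
    if expected_size is not None and len(data) != expected_size:
        return ValidationResult(
            file_id, False, f"size mismatch: got {len(data)}, expected {expected_size}"
        )

    # Map names such as "MD5" or "SHA-1" onto hashlib algorithm names.
    algorithm = checksum_type.replace("-", "").lower()
    try:
        digest = hashlib.new(algorithm, data).hexdigest()
    except ValueError:
        return ValidationResult(file_id, False, f"unsupported checksum type: {checksum_type}")

    if digest != expected_checksum.lower():
        return ValidationResult(file_id, False, f"{checksum_type} mismatch")
    return ValidationResult(file_id, True, f"{checksum_type} verified")


if __name__ == "__main__":
    # Collected results stand in for the "record validation results" step.
    results = [
        verify_fixity("a", b"hello", "5D41402ABC4B2A76B9719D911017C592"),
        verify_fixity("b", b"hellp", "5D41402ABC4B2A76B9719D911017C592"),
    ]
    for result in results:
        print(result)
```

Returning a result object instead of raising keeps every outcome, including failures, available for the validation record.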
Final transfer to Data Grid
• Transfer objects to Data Grid.
• Update tracking information with new location in Data Grid.
• Transfer log of data activity into data grid.
• Return accept/reject messages to producer metadata server.

Component Overview
[Diagram. Components: Producer data suppliers; Producer Management Interface; Data Grid Management Interface; Bitstream Validation Service; Data Grid. Labeled interactions: CRL check; success/failure notification of ingestion; metadata registration/retrieval; SIP transfer.]
Producer Components
• Database to track registered objects
• Certificate Authority management
  • Web service for receiving-side security callback
• Management server supplies web service interfaces to ingestion clients and management operations.
• Clients are designed to be standalone, with security certificates issued by the producer.

Receiving Components
• Receiving servers validate connecting clients and validate SIPs.
• Validation services are simple web service calls.
• Abstract I/O layer into data grid.

Recap
• Implemented using web technologies
• Architecture independent
• XML-based metadata
  • METS-based SIPs
• Add-on constraints describing Submission Agreement
• Target release dates:
  • Beta: April
  • Release: June/July
More Information
• ADAPT website
  • http://www.umiacs.umd.edu/research/adapt
• Papers
  • Scalable, Reliable Marshalling and Organization of Distributed Large Scale Data Onto Enterprise Storage Environments
  • PAWN: Producer - Archive Workflow Network in Support of Digital Preservation